
# Get started with R and RStudio

## Robert Castelo
[robert.castelo@upf.edu](mailto:robert.castelo@upf.edu)
### Dept. of Medicine and Life Sciences
### Universitat Pompeu Fabra

<br>

## Fundamentals of Computational Biology
### BSc on Human Biology
### UPF School of Medicine and Life Sciences
### Academic Year 2024-2025

---

# Setup and background

* To follow these slides you need an installation of
  [R](https://www.r-project.org) and [RStudio](https://posit.co/downloads).
  Install R **first** and, only once R has been installed, install RStudio.  
  &nbsp;&nbsp; 
* The [setup](https://funcompbio.github.io/setup/#r-and-rstudio) page has
  instructions on how to install R **and** RStudio on your system. Once R and
  RStudio are installed, you should be able to start RStudio by double-clicking
  on an icon like the one below.

![](data:image/png;base64,#img/RStudioIcon.png)

---

# Setup and background

* To illustrate the use of R and RStudio, we will use the data files called
  `mostres_analitzades.csv` and `virus_detectats.csv` that were generated
  in the [first practical](https://funcompbio.github.io/practical1) from the
  infection surveillance system of Catalonia
  ([SIVIC](https://sivic.salut.gencat.cat) in its Catalan acronym).  
  &nbsp;&nbsp;
* If you don't have these files, please review that practical and generate them
  again. Once you have obtained those two files, copy them into a fresh new
  directory called `seminar4`.

---

# Starting R and RStudio

* RStudio is the most popular graphical user interface (GUI), or rather the most
  popular
  [integrated development environment (IDE)](https://en.wikipedia.org/wiki/Integrated_development_environment),
  for working **with** R. However, **RStudio is not R, RStudio runs R**. You can
  also run and work with R **without** RStudio.  
  &nbsp;&nbsp;
* If you need to work with R on a remote server, then either that remote server
  runs [RStudio server](https://posit.co/products/open-source/rstudio-server/)
  and you can connect to it through a web browser or, alternatively, you only
  have a text-based connection through a terminal window, in which case you
  **cannot** use RStudio but you can use R on the Unix command line.  
  &nbsp;&nbsp;
* If, for whatever reason, you cannot use RStudio, you can still follow these
  slides, skipping the parts that specifically refer to the GUI of RStudio.

---

# Starting R and RStudio

* The RStudio window is initially divided into three main panes:

.pull-left[
* **R shell / prompt:** where you can interactively type R commands.  
  &nbsp;&nbsp;
* **Environment / history:** where you can browse through the
  objects that are being created and the commands that you have
  typed in the R shell.
]
.pull-right[
![](data:image/png;base64,#img/RStudioFreshStart.png)
]

* **Files / plots / pkgs / help:** where you can navigate through
  the filesystem where RStudio is running and change the working
  directory, visualize plotting output, browse through the loaded
  packages and read help pages.

---

# Starting R and RStudio

* If you cannot start RStudio but you have installed R, you can still start
  R by typing `R` on the Unix shell command line and pressing enter.

* Your terminal window should then be running R and looking similar to the one
  below.

![:scale 70%](data:image/png;base64,#img/Rterminal.png)

---

# Quitting R and RStudio

* To quit R and RStudio you should type the following instruction in the
R shell:
  <pre>
  > q()
  </pre>
* You **should not** type the `>` character, since it corresponds to the R
  prompt and only indicates that the given instruction to the right of that
  character should be typed in the R shell. Normally, after that instruction
  R will ask:
  <pre>
  > q()
  Save workspace image? [y/n/c]:
  </pre>
* If you answer `y` then R will store all the objects you created in a
  hidden file called `.RData` and next time you start R, all those objects will
  be automatically loaded.  
  &nbsp;&nbsp;
* Unless you have a reason to save the workspace when quitting, **you should
  always answer `n` to that question**; answering `c` _cancels_ the quitting
  instruction.  
  &nbsp;&nbsp;
* In RStudio you can also quit R and RStudio by either closing the application
  window or through the `Quit` option in the _File_ or _RStudio_ top-level menu.

---

# R as a calculator

* The R shell can be used directly as a calculator. Type the following
  instructions and figure out what each operator does:
<pre>
> 1+1
> 5-4
> 3*2
> 6/2
> 4%%3
> 2**3
> 2^3
</pre>

---

# R as a calculator

* Type the following and press enter:
<pre>
> 1+
</pre>
* You should have obtained the following output:
<pre>
&gt; 1+
+
</pre>
where the cursor is next to the plus sign (`+`) that has appeared in the
line below. This plus sign indicates that the expression you have written
is incomplete.  
&nbsp;&nbsp;
* This often happens when there is, for instance, a missing
closing parenthesis. In this situation you can do two things: (1) you
complete the instruction or (2) you press the `Esc` key, which will cancel
the instruction. Try cancelling this incomplete sum with the `Esc` key.

---

# RStudio contextual help

* Try to calculate the natural logarithm of 10 by typing:
<pre>
&gt; log(10)
</pre>
  If you are using RStudio, note that when you have typed the name of the
  function `log`, RStudio shows you a popup with contextual help, which
  you can use to choose among functions that have `log` as a prefix in their
  name.

![](data:image/png;base64,#img/RStudioPopUpContextualHelp.png)
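
* As a small sketch of how `log()` relates to other functions offered by the
  contextual help, the calls below compute logarithms in other bases. The
  `base` argument and the functions `log10()`, `log2()` and `exp()` belong to
  base R, but they are not covered elsewhere in these slides. The code is
  shown without the `>` prompt so that it can be pasted directly.

```r
log(10)            # natural logarithm (base e)
log(100, base=10)  # logarithm in base 10
log10(100)         # shortcut for base 10
log2(8)            # shortcut for base 2
exp(log(10))       # exp() is the inverse of log()
```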

---

# Getting and setting the working directory

* Whenever we want to read or write data files in a specific directory, we need
to make sure that the default directory accessed by R is that directory, just
as with the current working directory (CWD) in the Unix filesystem.  
&nbsp;&nbsp;
* To find out the default working directory of R, you should call the `getwd()`
  function in the R shell:
  <pre>
  > getwd()
  </pre>
* If the returned path is not the working directory that we want, we can change
  it with the function `setwd(dir)` where `dir` should be the path that we want
  to set as working directory.  
  &nbsp;&nbsp;
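
* A minimal sketch of getting and setting the working directory could look as
  follows. The directory used here is a hypothetical one, created under R's
  temporary directory only so that the path exists on any system; in practice
  you would give the path to your own folder.

```r
## remember the current working directory to restore it later
old <- getwd()

## hypothetical directory created under R's temporary directory,
## used here only to have a path that exists on any system
newdir <- file.path(tempdir(), "seminar4")
dir.create(newdir, showWarnings=FALSE)

setwd(newdir)         # change the working directory
basename(getwd())     # last component of the new working directory

setwd(old)            # go back to the original working directory
```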

---

# Getting and setting the working directory

* In RStudio, using the _Files_ pane, we can navigate through the file system
  to the directory we want to set as working directory and then click on the
  `More` pull-down menu and select `Set As Working Directory`.

![:scale 50%](data:image/png;base64,#img/RStudioSetWorkingDirectory.png)

* Using the function `setwd()`, or the RStudio _Files_ pane, change the working
  directory to the folder `seminar4` that you should have created at the
  beginning of this document, and where you have downloaded the files
  `mostres_analitzades.csv` and `virus_detectats.csv`.

---

# Reading CSV files

* We can read CSV files in R using the function `read.csv()`. Let's read the CSV
  file `virus_detectats.csv` as follows:

``` r
  > dat <- read.csv("virus_detectats.csv", stringsAsFactors=TRUE)
  ```
* Note that when writing the first letters of the filename, you can
  _autocomplete_ the rest of the filename by pressing the `TAB` key.  
  &nbsp;&nbsp;
* Besides the filename as first argument, we also specified that we want R
  to treat character columns as a special kind of object called _factor_.  
  &nbsp;&nbsp;
* The `read.csv()` function in R is analogous to the `read_csv()` function
  in the
  [Python module _pandas_](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
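
* To see what the argument `stringsAsFactors` changes, the following sketch
  reads the same small CSV file twice. The file contents are made up for
  illustration and written to a temporary file; they are not part of the SIVIC
  data.

```r
## write a small, made-up CSV file into a temporary location
tmp <- tempfile(fileext=".csv")
writeLines(c("virus,sexe,positiu",
             "Grip,Dona,1",
             "Grip,Home,2",
             "VRS,Dona,1"), tmp)

## character columns stay as character vectors
toy_chr <- read.csv(tmp, stringsAsFactors=FALSE)
class(toy_chr$virus)

## character columns are converted to factors
toy_fct <- read.csv(tmp, stringsAsFactors=TRUE)
class(toy_fct$virus)
levels(toy_fct$virus)
```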

---

# Reading CSV files

* The `read.csv()` function returns a `data.frame` object. You can verify it by
  typing

``` r
  > class(dat)
  [1] "data.frame"
  ```
* Figure out the dimensions of this `data.frame` object with the function `dim()`.  
  &nbsp;&nbsp;
* Examine the first and last rows of this `data.frame` object with the functions
  `head()` and `tail()`.  
  &nbsp;&nbsp;
* If you are running RStudio, go to the `Environment` pane,
  click on the small right triangle icon next to the object name and then
  on the object name itself.
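
* The functions above accept any `data.frame` object. As an illustration that
  leaves the exercise on `dat` open, the sketch below applies them to `iris`,
  a data set that ships with R and is unrelated to the seminar data. The
  functions `nrow()`, `ncol()` and `str()` are base R functions not mentioned
  elsewhere in these slides.

```r
dim(iris)          # number of rows and number of columns
nrow(iris)         # number of rows only
ncol(iris)         # number of columns only
head(iris, n=3)    # first 3 rows instead of the default 6
tail(iris, n=3)    # last 3 rows
str(iris)          # compact description of each column
```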

---

# Reading CSV files

* A quick way to get a summary of the data stored in a `data.frame` object is
  by calling the function `summary()` with that object as argument. Call
  `summary()` on the previous `data.frame` object and you should be getting an
  output similar to this one:

``` r
> summary(dat)
 setmana_epidemiologica      any            data_inici         data_final   
 Min.   : 1.00          Min.   :2022   19/12/2022:  370   25/12/2022:  370  
 1st Qu.:14.00          1st Qu.:2022   12/12/2022:  367   18/12/2022:  367  
 Median :27.00          Median :2022   14/11/2022:  366   20/11/2022:  366  
 Mean   :27.46          Mean   :2022   21/11/2022:  363   27/11/2022:  363  
 3rd Qu.:41.00          3rd Qu.:2023   28/11/2022:  362   04/12/2022:  362  
 Max.   :52.00          Max.   :2023   30/01/2023:  335   05/02/2023:  335  
                                       (Other)   :12284   (Other)   :12284  
   codi_regio                nom_regio      codi_ambit  
 Min.   : 0.00   BARCELONA        :7681   Min.   :   0  
 1st Qu.:64.00   GIRONA           :1790   1st Qu.:6400  
 Median :78.00   CAMP DE TARRAGONA:1544   Median :7801  
 Mean   :70.76   LLEIDA           :1495   Mean   :7077  
 3rd Qu.:78.00   CATALUNYA CENTRAL:1211   3rd Qu.:7802  
 Max.   :78.00   TERRES DE L'EBRE : 389   Max.   :7803  
                 (Other)          : 337                 
              nom_ambit                 virus                 sexe     
 BARCELONA CIUTAT  :2940   Rinovirus       :3836   Dona         :7966  
 METROPOLITANA NORD:2925   SARS-CoV-2      :3382   Home         :6370  
 METROPOLITANA SUD :1816   Grip            :2508   No disponible: 111  
 GIRONA            :1790   Parainfluenza   :1312                       
 CAMP DE TARRAGONA :1544   Coronavirus humà: 831                       
 LLEIDA            :1495   Metapneumovirus : 768                       
 (Other)           :1937   (Other)         :1810                       
   grup_edat    index_socioeconomic    positiu     
 1 i 2  :1081   Min.   :-1.000      Min.   :1.000  
 45 a 49: 990   1st Qu.: 3.000      1st Qu.:1.000  
 40 a 44: 979   Median : 3.000      Median :1.000  
 35 a 39: 885   Mean   : 3.076      Mean   :1.113  
 50 a 54: 869   3rd Qu.: 4.000      3rd Qu.:1.000  
 30 a 34: 830   Max.   : 4.000      Max.   :8.000  
 (Other):8813                                      
```

---

# Subsetting rows of a data frame

**Exercise:** Using the previously loaded `data.frame` object, build a
vector of logical values (a so-called _logical mask_) in one-to-one
correspondence with the rows, where a position in the vector is `TRUE` if the
corresponding row contains data about the virus `SARS-CoV-2`; see the section
on _Subsetting_ from
[this lecture](https://funcompbio.github.io/lecture6/#38).
If you store that _logical mask_ into an object called `mask`, the sum of its
truth values should give the following result:

``` r
> sum(mask)
[1] 3382
```

Using that logical mask, obtain a new `data.frame` object that includes only
data rows about the SARS-CoV-2 virus. Verify that the number of rows in the
new object matches the sum of `TRUE` values in the logical mask.
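
The same technique can be sketched on a different data set, so that the
exercise itself remains open. The example below uses `iris`, which ships with
R and is unrelated to the seminar data, and the object name `setosa` is chosen
only for illustration.

```r
mask <- iris$Species == "setosa"   # one logical value per row
head(mask)
sum(mask)                          # TRUE counts as 1, FALSE as 0
setosa <- iris[mask, ]             # keep rows where mask is TRUE, all columns
nrow(setosa) == sum(mask)
```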

---

# Contingency tables

* A common operation on factor columns of a `data.frame` object is to cross
tabulate them, producing a so-called
[contingency table](https://en.wikipedia.org/wiki/Contingency_table). The
simplest contingency table consists of calculating the frequency distribution
of a single column factor:

``` r
> tab <- table(dat$virus)
> tab

      Adenovirus        Bocavirus Coronavirus humà      Enterovirus 
             522              108              831              447 
            Grip  Metapneumovirus    Parainfluenza        Rinovirus 
            2508              768             1312             3836 
      SARS-CoV-2              VRS 
            3382              733 
```
* Note that the object `tab` is a *named vector*. You can extract the
  names with:

``` r
> names(tab)
 [1] "Adenovirus"       "Bocavirus"        "Coronavirus humà" "Enterovirus"     
 [5] "Grip"             "Metapneumovirus"  "Parainfluenza"    "Rinovirus"       
 [9] "SARS-CoV-2"       "VRS"             
```
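
* Because a one-dimensional table behaves like a named vector, its counts can
  be selected by name. The sketch below uses a small made-up factor rather
  than the seminar data; the object names `v` and `toy` are hypothetical.

```r
## made-up values, used only for illustration
v <- factor(c("Grip", "VRS", "Grip", "Rinovirus", "Grip", "VRS"))
toy <- table(v)
toy["Grip"]                       # select one count by its name
toy[c("VRS", "Rinovirus")]        # select several counts by name
sort(toy, decreasing=TRUE)        # order counts from largest to smallest
names(toy)[which.max(toy)]        # name of the most frequent value
```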

---

# Contingency tables

* Often, we may want to look at relative frequencies, also known as
  proportions, rather than absolute ones. In the previous example, we can
  either divide by the sum or use the function `proportions()`:

``` r
> tab/sum(tab)

      Adenovirus        Bocavirus Coronavirus humà      Enterovirus 
      0.03613207       0.00747560       0.05752059       0.03094068 
            Grip  Metapneumovirus    Parainfluenza        Rinovirus 
      0.17360006       0.05315983       0.09081470       0.26552225 
      SARS-CoV-2              VRS 
      0.23409704       0.05073718 
> proportions(tab)
```

---

# Contingency tables

* Interesting insights often come from looking at a multivariate frequency
  distribution obtained by cross-tabulating two or more factors:

``` r
> xtab <- table(dat$virus, dat$sexe)
> xtab
                  
                   Dona Home No disponible
  Adenovirus        255  254            13
  Bocavirus          51   54             3
  Coronavirus humà  480  344             7
  Enterovirus       197  245             5
  Grip             1281 1211            16
  Metapneumovirus   454  311             3
  Parainfluenza     762  539            11
  Rinovirus        2077 1732            27
  SARS-CoV-2       2004 1362            16
  VRS               405  318            10
```
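
* A two-way table can be indexed by row and column name, and its totals can be
  obtained with `rowSums()`, `colSums()` and `addmargins()`. These are base R
  functions not used elsewhere in these slides, and the data below are made up
  for illustration.

```r
## made-up data, used only for illustration
toy <- data.frame(virus=factor(c("Grip", "Grip", "VRS", "VRS", "VRS", "Grip")),
                  sexe=factor(c("Dona", "Home", "Dona", "Dona", "Home", "Dona")))
xt <- table(toy$virus, toy$sexe)
xt
dim(xt)             # number of levels of each factor
xt["VRS", "Dona"]   # one cell, selected by row and column name
rowSums(xt)         # totals per virus
colSums(xt)         # totals per sex
addmargins(xt)      # table with an added "Sum" row and column
```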

---

# Contingency tables

* We can also easily obtain columnwise percentages by using the function
  `proportions()` with the argument `margin=2` and multiplying the result by 100:

``` r
> xtabpct <- proportions(xtab, margin=2)*100
> xtabpct
                  
                         Dona       Home No disponible
  Adenovirus        3.2011047  3.9874411    11.7117117
  Bocavirus         0.6402209  0.8477237     2.7027027
  Coronavirus humà  6.0256088  5.4003140     6.3063063
  Enterovirus       2.4730103  3.8461538     4.5045045
  Grip             16.0808436 19.0109890    14.4144144
  Metapneumovirus   5.6992217  4.8822606     2.7027027
  Parainfluenza     9.5656540  8.4615385     9.9099099
  Rinovirus        26.0733116 27.1899529    24.3243243
  SARS-CoV-2       25.1569169 21.3814757    14.4144144
  VRS               5.0841075  4.9921507     9.0090090
```
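
* The effect of the `margin` argument can be checked on a small made-up table:
  with `margin=2` each column adds up to 100, while with `margin=1` each row
  does. This sketch assumes R 4.0.1 or later, where `proportions()` is
  available; `prop.table()` is the older name of the same function.

```r
## made-up counts, used only for illustration
xt <- matrix(c(2, 2, 1, 1), nrow=2,
             dimnames=list(c("Grip", "VRS"), c("Dona", "Home")))
xt <- as.table(xt)

bycol <- proportions(xt, margin=2) * 100   # each column adds up to 100
colSums(bycol)

byrow <- proportions(xt, margin=1) * 100   # each row adds up to 100
rowSums(byrow)

## prop.table() is the older name of the same function
all.equal(prop.table(xt, margin=2), proportions(xt, margin=2))
```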

---

# Simple plotting

* The main function to plot data points in (base) R is the function `plot()`.
  Type the following call to the `plot()` function in the R shell:

``` r
> x <- 1:10
> plot(x, 2*x)
```

<img src="data:image/png;base64,#/data/genomics/rcastelo/doc/docencia/fbc/seminar4/docs/index_files/figure-html/unnamed-chunk-11-1.png" height="250px" style="display: block; margin: auto;" />
* Repeat the last call adding the argument `type="l"`. What did this
argument change? Consult the
[help page of `plot()`](https://funcompbio.github.io/lecture6/#37) and find
out whether there is a value for the parameter `type` that allows one to plot
both dots and lines.

---

# Simple plotting

* We can compare the previous percentages of infecting viruses between
  men and women as follows:

``` r
> plot(xtabpct[, "Home"], xtabpct[, "Dona"], xlab="Home", ylab="Dona")
> abline(0, 1) ## draw a line where x == y
> text(xtabpct[, "Home"], xtabpct[, "Dona"], rownames(xtabpct))
```

<img src="data:image/png;base64,#/data/genomics/rcastelo/doc/docencia/fbc/seminar4/docs/index_files/figure-html/unnamed-chunk-12-1.png" height="250px" style="display: block; margin: auto;" />
* Can you spot which viruses show the most distinct percentages between
  men and women?
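
* Labels drawn with `text()` are centered on the points by default, which can
  make them hard to read. One possible variation, sketched below on made-up
  percentages, uses the `pos` argument of `text()` to place each label to the
  right of its point. The vectors `home` and `dona` are hypothetical and not
  derived from the seminar data.

```r
## made-up percentages, used only for illustration
home <- c(A=5, B=20, C=27)
dona <- c(A=3, B=16, C=26)

plot(home, dona, xlab="Home", ylab="Dona",
     xlim=c(0, 35), ylim=c(0, 35))      # same range on both axes
abline(0, 1, lty=2)                     # dashed line where x == y
text(home, dona, names(home), pos=4)    # labels to the right of each point

## positive values mean a larger percentage in "Home" than in "Dona"
home - dona
```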

---

# Simple plotting

**Exercise:** Using the `data.frame` object storing the data from the CSV file
`virus_detectats.csv`, build another `data.frame` object excluding rows with
the value `No disponible` in the column `sexe`. Using this new `data.frame`
object, cross tabulate the columns `virus` and `sexe` and figure out how
percentages of men and women change within each different infecting virus.
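
As a hint, and not a full solution, note that a factor keeps all its levels
after subsetting, so an excluded value may still show up in a table with a
count of zero. The sketch below illustrates this on a made-up factor and uses
`droplevels()`, a base R function not mentioned elsewhere in these slides, to
discard unused levels.

```r
## made-up factor, used only for illustration
sexe <- factor(c("Dona", "Home", "No disponible", "Dona"))
kept <- sexe[sexe != "No disponible"]
table(kept)               # "No disponible" still appears, with a count of 0
table(droplevels(kept))   # unused level removed
```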

---

# Session information

``` r
> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /home/rcastelo/Soft/R/R-4.4.0/lib/R/lib/libRblas.so 
LAPACK: /home/rcastelo/Soft/R/R-4.4.0/lib/R/lib/libRlapack.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Madrid
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] digest_0.6.37     R6_2.5.1          codetools_0.2-20  fastmap_1.2.0    
 [5] xfun_0.48         cachem_1.1.0      knitr_1.48        htmltools_0.5.8.1
 [9] rmarkdown_2.28    lifecycle_1.0.4   cli_3.6.3         sass_0.4.9       
[13] jquerylib_0.1.4   compiler_4.4.0    highr_0.11        tools_4.4.0      
[17] evaluate_1.0.1    bslib_0.8.0       xaringan_0.30     yaml_2.3.10      
[21] rlang_1.1.4       jsonlite_1.8.9   
```