class: title-slide, middle, center

# Get started with R and RStudio

## Robert Castelo
[robert.castelo@upf.edu](mailto:robert.castelo@upf.edu)
### Dept. of Medicine and Life Sciences
### Universitat Pompeu Fabra
<br>
## Fundamentals of Computational Biology
### BSc on Human Biology
### UPF School of Medicine and Life Sciences
### Academic Year 2024-2025

---

# Setup and background

* To follow these slides you need an installation of [R](https://www.r-project.org) and [RStudio](https://posit.co/downloads). You should install R **first** and, only once R has been installed, install RStudio.
* You can find instructions on how to install R **and** RStudio on your system in the [setup](https://funcompbio.github.io/setup/#r-and-rstudio) link. Once R and RStudio are installed, you should be able to start RStudio by double-clicking on an icon like the one below.

![](data:image/png;base64,#img/RStudioIcon.png)

---

# Setup and background

* To illustrate the use of R and RStudio, we will use the data files called `mostres_analitzades.csv` and `virus_detectats.csv` that were generated in the [first practical](https://funcompbio.github.io/practical1) from the infection surveillance system of Catalonia ([SIVIC](https://sivic.salut.gencat.cat) in its Catalan acronym).
* If you don't have these files, please review that practical and generate them again. Once you have obtained those two files, copy them into a new directory called `seminar4`.

---

# Starting R and RStudio

* RStudio is the most popular graphical user interface (GUI), or rather the most popular [integrated development environment (IDE)](https://en.wikipedia.org/wiki/Integrated_development_environment), for working **with** R. However, **RStudio is not R, RStudio runs R**. You can also run and work with R **without** RStudio.
* If you need to work with R on a remote server, then either that server runs [RStudio server](https://posit.co/products/open-source/rstudio-server/) and you can connect to it through a web browser or, alternatively, you only have a text-based connection through a terminal window, in which case you **cannot** use RStudio but you can still use R on the Unix command line.
* If, for whatever reason, you cannot use RStudio, you can still follow these slides, skipping the parts that specifically refer to the GUI of RStudio.

---

# Starting R and RStudio

* The RStudio window is initially divided into three main panes:

.pull-left[
* **R shell / prompt:** where you can interactively type R commands.
* **Environment / history:** where you can browse through the objects that are being created and the commands that you have typed in the R shell.
]
.pull-right[
![](data:image/png;base64,#img/RStudioFreshStart.png)
]
* **Files / plots / pkgs / help:** where you can navigate through the filesystem where RStudio is running and change the working directory, visualize plotting output, browse through the loaded packages and read help pages.

---

# Starting R and RStudio

* If you cannot start RStudio but you have installed R, you can still start R by typing on the Unix shell command line:
<pre>
$ R
</pre>
* Your terminal window should then be running R and look similar to the one below.

![:scale 70%](data:image/png;base64,#img/Rterminal.png)

---

class: small-code

# Quitting R and RStudio

* To quit R and RStudio you should type the following instruction in the R shell:
<pre>
> q()
</pre>
* You **should not** type the `>` character, since it corresponds to the R prompt and only indicates that the instruction to its right should be typed in the R shell. Normally, after that instruction R will ask:
<pre>
> q()
Save workspace image? [y/n/c]:
</pre>
* If you answer `y`, then R will store all the objects you created in a hidden file called `.RData` in the current working directory, and the next time you start R from that directory all those objects will be automatically loaded.
* Unless you have a reason to save the workspace when quitting, **you should always answer `n` to that question**; answering `c` _cancels_ the quitting instruction.
* In RStudio you can also quit R and RStudio by either closing the application window or through the `Quit` option in the _File_ or _RStudio_ top-level menu.

---

# R as a calculator

* The R shell can be used directly as a calculator. Type the following instructions and figure out what the operators do:
<pre>
> 1+1
> 5-4
> 3*2
> 6/2
> 4%%3
> 2**3
> 2^3
</pre>

---

# R as a calculator

* Type the following and press enter:
<pre>
> 1+
</pre>
* You should have obtained the following output:
<pre>
> 1+
+
</pre>
where the cursor is next to the plus sign (`+`) that has appeared in the line below. This plus sign indicates that the expression you have written is incomplete.
* This often happens when there is, for instance, a missing closing parenthesis. In this situation you can do two things: (1) complete the instruction or (2) press the `Esc` key, which will cancel the instruction (if R is running in a terminal rather than in RStudio, press `Ctrl+C` instead). Try cancelling this incomplete sum with the `Esc` key.

---

# RStudio contextual help

* Try to calculate the natural logarithm of 10 by typing:
<pre>
> log(10)
</pre>
If you are using RStudio, note that once you have typed the name of the function `log`, RStudio shows you a popup with contextual help, which you can use to choose among functions that have `log` as a prefix in their name.

![](data:image/png;base64,#img/RStudioPopUpContextualHelp.png)

---

# Getting and setting the working directory

* Whenever we want to read data files from, or write them to, a specific directory, we need to make sure that the default working directory of R points to that directory, just as with the current working directory (CWD) in the Unix filesystem.
* To find out the default working directory of R, you should call the `getwd()` function in the R shell:
<pre>
> getwd()
</pre>
* If the returned path is not the working directory that we want, we can change it with the function `setwd(dir)`, where `dir` should be the path that we want to set as working directory.

---

# Getting and setting the working directory

* In RStudio, using the _Files_ pane, we can navigate through the file system to the directory we want to set as working directory and then click on the `More` pull-down menu and select `Set As Working Directory`.

![:scale 50%](data:image/png;base64,#img/RStudioSetWorkingDirectory.png)

* Using the function `setwd()`, or the RStudio _Files_ pane, change the working directory to the folder `seminar4` that you should have created at the beginning of this document, and where you have copied the files `mostres_analitzades.csv` and `virus_detectats.csv`.

---

# Reading CSV files

* We can read CSV files in R using the function `read.csv()`. Let's read the CSV file `virus_detectats.csv` as follows:

``` r
> dat <- read.csv("virus_detectats.csv", stringsAsFactors=TRUE)
```
* Note that when writing the first letters of the filename, you can _autocomplete_ the rest of the filename by pressing the `TAB` key.
* In addition to the filename as first argument, we also specified `stringsAsFactors=TRUE`, which tells R to store character columns as a special kind of object called _factor_.
* The `read.csv()` function in R is analogous to the `read_csv()` function in the [Python module _pandas_](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).

---

# Reading CSV files

* The `read.csv()` function returns a `data.frame` object. You can verify it by typing

``` r
> class(dat)
[1] "data.frame"
```
* Figure out the dimensions of this `data.frame` object with the function `dim()`.
* Examine the first and last rows of this `data.frame` object with the functions `head()` and `tail()`.
* If you are running RStudio, go to the `Environment` pane and click on the small right triangle icon next to the object name and then on the object name itself.

---

class: small-code

# Reading CSV files

* A quick way to get a summary of the data stored in a `data.frame` object is by calling the function `summary()` with that object as argument. Call `summary()` on the previous `data.frame` object and you should get an output similar to this one:

``` r
> summary(dat)
 setmana_epidemiologica      any            data_inici         data_final
 Min.   : 1.00          Min.   :2022   19/12/2022:  370   25/12/2022:  370
 1st Qu.:14.00          1st Qu.:2022   12/12/2022:  367   18/12/2022:  367
 Median :27.00          Median :2022   14/11/2022:  366   20/11/2022:  366
 Mean   :27.46          Mean   :2022   21/11/2022:  363   27/11/2022:  363
 3rd Qu.:41.00          3rd Qu.:2023   28/11/2022:  362   04/12/2022:  362
 Max.   :52.00          Max.   :2023   30/01/2023:  335   05/02/2023:  335
                                       (Other)   :12284   (Other)   :12284
   codi_regio                nom_regio      codi_ambit
 Min.   : 0.00   BARCELONA        :7681   Min.   :   0
 1st Qu.:64.00   GIRONA           :1790   1st Qu.:6400
 Median :78.00   CAMP DE TARRAGONA:1544   Median :7801
 Mean   :70.76   LLEIDA           :1495   Mean   :7077
 3rd Qu.:78.00   CATALUNYA CENTRAL:1211   3rd Qu.:7802
 Max.   :78.00   TERRES DE L'EBRE : 389   Max.   :7803
                 (Other)          : 337
              nom_ambit                  virus                 sexe
 BARCELONA CIUTAT  :2940   Rinovirus        :3836   Dona         :7966
 METROPOLITANA NORD:2925   SARS-CoV-2       :3382   Home         :6370
 METROPOLITANA SUD :1816   Grip             :2508   No disponible: 111
 GIRONA            :1790   Parainfluenza    :1312
 CAMP DE TARRAGONA :1544   Coronavirus humà : 831
 LLEIDA            :1495   Metapneumovirus  : 768
 (Other)           :1937   (Other)          :1810
   grup_edat    index_socioeconomic    positiu
 1 i 2  :1081   Min.   :-1.000      Min.   :1.000
 45 a 49: 990   1st Qu.: 3.000      1st Qu.:1.000
 40 a 44: 979   Median : 3.000      Median :1.000
 35 a 39: 885   Mean   : 3.076      Mean   :1.113
 50 a 54: 869   3rd Qu.: 4.000      3rd Qu.:1.000
 30 a 34: 830   Max.   : 4.000      Max.   :8.000
 (Other):8813
```

---

# Subsetting rows of a data frame

**Exercise:** Using the previously loaded `data.frame` object, build a vector of logical values (a so-called _logical mask_) in one-to-one correspondence with the rows, where a position in the vector is `TRUE` if the corresponding row contains data about the virus `SARS-CoV-2`; see the section on _Subsetting_ from [this lecture](https://funcompbio.github.io/lecture6/#38). If you store that _logical mask_ into an object called `mask`, the sum of its truth values should give the following result:

``` r
> sum(mask)
[1] 3382
```
Using that logical mask, obtain a new `data.frame` object that includes only data rows about the SARS-CoV-2 virus. Verify that the number of rows in the new object matches the sum of `TRUE` values in the logical mask.

---

# Contingency tables

* A common operation on factor columns of a `data.frame` object is to cross tabulate them, producing a so-called [contingency table](https://en.wikipedia.org/wiki/Contingency_table). The simplest contingency table consists of calculating the frequency distribution of a single factor column:

``` r
> tab <- table(dat$virus)
> tab
       Adenovirus         Bocavirus Coronavirus humà        Enterovirus
              522               108               831               447
             Grip   Metapneumovirus     Parainfluenza         Rinovirus
             2508               768              1312              3836
       SARS-CoV-2               VRS
             3382               733
```
* Note that the object `tab`, of class `table`, behaves like a *named vector*. You can extract the names with:

``` r
> names(tab)
 [1] "Adenovirus"        "Bocavirus"         "Coronavirus humà " "Enterovirus"
 [5] "Grip"              "Metapneumovirus"   "Parainfluenza"     "Rinovirus"
 [9] "SARS-CoV-2"        "VRS"
```

---

# Contingency tables

* Often, we may want to look at relative frequencies, also known as proportions, rather than absolute ones.
In the previous example, we can either divide the table by its sum or use the function `proportions()`:

``` r
> tab/sum(tab)
       Adenovirus         Bocavirus Coronavirus humà        Enterovirus
       0.03613207        0.00747560        0.05752059        0.03094068
             Grip   Metapneumovirus     Parainfluenza         Rinovirus
       0.17360006        0.05315983        0.09081470        0.26552225
       SARS-CoV-2               VRS
       0.23409704        0.05073718
> proportions(tab)
       Adenovirus         Bocavirus Coronavirus humà        Enterovirus
       0.03613207        0.00747560        0.05752059        0.03094068
             Grip   Metapneumovirus     Parainfluenza         Rinovirus
       0.17360006        0.05315983        0.09081470        0.26552225
       SARS-CoV-2               VRS
       0.23409704        0.05073718
```

---

# Contingency tables

* Interesting insights often come from looking at a multivariate frequency distribution obtained by cross-tabulating two or more factors:

``` r
> xtab <- table(dat$virus, dat$sexe)
> xtab
                    Dona Home No disponible
  Adenovirus         255  254            13
  Bocavirus           51   54             3
  Coronavirus humà   480  344             7
  Enterovirus        197  245             5
  Grip              1281 1211            16
  Metapneumovirus    454  311             3
  Parainfluenza      762  539            11
  Rinovirus         2077 1732            27
  SARS-CoV-2        2004 1362            16
  VRS                405  318            10
```

---

# Contingency tables

* We can also easily obtain columnwise percentages by using the function `proportions()` with the argument `margin=2` and multiplying the result by 100:

``` r
> xtabpct <- proportions(xtab, margin=2)*100
> xtabpct
                          Dona       Home No disponible
  Adenovirus         3.2011047  3.9874411    11.7117117
  Bocavirus          0.6402209  0.8477237     2.7027027
  Coronavirus humà   6.0256088  5.4003140     6.3063063
  Enterovirus        2.4730103  3.8461538     4.5045045
  Grip              16.0808436 19.0109890    14.4144144
  Metapneumovirus    5.6992217  4.8822606     2.7027027
  Parainfluenza      9.5656540  8.4615385     9.9099099
  Rinovirus         26.0733116 27.1899529    24.3243243
  SARS-CoV-2        25.1569169 21.3814757    14.4144144
  VRS                5.0841075  4.9921507     9.0090090
```

---

# Simple plotting

* The main function to plot data points in (base) R is the function `plot()`.
Type the following call to the `plot()` function in the R shell:

``` r
> x <- 1:10
> plot(x, 2*x)
```
<img src="data:image/png;base64,#/data/genomics/rcastelo/doc/docencia/fbc/seminar4/docs/index_files/figure-html/unnamed-chunk-11-1.png" height="250px" style="display: block; margin: auto;" />

* Repeat the last call adding the argument `type="l"`. What did this argument change? Consult the [help page of `plot()`](https://funcompbio.github.io/lecture6/#37) and find out whether there is a value for the parameter `type` that allows one to plot both dots and lines.

---

# Simple plotting

* We can compare the previous percentages of infecting viruses between men and women as follows:

``` r
> plot(xtabpct[, "Home"], xtabpct[, "Dona"], xlab="Home", ylab="Dona")
> abline(0, 1) ## draw a line where x == y
> text(xtabpct[, "Home"], xtabpct[, "Dona"], rownames(xtabpct))
```
<img src="data:image/png;base64,#/data/genomics/rcastelo/doc/docencia/fbc/seminar4/docs/index_files/figure-html/unnamed-chunk-12-1.png" height="250px" style="display: block; margin: auto;" />

* Can you spot which viruses show the most distinct percentages between men and women?

---

# Simple plotting

**Exercise:** Using the `data.frame` object storing the data from the CSV file `virus_detectats.csv`, build another `data.frame` object excluding rows with the value `No disponible` in the column `sexe`. Using this new `data.frame` object, cross tabulate the columns `virus` and `sexe` and figure out how the percentages of men and women change within each infecting virus.
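---

# Simple plotting

The steps of the previous exercise (masking rows, dropping a factor level and computing percentages within each row of a contingency table) can be sketched on a small made-up data frame. This is only one possible approach: the object names `toy`, `keep`, `toy2`, `xt` and `rowpct`, and the six rows of data, are hypothetical and are **not** taken from the SIVIC files.

```r
## Hypothetical toy data (NOT the SIVIC data), only to illustrate the steps
toy <- data.frame(virus=c("Grip", "Grip", "VRS", "VRS", "VRS", "Grip"),
                  sexe=c("Dona", "Home", "Dona", "No disponible", "Home", "Dona"),
                  stringsAsFactors=TRUE)

## logical mask: TRUE for the rows whose sex is available
keep <- toy$sexe != "No disponible"

## subset the rows and drop the factor levels that are no longer used,
## so that 'No disponible' does not show up as an empty column later
toy2 <- droplevels(toy[keep, ])

## cross tabulate virus (rows) by sex (columns)
xt <- table(toy2$virus, toy2$sexe)

## rowwise percentages: within each virus, the row adds up to 100
rowpct <- proportions(xt, margin=1) * 100
rowpct
```

Note the use of `margin=1` instead of `margin=2`: percentages are then calculated within each row (virus) rather than within each column (sex).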
---

class: small-code

# Session information

``` r
> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /home/rcastelo/Soft/R/R-4.4.0/lib/R/lib/libRblas.so
LAPACK: /home/rcastelo/Soft/R/R-4.4.0/lib/R/lib/libRlapack.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Madrid
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] digest_0.6.37     R6_2.5.1          codetools_0.2-20  fastmap_1.2.0
 [5] xfun_0.48         cachem_1.1.0      knitr_1.48        htmltools_0.5.8.1
 [9] rmarkdown_2.28    lifecycle_1.0.4   cli_3.6.3         sass_0.4.9
[13] jquerylib_0.1.4   compiler_4.4.0    highr_0.11        tools_4.4.0
[17] evaluate_1.0.1    bslib_0.8.0       xaringan_0.30     yaml_2.3.10
[21] rlang_1.1.4       jsonlite_1.8.9
```