Exploring data with R and RStudio

class: title-slide, middle, center

# Exploring data with R and RStudio

## Robert Castelo
[robert.castelo@upf.edu](mailto:robert.castelo@upf.edu)
### Dept. of Medicine and Life Sciences
### Universitat Pompeu Fabra

<br>

## Fundamentals of Computational Biology
### BSc in Biomedical Sciences
### UPF School of Medicine and Life Sciences
### Academic Year 2025-2026

---

# Setup and background

* To follow these slides you need an installation of 
  [R](https://www.r-project.org) and [RStudio](https://posit.co/downloads).
  Install R **first**, and install RStudio only once R has been
  installed.  
  &nbsp;&nbsp; 
* You can find instructions on how to install R **and** RStudio on your
  system in the [setup](https://funcompbio.github.io/setup/#r-and-rstudio)
  link. Once R and RStudio are installed,
  you should be able to start RStudio by double-clicking on an icon like the
  one below.

![](img/RStudioIcon.png)

---

# Exploring the CatSalut registry data

* The Catalan health service (CatSalut) is organised in several health regions
  (called *"Regió Sanitària"* in Catalan). Each health region is further
  subdivided into several health areas (called *"Àrea Bàsica de Salut"* in
  Catalan). See
  [here](https://catsalut.gencat.cat/ca/coneix-catsalut/catsalut-territori/regions-sanitaries)
  for more information about the health regions and health areas in Catalonia.  
  &nbsp;&nbsp;
* The data describing the reference population of each health area is available in a
  central population registry maintained by CatSalut. You can access it by going
  to https://sivic.salut.gencat.cat/dades_obertes and clicking on the link
  called
  [Població de referència](https://analisi.transparenciacatalunya.cat/Salut/Registre-central-de-poblaci-del-CatSalut-poblaci-p/ftq4-h9vk).

---

# Exploring the CatSalut registry data

* Create a new directory called `seminar5` in which to store the data files that
  we will be working with in this seminar.  
  &nbsp;&nbsp;
* We will download the reference population data from CatSalut.
  The whole data set is too big because it contains data since 2018,
  so we will download only the data from 2025 using the following steps:  
  1. Go to the central population registry of CatSalut.
  2. Click on the `Dades` tab.
  3. Hover the mouse pointer over the first column, called `Any`, and click
     on the hamburger menu that appears on the right.
  4. Click on the funnel icon to filter the data by year.
  5. Write the value `2025` under the word `Iguala`.
  6. Click on `Exportar` and in the popup window, leaving the default options
     untouched, click on `Descarregar`.
* Copy the downloaded file into the `seminar5` directory and
  rename it to `RegistreCatSalut2025.csv`. If for some reason you cannot
  download the file, you can get it from
  [this link](RegistreCatSalut2025.csv) and copy it into `seminar5`.
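
The directory can also be created from the R console. The following is a minimal sketch, assuming that the R working directory is the place where you want `seminar5` to live. The variable name `csvfile` is only illustrative.

```r
# Create the seminar5 directory; no warning if it already exists.
dir.create("seminar5", showWarnings = FALSE)

# Build the expected path to the data file in a platform-independent way.
csvfile <- file.path("seminar5", "RegistreCatSalut2025.csv")

# TRUE once the downloaded file has been copied and renamed.
file.exists(csvfile)
```

If `file.exists(csvfile)` returns `FALSE`, check the working directory with `getwd()` and the file name for typos.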

---

# Exploring the CatSalut registry data

* The data from the CatSalut population registry is in the CSV file
  `RegistreCatSalut2025.csv`. Read it into R and explore its contents.
  Try to figure out what the different columns mean and what kind of
  data they contain.  
  &nbsp;&nbsp;
* Using the columns `població.oficial` and `Regió.Sanitària`, calculate
  the total population of each health region in Catalonia (**tip:** use
  the R function `aggregate`). Then replace the column name `x` in the
  resulting `data.frame` object with `Total` using the instruction
  `colnames(pbyr)[2] <- "Total"`, assuming `pbyr` is the name of the
  resulting `data.frame` object. Once you have obtained this calculation,
  try to answer the following questions:
    1. Which is the most populated health region?
    2. Which is the least populated one?
    3. Calculate the percentage of the population in Catalonia that
       lives in each health region.
    4. Calculate the percentage of the population in each health region
       that is older than 65 years.
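
The aggregation step can be sketched on a toy `data.frame` before trying it on the real file. The region labels and counts below are invented for illustration, and the commented `read.csv()` call assumes that the file sits in `seminar5` and is comma-separated. The column `Percent` is a hypothetical name.

```r
# Real data (assumes the file is in seminar5 and is comma-separated):
# reg <- read.csv(file.path("seminar5", "RegistreCatSalut2025.csv"))

# Toy stand-in; region labels and counts are invented for illustration.
reg <- data.frame(
  region = c("A", "A", "B", "B", "C"),
  population = c(100, 200, 50, 25, 125)
)
# Column names are set as strings, mirroring the names used in the exercise.
names(reg) <- c("Regió.Sanitària", "població.oficial")

str(reg)   # column types
head(reg)  # first rows

# Sum the population within each health region.
pbyr <- aggregate(reg[["població.oficial"]],
                  by = list(reg[["Regió.Sanitària"]]), FUN = sum)
colnames(pbyr)[2] <- "Total"

# Share of the overall population living in each region.
pbyr$Percent <- 100 * pbyr$Total / sum(pbyr$Total)

pbyr[which.max(pbyr$Total), ]  # most populated region
pbyr[which.min(pbyr$Total), ]  # least populated region
```

Called with `x` and `by` in this way, `aggregate()` names the value column `x`, which is why the second column name is replaced afterwards.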

---

# Exploring the CatSalut registry data

* Calculate again the population of each health region, this time
  broken down by sex (using the column `gènere`). The resulting numbers
  should match the ones in the document entitled
  *"Població de referència del Servei Català de la Salut per a l'any 2025:
  dades per ABS i UP assignada"* available at this
  [link](https://scientiasalut.gencat.cat/handle/11351/11946.2).
  &nbsp;&nbsp;
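
Breaking the totals down by sex only requires a second grouping variable in `aggregate()`. This is a sketch on toy data: the region labels, the values of `gènere`, and the counts are all invented, and the names `Region`, `Sex` and `pbyrs` are hypothetical.

```r
# Toy stand-in; labels and counts are invented for illustration.
reg <- data.frame(
  region = c("A", "A", "A", "B", "B"),
  sex = c("F", "M", "F", "F", "M"),
  population = c(10, 20, 30, 5, 15)
)
names(reg) <- c("Regió.Sanitària", "gènere", "població.oficial")

# Two grouping variables: one row per (region, sex) combination.
pbyrs <- aggregate(reg[["població.oficial"]],
                   by = list(Region = reg[["Regió.Sanitària"]],
                             Sex = reg[["gènere"]]),
                   FUN = sum)
colnames(pbyrs)[3] <- "Total"
pbyrs
```

With two grouping variables the summed values end up in the third column, so the column index in `colnames()` changes from 2 to 3.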

---

# Merging with the infection surveillance data

* Fetch the data file called `mostres_analitzades.csv` that you have downloaded
  in previous practicals from the infection surveillance system of Catalonia
  ([SIVIC](https://sivic.salut.gencat.cat) in its Catalan acronym) and copy it
  into the directory `seminar5`.  
  &nbsp;&nbsp;
* Read the file `mostres_analitzades.csv` and extract into a new `data.frame` object
  the rows corresponding to the first 15 days of January 2025.  
  &nbsp;&nbsp;
* Using these data from 2025, aggregate the number of positive
  cases by health region. In the resulting `data.frame` object, replace the
  values in the health region column with the same values in upper case using
  the function `toupper()`, and replace the column name `x` with `Positives`
  using the instruction `colnames(mbyr)[2] <- "Positives"`.
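
The filtering and aggregation steps above can be sketched on a toy `data.frame`. The column names `date`, `region` and `positives` are hypothetical stand-ins, because the actual column names of `mostres_analitzades.csv` are not listed here, and all values are invented.

```r
# Toy stand-in; column names and values are invented for illustration.
m <- data.frame(
  date = as.Date(c("2025-01-03", "2025-01-10", "2025-01-20",
                   "2025-01-05", "2024-12-30")),
  region = c("North", "North", "North", "South", "South"),
  positives = c(2, 3, 7, 1, 4)
)

# Keep only the rows dated 1 to 15 January 2025 (both ends included).
keep <- m$date >= as.Date("2025-01-01") & m$date <= as.Date("2025-01-15")
first15 <- m[keep, ]

# Sum the positive cases within each health region.
mbyr <- aggregate(first15$positives, by = list(first15$region), FUN = sum)

# Upper-case the region names and rename the value column.
mbyr[[1]] <- toupper(mbyr[[1]])
colnames(mbyr)[2] <- "Positives"
mbyr
```

If the date column is read as text, convert it with `as.Date()` first, giving the `format` argument that matches how dates are written in the file.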

---

# Merging with the infection surveillance data

* Assuming that the column of positive cases holds the total number of cases per
  health region, merge the resulting `data.frame` with the population data
  that you have obtained from the CatSalut registry using the function
  `merge()`, and calculate the incidence of positive cases per 100,000
  inhabitants in each health region.  
  &nbsp;&nbsp;
* You will notice that the incidence is very low, which is because the number
  of positive cases that we have
  obtained from the SIVIC data is not the total number of cases, but only the
  number of positive cases among the samples that were analysed in the SIVIC
  system. However, this is just an example to show how to merge data from
  different sources and calculate an incidence rate per 100,000 inhabitants.
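
The merge and the incidence calculation can be sketched as follows. The two `data.frame` objects are toy stand-ins with invented values, and the shared column name `Region` and the result name `inc` are hypothetical.

```r
# Toy stand-ins; region labels and values are invented for illustration.
pbyr <- data.frame(Region = c("NORTH", "SOUTH", "EAST"),
                   Total = c(200000, 500000, 50000))
mbyr <- data.frame(Region = c("NORTH", "SOUTH"),
                   Positives = c(10, 40))

# Join on the shared key; by default only regions present in both are kept.
inc <- merge(pbyr, mbyr, by = "Region")

# Incidence per 100,000 inhabitants.
inc$Incidence <- 1e5 * inc$Positives / inc$Total
inc
```

The region names must match exactly in both tables for `merge()` to pair them up. When the key columns have different names, `merge()` accepts `by.x` and `by.y` instead of `by`.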

---

class: small-code

# Session information

``` r
> sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: x86_64-apple-darwin20
Running under: macOS Sequoia 15.7.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] digest_0.6.39   R6_2.6.1        fastmap_1.2.0   xfun_0.56      
 [5] cachem_1.1.0    knitr_1.51      htmltools_0.5.9 rmarkdown_2.30 
 [9] lifecycle_1.0.5 cli_3.6.5       sass_0.4.10     jquerylib_0.1.4
[13] compiler_4.5.2  tools_4.5.2     evaluate_1.0.5  bslib_0.10.0   
[17] xaringan_0.31   yaml_2.3.12     otel_0.2.0      jsonlite_2.0.0 
[21] rlang_1.1.7    
```