class: title-slide, middle, center

# Exploring data with R and RStudio

## Robert Castelo
[robert.castelo@upf.edu](mailto:robert.castelo@upf.edu)
### Dept. of Medicine and Life Sciences
### Universitat Pompeu Fabra
<br>
## Fundamentals of Computational Biology
### BSc in Biomedical Sciences
### UPF School of Medicine and Life Sciences
### Academic Year 2025-2026

---

# Setup and background

* To follow these slides you need an installation of [R](https://www.r-project.org)
  and [RStudio](https://posit.co/downloads). Install R **first** and, only once R
  has been installed, install RStudio.

* You can find instructions on how to install R **and** RStudio on your system in
  the [setup](https://funcompbio.github.io/setup/#r-and-rstudio) link. Once R and
  RStudio are installed, you should be able to start RStudio by double-clicking on
  an icon like the one here below.

---

# Exploring the CatSalut registry data

* The Catalan health service (CatSalut) is organised in several health regions
  (called *"Regió Sanitària"* in Catalan). Each health region is further subdivided
  into several health areas (called *"Àrea Bàsica de Salut"* in Catalan). See
  [here](https://catsalut.gencat.cat/ca/coneix-catsalut/catsalut-territori/regions-sanitaries)
  for more information about the health regions and health areas in Catalonia.

* The data describing the reference population of each health area is available
  in a central population registry maintained by CatSalut, which can be accessed
  by going to https://sivic.salut.gencat.cat/dades_obertes and clicking on the link
  called
  [Població de referència](https://analisi.transparenciacatalunya.cat/Salut/Registre-central-de-poblaci-del-CatSalut-poblaci-p/ftq4-h9vk).

---

# Exploring the CatSalut registry data

* Create a new directory called `seminar5` in which to store the data files that
  we will be working with in this seminar.

* We will download the data of the reference population of the CatSalut.
  However, the whole data set is too big because it contains data from 2018
  onwards. Instead, we will download only the data from 2025 using the following
  steps:
    1. Go to the central population registry of CatSalut.
    2. Click on the `Dades` tab.
    3. Hover the mouse pointer on the first column called `Any`, and click on
       the hamburger menu that will appear on the right.
    4. Click on the funnel icon to filter the data by year.
    5. Write the value `2025` under the word `Iguala`.
    6. Click on `Exportar` and in the popup window, leaving the default options
       untouched, click on `Descarregar`.

* Copy the downloaded file into the `seminar5` directory and rename it as
  `RegistreCatSalut2025.csv`. If for some reason you cannot download the file,
  you can download it from [this link](RegistreCatSalut2025.csv) and copy it into
  `seminar5`.

---

# Exploring the CatSalut registry data

* The data from the CatSalut population registry is in the CSV file
  `RegistreCatSalut2025.csv`. Read it into R and explore its contents. Try to
  figure out what the different columns mean and what kind of data they contain.

* Using the columns `població.oficial` and `Regió.Sanitària`, calculate the total
  population of each health region in Catalonia (**tip:** use the R function
  `aggregate`). Then replace the column name `x` in the resulting `data.frame`
  object with `Total`, using the instruction `colnames(pbyr)[2] <- "Total"` if
  `pbyr` is the name of the resulting `data.frame` object. Once you have
  obtained this calculation, try to answer the following questions:
    1. Which is the most populated health region?
    2. Which is the least populated one?
    3. Calculate the percentage of the population in Catalonia that lives in
       each health region.
    4. Calculate the percentage of the population in each health region that is
       older than 65 years.
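The calculations above all follow the same pattern, which can be sketched on a toy `data.frame`. This is only an illustration under stated assumptions: the column names `region`, `age` and `population` and all the values are hypothetical and are not taken from the CatSalut registry.

```r
# Toy data: hypothetical regions, ages and population counts
toy <- data.frame(region = c("A", "A", "B", "B", "B"),
                  age = c(30, 70, 45, 80, 66),
                  population = c(100, 50, 200, 25, 25))

# Sum the population within each region; aggregate() names the summed column 'x'
pbyr <- aggregate(toy$population, list(region = toy$region), sum)
colnames(pbyr)[2] <- "Total"

# Percentage of the overall population that lives in each region
pbyr$Pct <- 100 * pbyr$Total / sum(pbyr$Total)

# Population older than 65 in each region, matched back by region name
over65 <- toy[toy$age > 65, ]
obyr <- aggregate(over65$population, list(region = over65$region), sum)
pbyr$PctOver65 <- 100 * obyr$x[match(pbyr$region, obyr$region)] / pbyr$Total

pbyr[which.max(pbyr$Total), "region"]  # most populated region
pbyr[which.min(pbyr$Total), "region"]  # least populated region
pbyr
```

Matching by region name with `match()` avoids relying on both aggregated tables listing the regions in the same order.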
<!--
# total population and percentage of the Catalan population per health region
dat <- read.csv("RegistreCatSalut2025.csv", stringsAsFactors=TRUE)
pbyr <- aggregate(dat$població.oficial, list(RS=dat$Regió.Sanitària), sum)
colnames(pbyr)[2] <- "Total"
pbyr$pct <- 100 * pbyr$Total / sum(pbyr$Total)
# percentage of each health region that is older than 65 years
mask <- dat$edat > 65
datover65 <- dat[mask, ]
pbyrover65 <- aggregate(datover65$població.oficial, list(RS=datover65$Regió.Sanitària), sum)
pbyrover65$pct <- 100 * pbyrover65$x / pbyr$Total
-->

---

# Exploring the CatSalut registry data

* Calculate again the population of each health region, this time broken down
  by sex (using the column `gènere`). The resulting numbers should match the
  ones in the document entitled *"Població de referència del Servei Català de la
  Salut per a l'any 2025: dades per ABS i UP assignada"* available at this
  [link](https://scientiasalut.gencat.cat/handle/11351/11946.2).

<!--
dat <- read.csv("RegistreCatSalut2025.csv", stringsAsFactors=TRUE)
pbyrs <- aggregate(dat$població.oficial, list(RS=dat$Regió.Sanitària, SEX=dat$gènere), sum)
-->

---

# Merging with the infection surveillance data

* Fetch the data file called `mostres_analitzades.csv` that you have downloaded
  in previous practicals from the infection surveillance system of Catalonia
  ([SIVIC](https://sivic.salut.gencat.cat) in its Catalan acronym) and copy it
  into the directory `seminar5`.

* Read the file `mostres_analitzades.csv` and extract into a new `data.frame`
  object the rows corresponding to the data from the first 15 days of January
  2025.

* Using these data from 2025, aggregate the number of positive cases by health
  region. In the resulting `data.frame` object, replace the values in the
  health region column with the same values in upper case using the function
  `toupper()`, and replace the column name `x` with `Positives` using the
  instruction `colnames(mbyr)[2] <- "Positives"`.
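The filtering and aggregation steps above can be sketched on a toy `data.frame`. The column names `start`, `end`, `region` and `positives`, the day/month/year date format and all the values are assumptions made for illustration; check the actual column names and date format in `mostres_analitzades.csv` before adapting it.

```r
# Toy data: hypothetical weekly sampling periods and positive counts
toy <- data.frame(start = c("30/12/2024", "06/01/2025", "13/01/2025", "06/01/2025"),
                  end = c("05/01/2025", "12/01/2025", "19/01/2025", "12/01/2025"),
                  region = c("Girona", "Girona", "Girona", "Lleida"),
                  positives = c(5, 7, 9, 3))

# Parse the day/month/year strings into Date objects so they can be compared
startdate <- as.Date(toy$start, format = "%d/%m/%Y")
enddate <- as.Date(toy$end, format = "%d/%m/%Y")

# Keep only the rows whose period lies entirely within 1 to 15 January 2025
mask <- startdate >= as.Date("2025-01-01") & enddate <= as.Date("2025-01-15")
sel <- toy[mask, ]

# Sum the positives per region, then upper-case the region names
mbyr <- aggregate(sel$positives, list(region = sel$region), sum)
mbyr$region <- toupper(mbyr$region)
colnames(mbyr)[2] <- "Positives"
mbyr
```

Requiring the whole period to fall inside the window is one possible choice; periods that only partly overlap the first 15 days are dropped here.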

---

# Merging with the infection surveillance data

* Assuming that the column of positive cases gave the total number of cases per
  health region, merge the resulting `data.frame` with the population data that
  you have obtained from the CatSalut registry using the function `merge()`,
  and calculate the incidence of positive cases per 100,000 inhabitants in each
  health region.

* You will notice that the incidence is very low. This is because the number of
  positive cases that we have obtained from the SIVIC data is not the total
  number of cases, but only the number of positive cases among the samples that
  were analysed in the SIVIC system. However, this is just an example to show
  how to merge data from different sources and calculate an incidence rate per
  100,000 inhabitants.

<!--
datm <- read.csv("mostres_analitzades.csv", stringsAsFactors=TRUE)
startdate <- as.Date(datm$data_inici, format="%d/%m/%Y")
enddate <- as.Date(datm$data_final, format="%d/%m/%Y")
# rows fully contained in the first 15 days of January 2025
mask <- startdate >= as.Date("2025-01-01") & enddate <= as.Date("2025-01-15")
datm2 <- datm[mask, ]
mbyr <- aggregate(datm2$positiu, list(RS=datm2$nom_regio), sum)
mbyr$RS <- toupper(mbyr$RS)
colnames(mbyr)[2] <- "Positives"
# merge by the shared column RS and calculate the incidence
dat2 <- merge(pbyr, mbyr)
dat2$Incidence <- 100000 * dat2$Positives / dat2$Total
-->

---
class: small-code

# Session information

``` r
> sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: x86_64-apple-darwin20
Running under: macOS Sequoia 15.7.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] digest_0.6.39     R6_2.6.1          fastmap_1.2.0     xfun_0.56
 [5] cachem_1.1.0      knitr_1.51        htmltools_0.5.9   rmarkdown_2.30
 [9] lifecycle_1.0.5   cli_3.6.5         sass_0.4.10       jquerylib_0.1.4
[13] compiler_4.5.2    tools_4.5.2       evaluate_1.0.5    bslib_0.10.0
[17] xaringan_0.31     yaml_2.3.12       otel_0.2.0        jsonlite_2.0.0
[21] rlang_1.1.7
```
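
---

# Merging with the infection surveillance data

Returning to the merging exercise, the following is a minimal sketch of how `merge()` joins two `data.frame` objects on a shared column and how an incidence per 100,000 inhabitants follows from the merged table. The region names and all the numbers are hypothetical and chosen only to make the arithmetic easy to check.

```r
# Toy population and case counts; values are hypothetical
pop <- data.frame(RS = c("GIRONA", "LLEIDA"), Total = c(800000, 400000))
pos <- data.frame(RS = c("LLEIDA", "GIRONA"), Positives = c(20, 80))

# merge() joins on the columns the two objects have in common, here 'RS',
# so the rows do not need to be in the same order in both objects
inc <- merge(pop, pos)

# Incidence per 100,000 inhabitants
inc$Incidence <- 100000 * inc$Positives / inc$Total
inc
```

Because `merge()` matches rows by exact value, the region names must be spelled identically in both objects, which is why the region names are converted to upper case before merging.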