class: title-slide, middle, center # Data manipulation with R ## Robert Castelo [robert.castelo@upf.edu](mailto:robert.castelo@upf.edu) ### Dept. of Medicine and Life Sciences ### Universitat Pompeu Fabra <br> ## Fundamentals of Computational Biology ### BSc in Biomedical Sciences ### UPF School of Medicine and Life Sciences ### Academic Year 2025-2026 --- class: center, middle, inverse # Control flow, vectorization and implicit looping --- ## Control flow statements * Sequentially evaluated statements may be written one below another using newlines or in the same line separated by semicolons (`;`): <pre> i <- 1 ; j <- 0 </pre> * Conditional and looping statements have the condition surrounded by parentheses `( )` and the associated code by braces `{ }` (except in single-line code blocks). <pre> if (i > 0) j <- i * 10 while (i < j) { i <- i + 1 print(i) } for (i in 1:10) { j <- j + i print(j) } </pre> --- ## Interpretation and vectorization * The R language is interpreted, vectorized and includes implicit memory management. * Looping statements are slow in R. The following example would be a suboptimal approach in R following a classical programming style: <pre> x <- 1:100 logsum <- 0 for (i in 1:length(x)) { logsum <- logsum + log(x[i]) } </pre> * This would be the vectorized optimal approach in R: <pre> logsum <- sum(log(x)) </pre> --- ## Lists and Implicit looping * In some cases we might still want to iterate through some specific elements. * R provides a more compact way to iterate over lists, and other objects, by using functions for implicit looping: * `lapply()` returns a list with the same length as the input, where each element is the result of the applied function. * `sapply()` attempts to simplify the resulting data structure, returning a vector if the length of each list element is 1. * To use this approach we usually want to convert out data into a list to be able to iterate through it. The `split()` function is the most efficient way to do it, as it will split an object (vector, data frame) into the groups defined by a factor variable. --- ## Lists and Implicit looping: an example * R has an example `data.frame` object called `iris`, directly available after starting R, corresponding to a classical data set on the taxonomic characteristics of flowers used by [Fisher (1936)](https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1469-1809.1936.tb02137.x) to introduce certain statistical concepts. It contains measurements of the flowers of fifty plants for each of the three species _Iris setosa_, _Iris versicolor_ and _Iris virginica_. <pre> > head(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa > levels(iris$Species) [1] "setosa" "versicolor" "virginica" </pre> --- ## Lists and Implicit looping: an example * Calculate mean [sepal](https://en.wikipedia.org/wiki/Sepal) length per species, using **explicit** looping: <pre> > means_all <- c() > for (i in levels(iris$Species)) { mask <- iris$Species == i sepal_lengths <- iris$Sepal.Length[mask] mean <- mean(sepal_lengths) means_all <- c(means_all, mean) } > names(means_all) <- levels(iris$Species) > means_all setosa versicolor virginica 5.006 5.936 6.588 </pre> --- ## Lists and Implicit looping: an example * A more compact way, using **implicit** looping. First group values by category using the function `split()`: <pre> > sepal_list <- split(iris$Sepal.Length, iris$Species) > class(sepal_list) [1] "list" > sepal_list $setosa [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 [20] 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 [39] 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 $versicolor [1] 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 [20] 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 [39] 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 $virginica [1] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 [20] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 [39] 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9 </pre> --- ## Lists and Implicit looping: an example * Second, use either the `lapply()` or `sapply()` function: <pre> > lapply(sepal_list, mean) $setosa [1] 5.006 $versicolor [1] 5.936 $virginica [1] 6.588 > sapply(sepal_list, mean) setosa versicolor virginica 5.006 5.936 6.588 </pre> * The difference between `lapply()` and `sapply()` is that the former always returns a `list` object, while the latter returns a _simplified_ object (e.g., a vector) if possible. --- ## Summary statistics from data subsets * The `aggregate()` function in R provides a more direct way to obtain summary statistics from data subsets: <pre> > aggregate(iris$Sepal.Length, list(Species=iris$Species), mean) Species x 1 setosa 5.006 2 versicolor 5.936 3 virginica 6.588 </pre> -- <pre> > aggregate(iris[, colnames(iris) != "Species"], list(Species=iris$Species), mean) Species Sepal.Length Sepal.Width Petal.Length Petal.Width 1 setosa 5.006 3.428 1.462 0.246 2 versicolor 5.936 2.770 4.260 1.326 3 virginica 6.588 2.974 5.552 2.026 </pre> * The second argument must be a `list` object and the result is a `data.frame` object. --- ## Summary statistics from data subsets * The `aggregate()` function allows one to use a more compact _formula_ syntax: <pre> > aggregate(Sepal.Length ~ Species, data=iris, mean) Species Sepal.Length 1 setosa 5.006 2 versicolor 5.936 3 virginica 6.588 </pre> <pre> > aggregate(. ~ Species, data=iris, mean) Species Sepal.Length Sepal.Width Petal.Length Petal.Width 1 setosa 5.006 3.428 1.462 0.246 2 versicolor 5.936 2.770 4.260 1.326 3 virginica 6.588 2.974 5.552 2.026 </pre> * Compare the previous line of code with the more classical programming approach using explicit looping shown in a [previous slide](#7). --- ## Concluding remarks * Control flow statements can conditionally execute code surrounded by braces. Identation is only necessary for readability. * Vectorized operations are much faster than loops. * Lists and implicit looping are a very efficient way to apply functions to specific groups of data. * The aggregate function is a more direct way to calculate summary statistics from data subsets. --- class: center, middle, inverse # Data manipulation --- ## Sorting and permuting data .left-column[ * Sorting, or [ordering](https://en.wikipedia.org/wiki/Total_order), data refers to arrange values in a particular order (e.g., ascending or descending). <pre style="font-size:10pt"> > dat <- data.frame(NAME=c("Joan", "Maria", "Sílvia"), "TIME (min.)"=c(27, 25, 29), check.names=FALSE) > dat NAME TIME (min.) 1 Joan 27 2 Maria 25 3 Sílvia 29 > sort(dat[["TIME (min.)"]]) [1] 25 27 29 </pre> ] .right-column[ <p style="margin-top: -20px"> * [Permuting](https://en.wikipedia.org/wiki/Permutation) data refers to arrange observations by a given vector of indices that produce an order. <pre style="font-size:10pt"> > ord <- order(dat[["TIME (min.)"]]) > ord [1] 2 1 3 > dat[ord, ] NAME TIME (min.) 2 Maria 25 1 Joan 27 3 Sílvia 29 </pre> ] --- ## Sorting and permuting data * In some cases we might be interested in sorting or ordering our objects. <pre> > numbers <- round(runif(15, min=0, max=50)) > numbers [1] 47 34 45 24 18 0 3 3 15 24 37 31 34 26 6 </pre> * There are two main functions to do this: * `sort()` allows to sort a vector into ascending or descending order. <pre> > sort(numbers, decreasing=FALSE) [1] 0 3 3 6 15 18 24 24 26 31 34 34 37 45 47 </pre> * `order()` returns a permutation of the positions of the input vector that is typically used to rearrange a vector into ascending or descending order, as well as the rows or columns of `matrix` and `data.frame` objects. <pre> > order(numbers, decreasing=FALSE) [1] 6 7 8 15 9 5 4 10 14 12 2 13 11 3 1 </pre> --- ## Sorting and permuting data * Example of using `order()` to sort the rows of a `data.frame` object: <pre> > dat1 <- data.frame(COUNTY=c("OSONA", "MARESME", "SEGRIA"), POPULATION=c(180364, 506901, 233615)) > dat1 COUNTY POPULATION 1 OSONA 180364 2 MARESME 506901 3 SEGRIA 233615 > dat1[order(dat1$COUNTY), ] COUNTY POPULATION 2 MARESME 506901 1 OSONA 180364 3 SEGRIA 233615 > dat1[order(dat1$POPULATION), ] COUNTY POPULATION 1 OSONA 180364 3 SEGRIA 233615 2 MARESME 506901 </pre> --- ## Pasting rows and columns * Rows and columns can be _pasted_ into a data frame using: * `cbind()` to paste columns. <pre> > dat2 <- data.frame(COUNTY=c("OSONA", "MARESME", "SEGRIA"), VACCINATION=c(119624, 343637, 156648)) > dat3 <- cbind(dat1, VACCINATION=dat2[, "VACCINATION"]) > dat3 COUNTY POPULATION VACCINATION 1 OSONA 180364 119624 2 MARESME 506901 343637 3 SEGRIA 233615 156648 </pre> ⚠️ __Important__: When pasting data you have to make sure that they have the same dimensions. In the case of pasting columns with `cbind`, `data.frame` objects must have the same number of rows. --- ## Pasting rows and columns * Rows and columns can be _pasted_ into a data frame using: * `rbind()` to paste rows. <pre> > dat4 <- data.frame(COUNTY="SELVA", POPULATION=191432, VACCINATION=122082) > dat5 <- rbind(dat3, dat4) > dat5 COUNTY POPULATION VACCINATION 1 OSONA 180364 119624 2 MARESME 506901 343637 3 SEGRIA 233615 156648 4 SELVA 191432 122082 </pre> ⚠️ __Important__: When pasting data you have to make sure that they have the same dimensions. In the case of pasting rows with `rbind`, `data.frame` objects must have the same number of columns with identical column names. --- ## The recycling rule * When performing operations between vectors of different length, R follows a specific policy called the _recycling_ rule, by which the shorter vector is _recycled_. * This rule is straightforward when the length of the longer vector is multiple of the length of the shorter vector: <pre> > 2 * 1:4 [1] 2 4 6 8 > 2:3 * 1:4 [1] 2 6 6 12 </pre> * When the length of the longer vector is not multiple of the length of the shorter vector, R issues a warning: <pre> > 2:3 * 1:5 [1] 2 6 6 12 10 Warning message: In 2:3 * 1:5 : longer object length is not a multiple of shorter object length </pre> --- ## Adding columns * We can add a new column to a `data.frame` object by doing an assignment of a vector of values on that new column, specified with either the `[[` or the `$` operator, followed by the name of the new column. * You can perform operations with columns and other columns and/or vectors and store the result in a new column. <pre> > dat5$VPERCENTAGE <- 100 * dat5$VACCINATION / dat5$POPULATION > dat5 COUNTY POPULATION VACCINATION VPERCENTAGE 1 OSONA 180364 119624 66.32366 2 MARESME 506901 343637 67.79174 3 SEGRIA 233615 156648 67.05391 4 SELVA 191432 122082 63.77304 </pre> * Note that, in the above line, R applied the recycling rule, reusing the value `100` to multiply every value in the vector resulting from the division operation. --- ## Combining datasets * Combining datasets of different dimensions is one of the most common operations performed for answering questions from data.  * In this example, once we combine the city population with the absolute incidence, we can now calculate the [cumulative incidence](https://en.wikipedia.org/wiki/Incidence_%28epidemiology%29) per 100,000 inhabitants: <pre> > dat$CASESPER100K <- 100000 * dat$CASES / dat$POPULATION </pre> --- ## Combining datasets: `match()` * `match()`: returns a vector of the positions of (first) matches of its first argument in its second. <pre> > datinc <- data.frame(CITY=c("Barcelona", "Barcelona", "Girona", "Girona"), SEX=c("Male", "Female", "Male", "Female"), CASES=c(178, 151, 58, 49)) > datpop <- data.frame(CITY=c("Barcelona", "Girona", "Lleida"), POPULATION=c(1664182, 103369, 140403)) > mt <- match(datinc$CITY, datpop$CITY) > mt [1] 1 1 2 2 > datinc$POPULATION <- datpop$POPULATION[mt] > datinc CITY SEX CASES POPULATION 1 Barcelona Male 178 1664182 2 Barcelona Female 151 1664182 3 Girona Male 58 103369 4 Girona Female 49 103369 </pre> --- ## Combining datasets: `merge()` * `merge()`: combine two `data.frame` objects by common columns or row names, or do other versions of database [_join_](https://en.wikipedia.org/wiki/Relational_algebra#Joins_and_join-like_operators) operations. We can specify different column names for _joining_ using the arguments `by.x` and `by.y`. <pre> > merge(datinc, datpop) CITY SEX CASES POPULATION 1 Barcelona Male 178 1664182 2 Barcelona Female 151 1664182 3 Girona Male 58 103369 4 Girona Female 49 103369 </pre> --- ## Combining datasets: `merge()` * Parameters `all`, `all.x` and `all.y` in the `merge()` function control the way in which non-matching rows are returned, and their possible settings correspond to different type of _join operations_.  .footer[Image from [R for Dummies, _How to use the merge() function with data sets in R_.](https://www.dummies.com/book/technology/programming-web-design/r/r-for-dummies-2nd-edition-281846)] <pre> > merge(datinc, datpop, all=TRUE) CITY SEX CASES POPULATION 1 Barcelona Male 178 1664182 2 Barcelona Female 151 1664182 3 Girona Male 58 103369 4 Girona Female 49 103369 5 Lleida <NA> NA 140403 </pre> --- ## Concluding remarks * Analyzing data involves so-called [data wrangling](https://en.wikipedia.org/wiki/Data_wrangling) steps in which we may have to reorder data, paste rows and columns, combine datasets or add new columns. * The recycling rule is a powerful feature, but beware of situations in which it may be unintentionally applied. * It is important to understand the difference between _sorting_ and _permuting_, which we can do in R with the functions `sort()` and `order()`. * Pasting rows and columns is straightforward with `rbind()` and `cbind()`. * Most of the times, we will be interested in the _natural join_ operation, in which we combine rows from two datasets with **matching values in a common column**. This can be achieved either with the `match()` function or the `merge()` function with default parameters.