Data manipulation with R

class: title-slide, middle, center

# Data manipulation with R

## Robert Castelo
[robert.castelo@upf.edu](mailto:robert.castelo@upf.edu)
### Dept. of Medicine and Life Sciences
### Universitat Pompeu Fabra

<br>

## Fundamentals of Computational Biology
### BSc in Biomedical Sciences
### UPF School of Medicine and Life Sciences
### Academic Year 2025-2026

---
class: center, middle, inverse

# Control flow, vectorization and implicit looping

---

## Control flow statements

* Sequentially evaluated statements may be written one below another using
  newlines or in the same line separated by semicolons (`;`):
<pre>
i <- 1 ; j <- 0
</pre>
* Conditional and looping statements have the condition surrounded by
  parentheses `( )` and the associated code by braces `{ }` (except in
  single-line code blocks).
<pre>
if (i > 0)
&nbsp;&nbsp;j <- i * 10
&nbsp;&nbsp;
while (i < j) {
&nbsp;&nbsp;i <- i + 1
&nbsp;&nbsp;print(i)
}
&nbsp;&nbsp;
for (i in 1:10) {
&nbsp;&nbsp;j <- j + i
&nbsp;&nbsp;print(j)
}
</pre>

---

## Interpretation and vectorization

* The R language is interpreted, vectorized and includes implicit memory
  management.  
  &nbsp;&nbsp;
* Looping statements are slow in R. The following example would be a suboptimal
  approach in R following a classical programming style:
  <pre>
  x <- 1:100
  logsum <- 0
  for (i in 1:length(x)) {
      logsum <- logsum + log(x[i])
  }
  </pre>
* This would be the vectorized optimal approach in R:
  <pre>
  logsum <- sum(log(x))
  </pre>

---

## Lists and Implicit looping

* In some cases we might still want to iterate through some specific elements.  
  &nbsp;&nbsp;
* R provides a more compact way to iterate over lists, and other objects, by using
  functions for implicit looping:
  * `lapply()` returns a list with the same length as the input, where each element
    is the result of the applied function.
  * `sapply()` attempts to simplify the resulting data structure, returning a 
    vector if the length of each list element is 1.
* To use this approach we usually want to convert out data into a list to be able
  to iterate through it. The `split()` function is the most efficient way to do it,
  as it will split an object (vector, data frame) into the groups defined by a factor
  variable.

---

## Lists and Implicit looping: an example

* R has an example `data.frame` object called `iris`, directly available
  after starting R, corresponding to a classical data set on the taxonomic
  characteristics of flowers used by
  [Fisher (1936)](https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1469-1809.1936.tb02137.x)
  to introduce certain statistical concepts. It contains measurements of the
  flowers of fifty plants for each of the three species _Iris setosa_,
  _Iris versicolor_ and _Iris virginica_.
<pre>
> head(iris)
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> levels(iris$Species)
[1] "setosa"     "versicolor" "virginica" 
</pre>

---

## Lists and Implicit looping: an example

* Calculate mean [sepal](https://en.wikipedia.org/wiki/Sepal)
  length per species, using **explicit** looping:
<pre>
> means_all <- c()
> for (i in levels(iris$Species)) {
&nbsp;&nbsp;&nbsp;&nbsp;mask <- iris$Species == i
&nbsp;&nbsp;&nbsp;&nbsp;sepal_lengths <- iris$Sepal.Length&#091;mask&#093;
&nbsp;&nbsp;&nbsp;&nbsp;mean <- mean(sepal_lengths)
&nbsp;&nbsp;&nbsp;&nbsp;means_all <- c(means_all, mean)
&nbsp;&nbsp;}
> names(means_all) <- levels(iris$Species)
> means_all
  setosa versicolor  virginica 
   5.006      5.936      6.588 
</pre>

---

## Lists and Implicit looping: an example

* A more compact way, using **implicit** looping. First
  group values by category using the function `split()`:

<pre>
> sepal_list <- split(iris$Sepal.Length, iris$Species)
> class(sepal_list)
[1] "list"
> sepal_list
$setosa
 [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7
[20] 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9
[39] 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0
$versicolor
 [1] 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2
[20] 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3
[39] 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7
$virginica
 [1] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
[20] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4
[39] 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
</pre>

---

## Lists and Implicit looping: an example

* Second, use either the `lapply()` or `sapply()` function:

<pre>
> lapply(sepal_list, mean)
$setosa
[1] 5.006

$versicolor
[1] 5.936

$virginica
[1] 6.588
> sapply(sepal_list, mean)
  setosa versicolor  virginica 
   5.006      5.936      6.588 
</pre>

* The difference between `lapply()` and `sapply()` is that
  the former always returns a `list` object, while the latter
  returns a _simplified_ object (e.g., a vector) if possible.

---

## Summary statistics from data subsets

* The `aggregate()` function in R provides a more direct way
  to obtain summary statistics from data subsets:

<pre>
> aggregate(iris$Sepal.Length, list(Species=iris$Species), mean)
     Species     x
1     setosa 5.006
2 versicolor 5.936
3  virginica 6.588
</pre>

--
<pre>
> aggregate(iris[, colnames(iris) != "Species"],
            list(Species=iris$Species), mean)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
</pre>

* The second argument must be a `list` object and
  the result is a `data.frame` object.

---

## Summary statistics from data subsets

* The `aggregate()` function allows one to use a more
  compact _formula_ syntax:

<pre>
> aggregate(Sepal.Length ~ Species, data=iris, mean)
     Species Sepal.Length 
1     setosa        5.006
2 versicolor        5.936
3  virginica        6.588
</pre>

<pre>
> aggregate(. ~ Species, data=iris, mean)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
</pre>

* Compare the previous line of code with the more classical
  programming approach using explicit looping shown in a
  [previous slide](#7).

---

## Concluding remarks

* Control flow statements can conditionally execute code surrounded by braces.
  Identation is only necessary for readability.  
  &nbsp;&nbsp;
* Vectorized operations are much faster than loops.  
  &nbsp;&nbsp;
* Lists and implicit looping are a very efficient way to apply functions to 
  specific groups of data.  
  &nbsp;&nbsp;
* The aggregate function is a more direct way to calculate summary statistics
  from data subsets.

---

class: center, middle, inverse

# Data manipulation

---

## Sorting and permuting data

.left-column[
* Sorting, or [ordering](https://en.wikipedia.org/wiki/Total_order), data
  refers to arrange values in a particular order (e.g., ascending or
  descending).
  <pre style="font-size:10pt">
  > dat <- data.frame(NAME=c("Joan", "Maria",
                             "Sílvia"),
                        "TIME (min.)"=c(27, 25,
                                        29),
                        check.names=FALSE)
  &nbsp;&nbsp;
  > dat
        NAME TIME (min.)
  1   Joan          27
  2  Maria          25
  3 Sílvia          29
  &nbsp;&nbsp;
  > sort(dat[["TIME (min.)"]])
  [1] 25 27 29
  </pre>
]

.right-column[
<p style="margin-top: -20px">
* [Permuting](https://en.wikipedia.org/wiki/Permutation) data refers to
  arrange observations by a given vector of indices that produce an order.
  <pre style="font-size:10pt">
  > ord <- order(dat[["TIME (min.)"]])
  &nbsp;&nbsp;
  > ord
  [1] 2 1 3
  > dat[ord, ]
      NAME TIME (min.)
  2  Maria          25
  1   Joan          27
  3 Sílvia          29
  </pre>
]

---

## Sorting and permuting data

* In some cases we might be interested in sorting or ordering our objects. 
  <pre>
  > numbers <- round(runif(15, min=0, max=50))
  > numbers
   [1] 47 34 45 24 18  0  3  3 15 24 37 31 34 26  6
  </pre>
* There are two main functions to do this:
  * `sort()` allows to sort a vector into ascending or descending order.
    <pre>
    > sort(numbers, decreasing=FALSE)
    [1]  0  3  3  6 15 18 24 24 26 31 34 34 37 45 47
    </pre>
  * `order()` returns a permutation of the positions of the input vector that
    is typically used to rearrange a vector into ascending or descending order,
    as well as the rows or columns of `matrix` and `data.frame` objects.
    <pre>
    > order(numbers, decreasing=FALSE)
    [1]  6  7  8 15  9  5  4 10 14 12  2 13 11  3  1
    </pre>

---

## Sorting and permuting data

* Example of using `order()` to sort the rows of a `data.frame` object:
<pre>
> dat1 <- data.frame(COUNTY=c("OSONA", "MARESME", "SEGRIA"),
                         POPULATION=c(180364, 506901, 233615))
> dat1
       COUNTY POPULATION
1   OSONA     180364
2 MARESME     506901
3  SEGRIA     233615
> dat1[order(dat1$COUNTY), ]
       COUNTY POPULATION
2 MARESME     506901
1   OSONA     180364
3  SEGRIA     233615
> dat1[order(dat1$POPULATION), ]
       COUNTY POPULATION
1   OSONA     180364
3  SEGRIA     233615
2 MARESME     506901
</pre>

---

## Pasting rows and columns

* Rows and columns can be _pasted_ into a data frame using:
  * `cbind()` to paste columns.
    <pre>
    > dat2 <- data.frame(COUNTY=c("OSONA", "MARESME", "SEGRIA"),
                             VACCINATION=c(119624, 343637, 156648))
    > dat3 <- cbind(dat1, VACCINATION=dat2[, "VACCINATION"])
    > dat3
           COUNTY POPULATION VACCINATION
    1   OSONA     180364      119624
    2 MARESME     506901      343637
    3  SEGRIA     233615      156648
    </pre>

⚠️ __Important__: When pasting data you have to make sure 
that they have the same dimensions. In the case of pasting columns with `cbind`,
`data.frame` objects must have the same number of rows.

---

## Pasting rows and columns

* Rows and columns can be _pasted_ into a data frame using:
  * `rbind()` to paste rows.
  <pre>
  > dat4 <- data.frame(COUNTY="SELVA", POPULATION=191432,
                             VACCINATION=122082)
  > dat5 <- rbind(dat3, dat4)
  > dat5
           COUNTY POPULATION VACCINATION
  1   OSONA     180364      119624
  2 MARESME     506901      343637
  3  SEGRIA     233615      156648
  4   SELVA     191432      122082
  </pre>

⚠️ __Important__: When pasting data you have to make sure 
that they have the same dimensions. In the case of pasting rows with `rbind`,
`data.frame` objects must have the same number of columns with identical column names.

---

## The recycling rule

* When performing operations between vectors of different length, R
  follows a specific policy called the _recycling_ rule, by which the
  shorter vector is _recycled_.  
  &nbsp;&nbsp;
* This rule is straightforward when the length of the longer vector is multiple
  of the length of the shorter vector:
  <pre>
  > 2 * 1:4
  [1] 2 4 6 8
  > 2:3 * 1:4
  [1] 2  6  6  12
  </pre>
* When the length of the longer vector is not multiple of the length of the
  shorter vector, R issues a warning:
  <pre>
  > 2:3 * 1:5
  [1] 2  6  6 12 10
  Warning message:
  In 2:3 * 1:5 :
    longer object length is not a multiple of shorter object length
  </pre>

---

## Adding columns

* We can add a new column to a `data.frame` object by doing an assignment of a
  vector of values on that new column, specified with either the `[[` or the `$`
  operator, followed by the name of the new column.  
&nbsp;&nbsp;
* You can perform operations with columns and other columns and/or vectors and 
store the result in a new column.
  <pre>
  > dat5$VPERCENTAGE <- 100 * dat5$VACCINATION / dat5$POPULATION
  > dat5
       COUNTY POPULATION VACCINATION VPERCENTAGE
  1   OSONA     180364      119624    66.32366
  2 MARESME     506901      343637    67.79174
  3  SEGRIA     233615      156648    67.05391
  4   SELVA     191432      122082    63.77304
  </pre>
* Note that, in the above line, R applied the recycling rule, reusing the
  value `100` to multiply every value in the vector resulting from the
  division operation.

---

## Combining datasets

* Combining datasets of different dimensions is one of the most common
  operations performed for answering questions from data.

![:scale 50%](data:image/png;base64,#img/ExampleNaturalJoin.png)

* In this example, once we combine the city population with
  the absolute incidence, we can now calculate the
  [cumulative incidence](https://en.wikipedia.org/wiki/Incidence_%28epidemiology%29)
  per 100,000 inhabitants:
  <pre>
  > dat$CASESPER100K <- 100000 * dat$CASES / dat$POPULATION
  </pre>

---

## Combining datasets: `match()`

* `match()`: returns a vector of the positions of (first) matches
  of its first argument in its second.
  <pre>
  > datinc <- data.frame(CITY=c("Barcelona", "Barcelona",
                                  "Girona", "Girona"),
                           SEX=c("Male", "Female", "Male", "Female"),
                           CASES=c(178, 151, 58, 49))
  > datpop <- data.frame(CITY=c("Barcelona", "Girona", "Lleida"),
                           POPULATION=c(1664182, 103369, 140403))
  > mt <- match(datinc$CITY, datpop$CITY)
  > mt
  [1] 1 1 2 2
  > datinc$POPULATION <- datpop$POPULATION[mt]
  > datinc
           CITY    SEX CASES POPULATION
  1 Barcelona   Male   178    1664182
  2 Barcelona Female   151    1664182
  3    Girona   Male    58     103369
  4    Girona Female    49     103369
  </pre>

---

## Combining datasets: `merge()`

* `merge()`: combine two `data.frame` objects by common columns or row names,
  or do other versions of database [_join_](https://en.wikipedia.org/wiki/Relational_algebra#Joins_and_join-like_operators) operations. We can specify different column
  names for _joining_ using the arguments `by.x` and `by.y`.
  <pre>
  > merge(datinc, datpop)
           CITY    SEX CASES POPULATION
  1 Barcelona   Male   178    1664182
  2 Barcelona Female   151    1664182
  3    Girona   Male    58     103369
  4    Girona Female    49     103369
  </pre>

---

## Combining datasets: `merge()`

* Parameters `all`, `all.x` and `all.y` in the `merge()` function control
  the way in which non-matching rows are returned, and their possible settings
  correspond to different type of _join operations_.  
  &nbsp;&nbsp;
![](data:image/png;base64,#img/mergeJoins.jpg)
.footer[Image from [R for Dummies, _How to use the merge() function with data sets in R_.](https://www.dummies.com/book/technology/programming-web-design/r/r-for-dummies-2nd-edition-281846)]
  <pre>
  > merge(datinc, datpop, all=TRUE)
           CITY    SEX CASES POPULATION
  1 Barcelona   Male   178    1664182
  2 Barcelona Female   151    1664182
  3    Girona   Male    58     103369
  4    Girona Female    49     103369
  5    Lleida   &lt;NA&gt;    NA     140403
  </pre>

---

## Concluding remarks

* Analyzing data involves so-called
  [data wrangling](https://en.wikipedia.org/wiki/Data_wrangling)
  steps in which we may have to reorder data, paste rows and columns,
  combine datasets or add new columns.  
  &nbsp;&nbsp;
* The recycling rule is a powerful feature, but beware of situations
  in which it may be unintentionally applied.  
  &nbsp;&nbsp;
* It is important to understand the difference between _sorting_ and
  _permuting_, which we can do in R with the functions `sort()` and `order()`.  
  &nbsp;&nbsp;
* Pasting rows and columns is straightforward with `rbind()` and `cbind()`.  
  &nbsp;&nbsp;
* Most of the times, we will be interested in the _natural join_ operation,
  in which we combine rows from two datasets with **matching values in a common
  column**. This can be achieved either with the `match()` function or the
  `merge()` function with default parameters.