class: title-slide, middle, center

# Automation

## Robert Castelo
[robert.castelo@upf.edu](mailto:robert.castelo@upf.edu)

### Dept. of Medicine and Life Sciences
### Universitat Pompeu Fabra

<br>

## Fundamentals of Computational Biology
### BSc on Human Biology
### UPF School of Medicine and Life Sciences
### Academic Year 2024-2025

---

## Why should we _automate_ data analysis?

.left-column[

* Data analysis can be done in myriad ways.

* As important as choosing the right way to analyse data is being able to *re-run* the analysis in different ways.

* Re-running data analyses is essential for enabling reproducibility.

* _Manual and interactive intervention_ during data analysis increases the cost of *re-running* it and compromises its reproducibility.

* To facilitate re-running data analyses we need to **automate** them.

]

.right-column[

<p style="margin-top:-20px">

![:scale 75% ](data:image/png;base64,#img/automation.png)

![:scale 75% ](data:image/png;base64,#img/main-control.gif)

]

.footer[

Top image from [xkcd](https://xkcd.com/1319), bottom image from Brian Schilder at [GitHub](https://github.com/neurogenomics/rworkflows/discussions/93).

]

---

## Data analysis pipelines

.left-column[

* A data analysis pipeline is a chain of data analysis steps including, but not limited to:
  * Downloading/acquiring data.
  * Extracting/transforming data.
  * Cleaning data.
  * Standardizing data.
  * Exploring/summarizing data.
  * Modeling data.
  * Reporting results.

]

.right-column[

![:scale 40%](data:image/png;base64,#img/dataanalysispipeline.png)

]

---

## Make and Makefiles

* In the context of software development, certain programming languages, such as Fortran, C or C++, require their source code to be [_compiled_](https://en.wikipedia.org/wiki/Compiler) to obtain an [executable program](https://en.wikipedia.org/wiki/Executable).

* A single executable may be the result of _compiling_ multiple files of source code with many, often intricate, dependencies among them.

* In that context, [Stuart Feldman](https://en.wikipedia.org/wiki/Stuart_Feldman), a researcher at Bell Labs, developed in 1976 a program called [_make_](https://en.wikipedia.org/wiki/Make_%28software%29), which would execute the compiling instructions in the required order, according to a set of rules described in a text file called a [_Makefile_](https://en.wikipedia.org/wiki/Makefile).

* The software _make_ and _Makefiles_ are still intensively used today in software development to automate the process of producing a single executable program from a set of source code files.

* Likewise, _make_ and _Makefiles_ can be used to automate data analysis pipelines; see the blog post ["Why Use Make"](https://bost.ocks.org/mike/make) by [Mike Bostock](https://en.wikipedia.org/wiki/Mike_Bostock), former head of data-visualization projects at The New York Times and current CTO at [Observable](https://observablehq.com).

---

## Make and Makefiles

* To work with _make_ and _Makefiles_ you first need to identify the dependencies among your files, and then express those dependencies _backwards_ as rules in a file called **`Makefile`**, using a specific syntax.
.left-column[

<pre style="font-size:80%; margin-left:20px;">
all : analysis.html

analysis.html : analysis.Rmd processed_data.csv
	Rscript -e 'rmarkdown::render("analysis.Rmd")'

processed_data.csv : raw_data.csv
	python process_data.py raw_data.csv
</pre>

* Finally, to run _make_ according to the rules given in the _Makefile_, you need to type in the Unix shell:

<pre style="font-size:80%; margin-left:20px;">
$ make
</pre>

* The software _make_ will look up which files have _changed_ and trigger the corresponding rules accordingly (see the worked example in the appendix).

]

.right-column[

<p style="margin-left:100px;">

![:scale 50%](data:image/png;base64,#img/dataanalysispipelinebackwards.png)

]

---

## Make and Makefiles

* The syntax in a _Makefile_ is the following:

<pre>
all : target-file

target-file : prerequisites
<--TAB-->action-command-to-produce-the-target-file
</pre>

* Note that **there is a TAB character** before the _action command_ associated with a rule. Spaces **will not work**; you really need to put a TAB character.

* The `all` rule may have more than one file as a prerequisite and is the rule that _make_ will look up first. However, you can also call _make_ with a `target-file` as argument, and it will only execute the rules needed to obtain that target, e.g.:

<pre>
$ make target-file
</pre>

* There are additional ways to express rules, useful for instance when you have a large number of target files and want to avoid writing a rule for each of them (see the pattern-rule sketch in the appendix). You can find a more comprehensive tutorial by [Karl Broman](https://www.biostat.wisc.edu/staff/broman-karl) [here](https://kbroman.org/minimal_make).

---

## Concluding remarks

* _make_ and _Makefiles_ can help us automate our data analysis pipeline.

* You can find further materials on using _make_ and _Makefiles_ in this [automation chapter](https://stat545.com/automation-overview.html) from a course on data science by [Jenny Bryan](https://jennybryan.org) and in this [workshop material](https://swcarpentry.github.io/make-novice) by [Software Carpentry](https://software-carpentry.org).

* Further topics on automation that we have not covered here are:

  * [Unit testing](https://en.wikipedia.org/wiki/Unit_testing) for automatically checking that our pipeline behaves as expected. In Python we have the module [pytest](https://docs.pytest.org/en/stable/getting-started.html) for that purpose, while in R we have the packages [RUnit](https://cran.r-project.org/web/packages/RUnit/index.html) and [testthat](https://testthat.r-lib.org).

  * [Continuous integration (CI)](https://en.wikipedia.org/wiki/Continuous_integration) for triggering automated unit tests each time we update our pipeline. Some of the most widely used CI environments are [Travis CI](https://travis-ci.org) and [GitHub Actions](https://github.com/features/actions).
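
---

## Appendix: what does _make_ re-run?

* _make_ compares the timestamp of each target with those of its prerequisites and re-runs a rule only when a prerequisite is newer than its target. A minimal worked sketch, assuming the `Makefile` shown earlier and that `raw_data.csv` has just been edited (the `touch` below stands in for that edit):

<pre style="font-size:80%; margin-left:20px;">
$ touch raw_data.csv    # raw_data.csv is now newer than processed_data.csv
$ make                  # make echoes each action command before running it
python process_data.py raw_data.csv
Rscript -e 'rmarkdown::render("analysis.Rmd")'
$ make                  # nothing changed since the last run
make: Nothing to be done for 'all'.
</pre>

* Only the rules downstream of the changed file are triggered; if instead `analysis.Rmd` had been edited, only the rendering step would run. The exact "nothing to be done" message may vary with your version of _make_.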
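
---

## Appendix: pattern rules (sketch)

* When many targets are built in the same way, a single _pattern rule_ can replace one rule per file. A minimal sketch, assuming hypothetical reports `report1.Rmd`, `report2.Rmd` that should each be rendered to HTML:

<pre style="font-size:80%; margin-left:20px;">
all : report1.html report2.html

# '%' matches the common stem of target and prerequisite; the
# automatic variable '$<' expands to the first prerequisite.
%.html : %.Rmd
	Rscript -e 'rmarkdown::render("$<")'
</pre>

* With this rule, `make report1.html` renders `report1.Rmd`, and `make` alone brings both reports up to date. Pattern rules and automatic variables are features of GNU _make_; see Karl Broman's tutorial for details.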