class: title-slide, middle, center

# Automation

## Robert Castelo
[robert.castelo@upf.edu](mailto:robert.castelo@upf.edu)

### Dept. of Medicine and Life Sciences
### Universitat Pompeu Fabra

<br>

## Fundamentals of Computational Biology
### BSc on Human Biology
### UPF School of Medicine and Life Sciences
### Academic Year 2024-2025

---

## Why should we _automate_ data analysis?

.left-column[

* Data analysis can be done in myriad ways.

* As important as choosing the right way to analyse data is being able to *re-run* the analysis in different ways.

* Re-running data analyses is essential for enabling reproducibility.

* _Manual and interactive intervention_ during data analysis increases the cost of *re-running* it and compromises its reproducibility.

* To facilitate re-running data analyses we need to **automate** them.

]

.right-column[

<p style="margin-top:-20px">

![:scale 75% ](data:image/png;base64,#img/automation.png)

![:scale 75% ](data:image/png;base64,#img/main-control.gif)

]

.footer[

Top image from [xkcd](https://xkcd.com/1319), bottom image from Brian Schilder at [GitHub](https://github.com/neurogenomics/rworkflows/discussions/93).

]

---

## Data analysis pipelines

.left-column[

* A data analysis pipeline is a chain of data analysis steps including, but not limited to:
  * Downloading/acquiring data.
  * Extracting/transforming data.
  * Cleaning data.
  * Standardizing data.
  * Exploring/summarizing data.
  * Modeling data.
  * Reporting results.

]

.right-column[

![:scale 40%](data:image/png;base64,#img/dataanalysispipeline.png)

]

---

## Make and Makefiles

* In the context of software development, certain programming languages, such as Fortran, C or C++, require their source code to be [_compiled_](https://en.wikipedia.org/wiki/Compiler) to obtain an [executable program](https://en.wikipedia.org/wiki/Executable).

* A single executable may be the result of _compiling_ multiple files of source code with many, often intricate, dependencies among them.

* In that context, [Stuart Feldman](https://en.wikipedia.org/wiki/Stuart_Feldman), a researcher at Bell Labs, developed in 1976 a program called [_make_](https://en.wikipedia.org/wiki/Make_%28software%29), which would execute the compiling instructions in the required order, according to a set of rules described in a text file called a [_Makefile_](https://en.wikipedia.org/wiki/Makefile).

* The software _make_ and _Makefiles_ are still intensively used today in software development to automate the process of producing a single executable program from a set of source code files.

* Likewise, _make_ and _Makefiles_ can be used to automate data analysis pipelines; see the blog post ["Why Use Make"](https://bost.ocks.org/mike/make) by [Mike Bostock](https://en.wikipedia.org/wiki/Mike_Bostock), former head of data-visualization projects at The New York Times and current CTO at [Observable](https://observablehq.com).

---

## Make and Makefiles

* To work with _make_ and _Makefiles_ you first need to identify the dependencies among your files, and then express those dependencies _backwards_ as rules in a file called **`Makefile`**, using a specific syntax.
.left-column[

<pre style="font-size:80%; margin-left:20px;">
all : analysis.html

analysis.html : analysis.Rmd processed_data.csv
	Rscript -e 'rmarkdown::render("analysis.Rmd")'

processed_data.csv : raw_data.csv
	python process_data.py raw_data.csv
</pre>

* Finally, to run _make_ according to the rules given in the _Makefile_, you need to type in the Unix shell:

<pre style="font-size:80%; margin-left:20px;">
$ make
</pre>

* The software _make_ will look up which files have _changed_ and trigger the corresponding rules accordingly (see the worked example in the appendix).

]

.right-column[

<p style="margin-left:100px;">

![:scale 50%](data:image/png;base64,#img/dataanalysispipelinebackwards.png)

]

---

## Make and Makefiles

* The syntax in a _Makefile_ is the following:

<pre>
all : target-file

target-file : prerequisites
<--TAB-->action-command-to-produce-the-target-file
</pre>

* Note that **there is a TAB character** before the _action command_ associated with a rule. Spaces **will not work**; you really need to put a TAB character.

* The `all` rule may have more than one file as a prerequisite and is the rule that _make_ will look up first. However, you can also call _make_ with a `target-file` as argument, and it will only execute the rules needed to obtain that target, e.g.:

<pre>
$ make target-file
</pre>

* There are additional ways to express rules, useful for instance when you have a large number of target files and want to avoid writing a rule for each of them (see the pattern-rule sketch in the appendix). You can find a more comprehensive tutorial by [Karl Broman](https://www.biostat.wisc.edu/staff/broman-karl) [here](https://kbroman.org/minimal_make).

---

## Concluding remarks

* _make_ and _Makefiles_ can help us automate our data analysis pipeline.

* You can find further materials on using _make_ and _Makefiles_ in this [automation chapter](https://stat545.com/automation-overview.html) from a course on data science by [Jenny Bryan](https://jennybryan.org) and in this [workshop material](https://swcarpentry.github.io/make-novice) by [Software Carpentry](https://software-carpentry.org).

* Further topics on automation that we have not covered here are:

  * [Unit testing](https://en.wikipedia.org/wiki/Unit_testing) for automatically checking that our pipeline behaves as expected. In Python we have the module [pytest](https://docs.pytest.org/en/stable/getting-started.html) for that purpose, while in R we have the packages [RUnit](https://cran.r-project.org/web/packages/RUnit/index.html) and [testthat](https://testthat.r-lib.org).

  * [Continuous integration (CI)](https://en.wikipedia.org/wiki/Continuous_integration) for triggering automated unit tests each time we update our pipeline. Some of the most widely used CI environments are [Travis CI](https://travis-ci.org) and [GitHub Actions](https://github.com/features/actions).
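
---

## Appendix: what does _make_ re-run?

* _make_ compares the timestamp of each target with those of its prerequisites and re-runs a rule only when a prerequisite is newer than its target. A minimal worked sketch, assuming the `Makefile` shown earlier and that `raw_data.csv` has just been edited (the `touch` below stands in for that edit):

<pre style="font-size:80%; margin-left:20px;">
$ touch raw_data.csv    # raw_data.csv is now newer than processed_data.csv
$ make                  # make echoes each action command before running it
python process_data.py raw_data.csv
Rscript -e 'rmarkdown::render("analysis.Rmd")'
$ make                  # nothing changed since the last run
make: Nothing to be done for 'all'.
</pre>

* Only the rules downstream of the changed file are triggered; if instead `analysis.Rmd` had been edited, only the rendering step would run. The exact "nothing to be done" message may vary with your version of _make_.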
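
---

## Appendix: pattern rules (sketch)

* When many targets are built in the same way, a single _pattern rule_ can replace one rule per file. A minimal sketch, assuming hypothetical reports `report1.Rmd`, `report2.Rmd` that should each be rendered to HTML:

<pre style="font-size:80%; margin-left:20px;">
all : report1.html report2.html

# '%' matches the common stem of target and prerequisite; the
# automatic variable '$<' expands to the first prerequisite.
%.html : %.Rmd
	Rscript -e 'rmarkdown::render("$<")'
</pre>

* With this rule, `make report1.html` renders `report1.Rmd`, and `make` alone brings both reports up to date. Pattern rules and automatic variables are features of GNU _make_; see Karl Broman's tutorial for details.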