class: title-slide, middle, center

# Reproducibility with <br>literate programming

## Robert Castelo
[robert.castelo@upf.edu](mailto:robert.castelo@upf.edu)
### Dept. of Medicine and Life Sciences
### Universitat Pompeu Fabra
<br>
## Fundamentals of Computational Biology
### BSc on Human Biology
### UPF School of Medicine and Life Sciences
### Academic Year 2024-2025

---

## Reproducibility

.left-column[
* Scientific discoveries never arise out of thin air; they are always built upon previous scientific knowledge.

* Scientific knowledge is based on the premise that scientifically proven facts or theories can be [reproduced](https://en.wikipedia.org/wiki/Reproducibility) (observed or verified again and again).

![:scale 100%](data:image/png;base64,#img/reproducibility_opener.jpg)
]
.right-column[
![:scale 70%](data:image/png;base64,#img/the_difference.png)
]

.footer[
Left image from [ScienceNews](https://www.sciencenews.org/article/redoing-scientific-research-best-way-find-truth), right image from [xkcd](https://xkcd.com/242).
]

---

## Reproducibility

.left-column[
* In research involving computational means (data and software), reproducing a result can also be challenging.

![:scale 90%](data:image/png;base64,#img/KimDumas18fig1.jpg)
]

--

.right-column[
<p style="margin-top: -20px">

* It should, however, always be feasible! [(Claerbout and Karrenbach, 1992)](https://library.seg.org/doi/abs/10.1190/1.1822162).

* Within the reproducibility spectrum, scientific and industry standards are evolving to increase the level of reproducibility.

![:scale 90%](data:image/png;base64,#img/Peng11fig1.jpg)
]

.footer[
Left image from [(Kim et al., 2018)](https://dx.doi.org/10.1093%2Fgigascience%2Fgiy077), right image from [(Peng, 2011)](https://dx.doi.org/10.1126/science.1213847).
]

---

## Literate programming

* [Literate programming (LP)](https://en.wikipedia.org/wiki/Literate_programming) is a paradigm in which programming statements are interspersed with documentation.
* LP was introduced in 1984 by [Donald Knuth](https://en.wikipedia.org/wiki/Donald_Knuth), who developed the so-called [WEB System](https://en.wikipedia.org/wiki/Web_%28programming_system%29).

![](data:image/png;base64,#img/WebSystem.png)

* An LP tool produces two files: the _tangled_ source code of the computer program and the whole document, including the text _woven_ with the code, formatted for display or printing.

---

## Literate programming

* There are many LP tools for different programming languages. Some of them are:

| Name             | Programming language | Result |
|------------------|----------------------|--------|
| WEB              | Pascal               | PDF    |
| CWEB             | C and C++            | PDF    |
| NoWEB            | any                  | PDF    |
| Sweave           | R                    | PDF    |
| Jupyter Notebook | Python               | HTML   |
| R Markdown       | R                    | HTML   |

* Currently, [Jupyter Notebooks](https://jupyter.org) and [R Markdown](https://rmarkdown.rstudio.com) are two of the most widely used LP tools for doing reproducible analyses on the computer.

* The rest of this presentation briefly introduces R Markdown; the R Markdown [website by RStudio](https://rmarkdown.rstudio.com) contains very good learning materials on how to get started and how to use most of its features.

---

## R Markdown

* An R Markdown document starts with a [YAML](https://en.wikipedia.org/wiki/YAML) header:

<pre>
---
title: Report on this and that
subtitle: FCB 2023
author: Me
date: November 10th, 2023
output: html_document
---
</pre>

---

## R Markdown

* The YAML header allows one to make global formatting choices:

<pre>
---
title: Report on this and that
subtitle: FCB 2023
author: Me
date: November 10th, 2023
output:
  html_document:
    toc: true
    toc_float: true
---
</pre>

---

## R Markdown

* The rest of the R Markdown document is Markdown text with R code specified as follows:

````
```{r}
dat <- read.csv("mydata.csv")
```
````

* Here the letter `r` indicates that the **code chunk** contains R code.

* Code chunks may have options to control how they are processed by the R Markdown engine.
For instance, `eval` controls whether the code chunk should be executed:

````
```{r, eval=FALSE}
dat <- read.csv("mydata.csv")
```
````

* The code chunk option `echo` can hide the source R code from the output.

````
```{r, echo=FALSE}
dat <- read.csv("mydata.csv")
```
````

---

## R Markdown

* The code chunk option `results` can be used to hide the text output of the code chunk from the processed R Markdown output.

````
```{r, results="hide"}
head(dat) ## output from this line will not be seen
```
````

* When displaying graphics, the options `fig.width` and `fig.height` control the size (and thus the aspect ratio) of the generated figure, `out.width` and `out.height` control its size in the output document, and `dpi`, `fig.align` and `fig.cap` control its resolution, alignment and caption text, respectively.

````
```{r, fig.height=5, out.height="300px", dpi=100, fig.align="center", fig.cap="In this figure we show axis y as a function of axis x."}
plot(dat$x, dat$y)
```
````

---

class: small-table

## R Markdown

* Tables of results can be included by storing them as `data.frame` objects and using the function `kable()` from the package [knitr](https://cran.r-project.org/package=knitr) together with the code chunk option `results="asis"`.

.left-column[
<pre>
```{r, echo=FALSE, results="asis"}
library(knitr)
kable(head(iris[, 1:3]), caption="Iris data.")
```
</pre>
]
.right-column[
<p style="margin-top:-35px">

Table: Iris data.

| Sepal.Length| Sepal.Width| Petal.Length|
|------------:|-----------:|------------:|
|          5.1|         3.5|          1.4|
|          4.9|         3.0|          1.4|
|          4.7|         3.2|          1.3|
|          4.6|         3.1|          1.5|
|          5.0|         3.6|          1.4|
|          5.4|         3.9|          1.7|
]

* The list of all available chunk options can be found [here](https://yihui.org/knitr/options).

---

## R Markdown

* Example of an R Markdown document converted into HTML and displayed in the browser (default sample R Markdown document in RStudio).
.left-column[
![:scale 90%](data:image/png;base64,#img/examplermd.png)
]
.right-column[
![:scale 90%](data:image/png;base64,#img/exampleknitrmd.png)
]

---

## Concluding remarks

* The lack of reproducibility of results hampers the advancement of science.

* Reproducibility in data analysis is a minimum standard towards the end goal of full replicability.

* There is a rich software toolkit for reproducible research: literate programming, version control systems, data and code repositories, unit testing, containerization, etc.

* There is an increasing demand for scientists and professionals with computational skills for reproducible data analysis.

--

![:scale 50%](data:image/png;base64,#img/keithbaggerlyquote.png)