Reproducible Research: What to Do When Your Results Can't Be Reproduced

Estimated time:

time

min

Even the most sophisticated machine learning methods, the most beautiful visualizations, and perfect datasets can be useless if you ignore the context of your code execution. In this article, I will describe a few situations that can destroy your reproducibility. Then I will explain how to resolve these problems and avoid them in the future. <h3 id="3-danger-zones">3 danger zones</h3> There are three main areas where things can go wrong if you don’t pay attention: <ol><li>R session context</li><li>Operating System (OS) context</li><li>Data versioning</li></ol> In this blog post, I will focus on the first two areas. They are quite similar because both are related to the software context of research execution. While more and more data scientists pay attention to what happens in their R session, not many are thinking about OS context, which can also significantly influence their research. Data versioning is a huge topic itself and I will cover it in a separate article in the future. For now, let’s examine some examples of where things can go wrong when you have different R session contexts or various Operating System configurations. <h3 id="r-session-context">R session context</h3> The basic elements of an R session context that data scientists should be aware of are: <ul><li>R version</li><li>Packages versions</li><li>Using <code class="highlighter-rouge">set.seed()</code> for a reproducible randomization</li><li>Floating-point accuracy</li></ul> <h3>R version</h3> Not many monitor <a href="https://cran.r-project.org/src/base/NEWS" target="_blank" rel="noopener noreferrer">this R language changelog</a> on a regular basis, but sometimes crucial changes are made and can lead to a failure in your code. One such situation is described <a href="http://stackoverflow.com/questions/32234120/different-results-for-auto-arima-with-d-1-in-r-version-3-2-0-and-r-version-3-2-2/32234317#32234317" target="_blank" rel="noopener noreferrer">here</a>. Change in R(3.2.1) for “nested arima model has higher log-likelihood”, led to an inconsistency of results for arima calculations with d >= 1. <h3>Packages versions</h3> Now, let’s see an example with a changing package version. I am going to use <code class="highlighter-rouge">dplyr</code> in versions <code class="highlighter-rouge">0.4.0</code> and <code class="highlighter-rouge">0.5.0</code>: dplyr 0.4.0 <figure class="highlight"> <pre><code class="language-r" data-lang="r">df <- data_frame(x = c(1, 1, 1, 2, 2), y = 1:5) result <- df %>% dplyr::distinct(x) %>% sum # result == 8</code></pre> </figure> dplyr 0.5.0 <figure class="highlight"> <pre><code class="language-r" data-lang="r">df <- data_frame(x = c(1, 1, 1, 2, 2), y = 1:5) result <- df %>% dplyr::distinct(x) %>% sum # result == 3</code></pre> </figure> Wow! Version 0.5.0 introduced functionality-breaking changes and our result is now completely different. set.seed() When your code uses randomization and you want to have reproducible results, it is important to remember about setting a seed. <figure class="highlight"> <pre><code class="language-r" data-lang="r">> sample(1:10, 5) [1] 6 7 2 5 9 > sample(1:10, 5) # Different result [1] 5 7 8 2 3</code></pre> </figure> …and with setting seed: <figure class="highlight"> <pre><code class="language-r" data-lang="r">> set.seed(42) > sample(1:10, 5) [1] 10 9 3 6 4 > set.seed(42) > sample(1:10, 5) # Result is the same [1] 10 9 3 6 4</code></pre> </figure> Random numbers returned by the generator are in fact a sequence of numbers that can be reproduced. Setting a seed informs the generator where to start. <h3>Floating-point accuracy</h3> Pay attention to how expressions on floating-point numbers are implemented. Let’s assume you want to implement an algorithm described in a recently published scientific paper. You get unexpected results in some cases and now try to debug it. And you eventually get this: <figure class="highlight"> <pre><code class="language-r" data-lang="r">> x <- 3 > y <- 2.9 > x - y == 0.1 [1] FALSE</code></pre> </figure> What? It can be solved by testing for “near equality”: <figure class="highlight"> <pre><code class="language-r" data-lang="r">> x <- 3 > y <- 2.9 > all.equal(x - y, 0.1) [1] TRUE</code></pre> </figure> More about floating-point arithmetic can be <a href="https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html" target="_blank" rel="noopener noreferrer">found here</a>. <h3 id="operating-system-context">Operating System Context</h3> When your research is shared with others and you can’t predict which operating system someone is going to use, many things can go wrong. Any difference in one of the following elements: <ul><li>System packages versions</li><li>System locale</li><li>Environment variables</li></ul> can result in different execution results. Many R packages depend on system packages. You can view it in a package DESCRIPTION file under the parameter <code class="highlighter-rouge">SystemRequirements</code>. Let’s look at the example with the <code class="highlighter-rouge">png</code> package and running it on two different operating systems. Package <code class="highlighter-rouge">png</code> DESCRIPTION file: <figure class="highlight"> <pre><code class="language-bash" data-lang="bash">Package: png Version: 0.1-7 ... SystemRequirements: libpng ...</code></pre> </figure> Let’s assume we wrote a custom function <code class="highlighter-rouge">savePlotToPNG</code> that saves a plot to a file. We want to write a unit test for it and the idea is to compare produced result with a previously generated reference image. We generate reference plots on Ubuntu 16.04: <figure class="highlight"> <pre><code class="language-r" data-lang="r">reference.png.path <- file.path("referencePlots", "ref0001.png") savePlotToPNG(reference.png.path)</code></pre> </figure> …and then we run tests on Debian 8 Jessie. We use exactly the same version of <code class="highlighter-rouge">png</code> package: <figure class="highlight"> <pre><code class="language-r" data-lang="r">reference.png.path <- file.path("referencePlots", "ref0001.png") test.plot.path <- tempfile(fileext = ".png") savePlotToPNG(test.plot.path) identical(x = png::readPNG(test.plot.path), y = png::readPNG(reference.png.path)) # Error! libpng system package is different on Ubuntu 16.04</code></pre> </figure> <img src="/blog-old/assets/article_images/2017-03-28-reproducible-research/boom.png" alt="" /> …plots don’t match! This is caused by different <code class="highlighter-rouge">libpng</code> system package versions. <h3>Locale and system variables</h3> Ubuntu: <figure class="highlight"> <pre><code class="language-r" data-lang="r">> x <- c("-110", "-1", "-10", "10", "0", "110", "1") > sort(x) [1] "0" "1" "-1" "10" "-10" "110" "-110"</code></pre> </figure> Windows: <figure class="highlight"> <pre><code class="language-r" data-lang="r">> x <- c("-110", "-1", "-10", "10", "0", "110", "1") > sort(x) [1] "-1" "-10" "-110" "0" "1" "10" "110"</code></pre> </figure> <img src="/blog-old/assets/article_images/2017-03-28-reproducible-research/boom.png" alt="" /> …okayyy :) String sorting in R is based on the locale which is different for Windows and Linux systems. Read more about setting locales <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/locales.html" target="_blank" rel="noopener noreferrer">here</a>. <h3 id="tools-for-r-session-level-reproducibility">Tools for R session-level reproducibility</h3> Data scientists should be aware of the traps described above. These are the foundations of reproducible research. Below there is a list of tools that support managing packages versions: <a href="https://rstudio.github.io/packrat/" target="_blank" rel="noopener noreferrer">Packrat</a> - a popular package that allows you to create and manage a private library attached to your project. <a href="https://github.com/gmbecker/switchr" target="_blank" rel="noopener noreferrer">Switchr</a> - handy abstraction over <code class="highlighter-rouge">.libPaths(){:target="_blank"}</code>, <code class="highlighter-rouge">lib.loc</code> and other lower-level library settings. Helpful in creating and managing libraries. <a href="https://mran.microsoft.com/documents/rro/reproducibility/" target="_blank" rel="noopener noreferrer">Checkpoint</a> - helpful when you need to find older versions of a specific package. <h3 id="tools-for-os-level-reproducibility">Tools for OS-level reproducibility</h3> The best approach is to do your research in an isolated environment, that can be shared as an image. Inside of this environment, you can also use any tool for R session-level reproducibility, like Packrat, Switchr, or Checkpoint. To create an isolated environment you can use virtualization or containerization. There are many tools that can be used for that: <a href="https://www.vagrantup.com/" target="_blank" rel="noopener noreferrer">Vagrant</a>, <a href="https://www.docker.com/" target="_blank" rel="noopener noreferrer">Docker</a>, <a href="https://coreos.com/rkt/" target="_blank" rel="noopener noreferrer">rkt</a>, and many more. I recommend using Docker because it’s easy to use and has good community support. Example structure of your <code class="highlighter-rouge">~/project</code>: <figure class="highlight"> <pre><code class="language-r" data-lang="r">Dockerfile install_libraries.R R/ <-- contains your research R scripts</code></pre> </figure> Let’s look at a sample <code class="highlighter-rouge">Dockerfile</code> that describes the steps necessary for the recreation of an isolated environment: UPDATE: I simplified Dockerfile according to Carl Boettiger's comment. <figure class="highlight"> <pre><code class="language-bash" data-lang="bash">FROM rocker/rstudio:3.3.2 MAINTAINER Paweł Przytuła "join@wordpress.appsilon.com" COPY install_libraries.R /code/install_libraries.R RUN R -f /code/install_libraries.R</code></pre> </figure> <code class="highlighter-rouge">install_libraries.R</code> contains code that reproduces an R environment. It can be either some Packrat code or something like this: <figure class="highlight"> <pre><code class="language-r" data-lang="r">install.packages("devtools") devtools::install_version("dplyr", version = "0.5.0") devtools::install_version("stringr", version = "1.1.0") devtools::install_version("forecast", version = "7.3")</code></pre> </figure> After that it’s enough to run the following command to build an image: <figure class="highlight"> <pre><code class="language-bash" data-lang="bash">docker build -t myimagename:1.0.0 .</code></pre> </figure> When an image is ready you should either save it as a file using <code class="highlighter-rouge">docker save -o myimagename.img myimagename:1.0.0</code> or upload it to <a href="https://hub.docker.com/" target="_blank" rel="noopener noreferrer">Docker Hub</a> - <a href="https://docs.docker.com/engine/getstarted/step_six/" target="_blank" rel="noopener noreferrer">steps described here</a>. Using a previously saved Docker image you can start a container with the following: <figure class="highlight"> <pre><code class="language-bash" data-lang="bash">docker run -d -p 8787:8787 -v ~/project/R:/R myimagename:1.0.0</code></pre> </figure> Because the image is based on <code class="highlighter-rouge">rocker/rstudio</code>, you can access an RStudio session via <code class="highlighter-rouge">http://localhost:8787</code> (login <code class="highlighter-rouge">rstudio</code>, password <code class="highlighter-rouge">rstudio</code>). Your R/ directory will be available inside the running container via <code class="highlighter-rouge">/R</code> path. Awesome, isn’t it?! With a Docker image like this, you have control over the whole software context needed for reproducible research. You can easily extend it with more packages and share it without fear of losing reproducibility. I recommend this approach in any data science project, in which results must be reproduced on more than one machine and also when working in a bigger data science team.