Reproducible data analysis

The primary goal of reproducible data analysis is to ensure computational reproducibility — that is, the ability of another researcher to use one’s code and data to independently obtain identical results. However, the fact that a particular analysis is reproducible does not mean that it is correct.  This is related to the oft-noted distinction in statistics between reliability (which refers to the consistency or precision of the results) and validity (which refers to the accuracy of the results).  Thus, a second goal of reproducible data analysis is to ensure that the results generated by the code are correct. An important aspect of this is to validate the code using software development practices that prevent errors and software testing methods that can help detect them when they occur.

An additional goal is to allow tracking of the provenance of any particular result (figure, table, or other result), which refers to the process through which that result was obtained.  One way to think of this is to imagine the path from a particular result described in a manuscript back to the original data, describing all of the code as well as any intermediate results that are used in generating the published result.  If you have ever struggled to reconstruct exactly how a figure was generated after publishing a paper, then you understand the importance and challenge of provenance tracking.

Committing to reproducibility

There are a number of coding practices that will help enhance reproducibility. While it takes a bit of time to learn and implement these practices, they will pay off in the long run. As the Software Carpentry authors say:

The good news is, doing these things will speed up our programming, not slow it down. As in real carpentry — the kind done with lumber — the time saved by measuring carefully before cutting a piece of wood is much greater than the time that measuring takes. (https://swcarpentry.github.io/python-novice-inflammation/10-defensive/index.html)

There are several global decisions that one can make that will make it easier to work in a reproducible way.  These are admittedly opinionated views that may be differently applicable in different research domains.

  • Use only free/open-source software whenever possible. This makes it easier for anyone else to reproduce your work without needing to buy particular software.

    • In some fields, commercial software platforms are standard.  When free/open source alternatives are available (e.g. Octave for MATLAB), try to ensure that code is compatible with those alternatives whenever possible.

  • Minimize manual analysis steps.  Any analysis operation that can be automated should be automated, using some form of script.  Never perform manual data reorganization or file renaming operations.

  • Commit to a standard organization scheme.  Using a standard scheme may sometimes require a bit more work in the short term, but will have significant payoffs in the longer term.

  • Commit to improving your skills as a software developer.  Software development is a set of skills that must be learned, just like a new spoken language or musical instrument.  The best way to advance one’s skills is through consistent and deliberate practice.  Commit to reading the materials on software engineering that are outlined in the Resource section below, and working to implement those techniques in your coding practices.

Prerequisites

In order to get started with reproducible data analysis, you will need to understand several topics:

  • Version control (described in the section on Code Sharing)

  • Using the command line (described in the section on Basic Skills)

Getting started

Step 1: Create a reproducible environment

The environment comprises all of the software components that are necessary to perform a particular operation.  This includes the code and data as well as any dependencies (such as software libraries) that are necessary to run the code.

There are two levels of reproducibility that one might aim for in their software environment.  The first is the ability to reproduce the environment on one’s own system.  This can be achieved by generating a virtual environment, which is a configuration of one’s system that can be loaded or unloaded as needed. For Python, one can use the Anaconda software distribution to create and manage virtual environments; one can also use Anaconda to install and manage virtual environments for R, but this can cause problems for RStudio users.  For RStudio users, a better solution is the renv package, which allows one to snapshot and restore package versions.

The second level of reproducibility is to allow someone else to implement the same environment on a different computer.  There are two ways to potentially address this.

  1. One can simply record all of the dependencies that are installed on one’s system. This will allow someone else with exactly the same operating system to reproduce the development environment, but it will not ensure that operating system libraries are the same across systems.  This matters because there is evidence that analytic results can vary across operating systems.

    • Anaconda: conda list prints all installed packages and versions.

    • Python: pip freeze prints all installed packages and versions (see the sketch after this list for recording them to a file).

    • R: Use installed.packages(), devtools::package_info(), or renv to create a snapshot of current package versions.

  2. A generally more effective way to allow someone else to reproduce one’s environment is to use containers.  A container emulates a virtual computer within one’s own computer; it is fully configurable and allows one to almost exactly reproduce an environment across different computers.  The most commonly used system for containerization is Docker; a related system called Singularity is used on shared computer systems such as clusters.  See the FAQ below for more on how to set up and use Docker and Singularity.
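
As a minimal illustration of the first approach, the following Python sketch records the name and version of every installed package to a text file, similar to what pip freeze produces (the file name requirements.txt is just a convention here):

    # Record installed package names and versions, similar to `pip freeze`
    from importlib.metadata import distributions

    with open("requirements.txt", "w") as f:
        for dist in sorted(distributions(), key=lambda d: d.metadata["Name"].lower()):
            f.write(f"{dist.metadata['Name']}=={dist.version}\n")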

Step 2: Use version control for everything

  • All work done on a computer for a project should be tracked using a version control system, including file name changes and reorganization.  This allows the history of all code and operations to be tracked.

  • Develop a strategy for committing your changes, and stick to it.

  • One good strategy, from this blog post, is to commit:

    • When you complete a unit of work.

    • When you have changes you may want to undo.

Step 3: Code understandably

Your code should be understandable to a reader who is familiar with the language, preferably without the need for many comments in the code.  This isn’t just for other people – it’s also for you when you look back at your code in the future.

  • Practice literate programming whenever possible. Literate programming is an approach to coding in which the logic of the program is explained in natural language alongside the code.  See the section on Reproducible Manuscripts for more on this.

  • Use the Pseudocode Programming Procedure to plan your code. Before you start writing code for a project, first plan out the structure of the code (such as the various functions and classes) using natural-language descriptions.  The pseudocode can then be retained as comments, or as a first step towards a literate program (see the sketch after this list).

  • Use understandable variable names.

  • Don’t embed “magic numbers” in your code.  Inserting literal numbers directly into the code can cause problems if those values need to change in the future.  Always define a clearly named variable to hold any specific value.

  • Use comments sparingly.

  • Document functions and classes. Describe all arguments and return values.

  • Follow language-specific code conventions when available. This will make it much easier for readers to parse your code.

  • Use a code analyzer.  If available, a code analyzer (e.g. pylint or flake8 for Python, lintr for R) can identify errors or potential problems in one’s code, in addition to checking compliance with style conventions.
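
The following is a minimal, hypothetical sketch that puts several of these points together: the structure was first planned as pseudocode comments, the significance threshold is a named constant rather than a magic number, and the docstring documents all arguments and the return value. The function and variable names are illustrative, not part of any particular library.

    # Named constant rather than a magic number embedded in the code
    SIGNIFICANCE_THRESHOLD = 0.05


    def count_significant_results(p_values, threshold=SIGNIFICANCE_THRESHOLD):
        """Count how many p-values fall below a significance threshold.

        Args:
            p_values: iterable of p-values, each in the range [0, 1].
            threshold: significance cutoff (defaults to SIGNIFICANCE_THRESHOLD).

        Returns:
            The number of p-values strictly less than the threshold.
        """
        # Pseudocode plan, retained as comments:
        # - check that every value is a valid probability
        # - count the values below the threshold and return the count
        assert all(0 <= p <= 1 for p in p_values), "p-values must be in [0, 1]"
        return sum(p < threshold for p in p_values)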

Step 4: Code defensively

Assume that errors will occur, and create code that will be robust to those errors and/or will call attention to them when they exist.

  • Use assertions. Assertions are statements that will cause a program to signal an error if a particular condition is false. These should be used to detect problematic conditions.

    • Example: If a number is meant to be a probability, then its value must be within the range [0, 1]:

    • Python: assert p >= 0 and p <= 1

  • Write tests for important functions.  Tests can ensure that a function performs properly, and can test for its ability to handle potential errors.

    • For data analysis functions, it can be useful to generate dummy data and assert that your function returns the expected outcome when applied to the dummy data, and that it also loudly fails when given inappropriate inputs (see the sketch after this list).

    • See the section on Software Testing for more.

  • Be aware of random seeds.  In any code that uses random number generators (RNG), the results may vary depending upon the initialization of the RNG.  By default, the random seed is usually set based on the current time when the RNG is first used in a session, which will give a different set of random numbers each time the program is run.

    • One can explicitly set the random seed in order to ensure that the same random numbers are used each time the program is run.

      • In some cases this is useful, but it could also be problematic, as it limits the generalizability of the results to that particular random seed. Any results should be confirmed using multiple random seeds.

    • A good solution is to obtain a random number using the default (time-based) seed, use this number as the random seed, and also store it for later reference. This allows a user in the future to re-run the program using exactly the same seed, which should allow exact reproducibility even with random numbers (see the sketch after this list).
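
As a sketch of the dummy-data testing idea, the following hypothetical test file (using pytest) assumes a count_significant_results() function like the one sketched in Step 3, which counts p-values below a threshold; the module name "analysis" is also an assumption.

    # test_analysis.py -- hypothetical test file for illustration
    import pytest

    from analysis import count_significant_results


    def test_expected_count_on_dummy_data():
        # dummy data with a known expected outcome
        dummy_p_values = [0.01, 0.20, 0.04, 0.99]
        assert count_significant_results(dummy_p_values, threshold=0.05) == 2


    def test_fails_loudly_on_invalid_input():
        # a p-value of 1.5 is impossible, so the function should raise an error
        with pytest.raises(AssertionError):
            count_significant_results([0.5, 1.5])

And one way to implement the seed-recording idea, assuming numpy (other random number generators work similarly):

    import numpy as np

    # obtain a fresh seed from the default entropy source, and store it so
    # the analysis can be re-run later with exactly the same random numbers
    seed = np.random.SeedSequence().entropy
    with open("random_seed.txt", "w") as f:
        f.write(str(seed))

    rng = np.random.default_rng(seed)
    samples = rng.normal(size=100)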

Step 5: Code portably

Assume that your code will need to run on other computers, so details about your specific computer should not be stored in the code itself.

  • Never use absolute paths such as /Users/username/code/projectname.  Instead, specify a base path that can be modified in one place, and then use that base path to create all other paths (see the sketch after this list).

  • Never store credentials or secrets in code.

    • Cybercriminals regularly scan GitHub for credentials (such as access keys for cloud computing systems like AWS), and may use those keys to commit crimes.

    • In addition, these secrets may need to change for portability across different systems.

  • Use configuration files or environment variables

    • Instead of storing details about the system in code, store them in a configuration file or an environment variable.

    • Be sure not to check the configuration file into your git repository!

      • Add the name of the configuration file to your .gitignore file, which will prevent it from being tracked by git.
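
A minimal sketch of these points in Python, assuming a hypothetical PROJECT_BASEDIR environment variable and a config.json file that is listed in .gitignore:

    import json
    import os
    from pathlib import Path

    # read the base path from an environment variable rather than hard-coding
    # an absolute path such as /Users/username/code/projectname
    base_dir = Path(os.environ.get("PROJECT_BASEDIR", "."))
    data_dir = base_dir / "data"
    results_dir = base_dir / "results"

    # read credentials and other system-specific settings from a configuration
    # file that is kept out of version control
    with open(base_dir / "config.json") as f:
        config = json.load(f)
    api_key = config["api_key"]  # hypothetical key name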

Step 6: Automate your workflow

  • The goal of automation is to allow one to issue a single command that executes the entire workflow (see the sketch after this list).

  • Workflow automation brings many benefits.

    • It allows one to easily rerun an entire workflow if there are changes to the input data or preprocessing.

    • It saves the researcher time, because they don’t have to recreate a complex set of steps to run an analysis.

    • It ensures that the entire path of the analysis, from raw data to final results, is documented and that the provenance of the final results is clear.

    • It makes it much easier to implement analysis approaches (such as multiverse analyses) that require running the workflow many times with different parameters, or resampling methods that rerun the workflow repeatedly on subsamples of the data.

  • There are many different tools for workflow automation.

    • Some are domain-specific

  • A simple general-purpose tool for workflow automation is UNIX make

    • See the section on Workflow Automation for more.
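
Beyond make, even a plain driver script can serve as the single command that runs the whole workflow. A minimal sketch, assuming hypothetical stage functions defined elsewhere in the project:

    # run_all.py -- hypothetical single-command workflow driver
    from preprocessing import preprocess_data
    from analysis import run_analysis
    from figures import make_figures


    def main():
        # each stage reads the outputs of the previous stage, so rerunning
        # this one script regenerates every result from the raw data
        preprocess_data("data/raw", "data/derived")
        run_analysis("data/derived", "results")
        make_figures("results", "figures")


    if __name__ == "__main__":
        main()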

Advanced steps

Step 7: Containerize your workflow using Docker

Docker allows one to generate a single environment that can be reused by anyone, containing specific versions of all the dependencies that an analysis (or other code) needs to run.  In essence, Docker creates a virtual Linux-based computer running inside your computer.

  1. Install the Docker Desktop software, which is available for macOS and recent versions of Windows

    • If you are working on a shared system such as a cluster, you will still need to use Docker on your local machine to build the container, since building a Docker container requires administrative privileges that are not available on a shared system.

  2. If you are not familiar with Docker, complete the first three sections of A Docker Tutorial for Beginners (through the “Hello World” section)

  3. Create a Dockerfile that contains all of your necessary dependencies

    • This often requires trial and error, as it may not be immediately obvious which libraries are necessary and/or how to find them for Linux

      • Note: creating a Dockerfile requires one to understand how to install packages on a Linux system.  If you are not comfortable with installing packages from the command line using apt-get, here is an introduction.

    • TBD: find or create a good tutorial on creating a dockerfile, using the ubuntu base image

      • Be sure to specify versions of all packages, as described below.

  4. Build your container

    • It can be useful to set up a Makefile containing your docker commands so that you can build the container using a simple command like make docker-build

  5. Run your analysis within a Docker container

  6. Push your container to DockerHub

    • This will allow others to use your container without the need to build it themselves. It also allows you to refer to a specific fixed version of the container, so that others can use exactly the same version

Step 8: Run your workflow automatically using continuous integration

  • Continuous integration systems (such as GitHub Actions (https://docs.github.com/en/actions) or CircleCI) provide the ability to automatically run particular code whenever a new commit is pushed to a GitHub repository.

  • These systems generally allow some amount of free testing for public open source projects.

  • This is primarily meant to be used for software testing, but can also be leveraged to actually run scientific data analyses in a fully reproducible way.

    • The results of the analysis can be exposed as “artifacts” and uploaded to a data repository to be shared.

  • Examples:

Frequently Asked Questions

How do I set up a Python environment?

How do I set up an R/RStudio environment?

How do I create a Virtual Environment?

  • Within Anaconda you can create virtual environments for Python and R, in which you install a particular set of dependencies.  This can be useful if you have different projects in which you may need to install different software versions.

  • To create a virtual environment with a particular python version:

    • conda create -n myprojectenv python=3.8

  • To activate the environment:

    • conda activate myprojectenv

How can I install a specific set of package versions in R?

  • In R it is difficult to specify the package versions being used, especially for package dependencies.

  • One way to achieve exact version reproducibility is to use the Checkpoint package for R (https://cran.r-project.org/web/packages/checkpoint/index.html).  This package allows you to specify a particular date, and then uses all of the package versions that are current as of that date:

    library(checkpoint)
    checkpointDir <- '/checkpoint'
    checkpoint("2019-08-13", checkpointLocation = checkpointDir)
    
  • It is also important to record the package versions that are used so that they can be reported and/or shared with the results upon publication (now required by some publishers such as Nature).  You can use the package_info() function from devtools to get this information and save it to a text file:

    library(devtools)
    package_info <- devtools::package_info()
    write.table(package_info, file = "R_package_info.txt")
    

How can I install a specific version of a Python package?

  • You can install specific versions of python packages using:

    • pip (e.g. pip install MySQL_python==1.2.2)

    • conda (e.g. conda install scipy=0.15.0).

What is Singularity?

Singularity provides the ability to use containers like Docker without requiring administrative access, so that it is usable on high-performance computing (HPC) clusters. Recent versions of Singularity can use images directly from DockerHub, making it very easy to implement once an image has been pushed to DockerHub.

How should I structure my repository?

  • See the section on Code Sharing for more on how to structure a repository.