

Why should you use Docker?

When publishing an academic paper, you are usually required to provide the code and data with which you obtained your results, so that they can be verified and reproduced. The results may work on your computer, but once the code is sent to someone else, it may no longer reproduce the analysis. The reasons behind this can be many (e.g., different software versions, different operating systems). In other cases, there might indeed be an issue with the code itself. Reporting back the exact problem so that it can be fixed can be complicated, especially if the reviewer and the author do not work in a common environment.

Because Docker encapsulates everything that is needed for a project (e.g., Python, R, …) and isolates it into containers (i.e., abstracting it from the host operating system), all of the issues mentioned above are solved. With Docker, you don’t even need to have the program, or the exact version of it in which the project works, installed on your machine to run the workflow. Thus, the “but it works on my machine” argument becomes “it either works on any machine, or it doesn’t.” Great, right?

Tip

Still haven’t downloaded Docker? Check our building block on how to get Docker up and running.

Docker basics

Dockerfiles

A Dockerfile is a text file (no file extension!) containing the instructions that will be executed to produce a new Docker image, from which containers can then be run. The basic structure of a Dockerfile is the following:

FROM "base image:version"
RUN "additional packages required to run the image"
COPY "add files to be used into the container"
CMD "default command to be executed"
  • The FROM instruction sets the base image upon which the container will be based. A base image is one that has no parent image (e.g., typically an operating system such as ubuntu or debian). Base images can either be:

    • Official images: maintained and supported by Docker. Examples are ubuntu, python, and hello-world. For instance, check out some R images here.

    • User images: created and shared by any user. Usually, they add additional functionality to a base image, and their names are formatted as user/image-name.

  • RUN adds layers to the base image. It allows you to install additional packages needed for your Docker image to run smoothly.

  • COPY allows you to copy your local machine’s directory structure into the image and/or add the necessary files, e.g., scripts.

  • CMD is used to set the default command to be executed when running the container. There should be only one CMD instruction per Dockerfile; if there are several, only the last one takes effect.

    • If you want to define a container with a specific executable, it might be better to use ENTRYPOINT instead of CMD. You can find more information on ENTRYPOINT here. A minimal Dockerfile combining these instructions is sketched right after this list.
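To make this more concrete, here is a minimal sketch of a Dockerfile that uses all four instructions. It assumes a hypothetical Python-only project with a script called analysis.py and a requirements.txt file; the base image and file names are illustrative, not prescribed by this tutorial.

# Start from an official Python base image (an assumption for this sketch)
FROM python:3.11-slim

# Copy the dependency list and install the required packages
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the (hypothetical) analysis script into the image
COPY analysis.py .

# Default command executed when the container starts
CMD ["python", "analysis.py"]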
Tip

Dockerfiles can be written in any text editor, such as Visual Studio Code, which is especially popular for working with Docker because of its official Docker extension. This extension offers excellent features such as debugging and auto-completion, which make things a lot easier.

Let’s first learn some theory - Docker 101

Once the Dockerfile is ready to go, the first step is to build the image from it. To do so, use the following command in your terminal:

  • docker build -t myname/myimage .

Once the image is built, you can run it in a container by typing in:

  • docker run myname/myimage

These are the two basic steps to create an image and run it inside a container. Now we’re ready to get hands-on experience.
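Beyond build and run, a few standard Docker commands are handy for managing images and containers (the image name myname/myimage is just the example from above):

# list the images available on your machine
docker images

# list containers (-a also shows stopped ones)
docker ps -a

# remove a stopped container, or an image you no longer need
docker rm <container-id>
docker rmi myname/myimage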

Tip
  • For further information on image and container manipulation, check out this handbook

  • Do you need further help with the basics or want to slow down the pace? Take a look at this Docker curriculum

  • Running CMD in Docker but getting an error message? It could be that your execution script is not readable or executable. Set its file permissions using chmod +x src/run.sh. See also here. Alternatively, this can be fixed inside the Dockerfile, as sketched below.
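If you prefer to fix the permissions inside the image rather than on your local machine, a line like the following could be added to a Dockerfile after its COPY step (a sketch, not something the example below requires):

# make the shell script executable inside the image
RUN chmod +x src/run.sh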

Let’s dockerize a research workflow that runs both Python and R

In many empirical research projects, we use R, Python, or both. The best thing is that both are free and open source (i.e., you can run them anywhere without purchasing a license). In this application, we start from a very basic image (Ubuntu) and then add the required layers of R and Python on top of it.

Tip

Want to run a container based solely on an R image? Check out the following:

Example repository

How can we do this? Let’s create a simple repository that…

  • generates a dataset drawn from a normal distribution using the NumPy library in Python,
  • saves the data in the data directory, and then creates a histogram from that data in R (saving it in the gen folder).

All these steps are run with one shell script. Let’s structure our workflow in the following manner.


docker-demo
│
├── src
│   ├── packages.R  ....................... necessary packages to run the R script
│   ├── r-script.R  ....................... creates a histogram, saves it in the gen folder
│   ├── pyscript.py ....................... draws a random normal sample, saves it in the data folder
│   ├── requirements.txt .................. library needed for pyscript.py
│   └── run.sh ............................ shell instructions to run both scripts
├── data
│
├── gen
│
└── Dockerfile

The Scripts

Let’s take a look at the scripts that will carry this out:

run.sh

#!/bin/sh
python src/pyscript.py
Rscript src/r-script.R

pyscript.py

# Create a normal random sample
import numpy as np

mu, sigma = 0, 0.1
np.random.seed(1)
data = np.random.normal(mu, sigma, 1000)

# save it
np.savetxt('data/data.csv', data, delimiter=',', header='X')

r-script.R

# import data
data <- read.csv("data/data.csv")
Z <- as.matrix(data)

# create a histogram and save it to the output folder
dir.create('gen')
png("gen/histogram.png")
histogram <- hist(Z)
dev.off()

  • As for packages.R and requirements.txt: the former installs the MASS package (it’s not really needed and is just kept here for the demo), and the latter installs the NumPy library used in pyscript.py. Their contents are sketched below.
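The contents of these two files are not shown above; based on the description, they could look roughly as follows (the CRAN mirror URL in packages.R is an assumption for this sketch):

requirements.txt

numpy

packages.R

# install the MASS package (kept here purely for demonstration purposes)
install.packages("MASS", repos = "https://cran.r-project.org")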

The Dockerfile

# Define base image/operating system
FROM ubuntu:latest

ENV DEBIAN_FRONTEND=noninteractive

# Install software
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
 python3.9 python3-dev python3-pip r-base
RUN ln -s /usr/bin/python3 /usr/bin/python

# Set the container's working directory - this is arbitrary, but it needs to be the same directory you later mount when transferring files out of the container
WORKDIR /docker_rpy

# Copy files and directory structure to working directory
COPY . .

# Install necessary packages for Python and R
RUN pip3 install -r src/requirements.txt
RUN Rscript src/packages.R

# Run commands specified in "run.sh" to get started
ENTRYPOINT ["sh", "src/run.sh"]

Line-by-line, this Dockerfile instructs the following:

  • Starts from the latest Ubuntu image available as the base image

  • Suppresses the prompts for choosing your location during the R install

  • Runs apt-get update to refresh the package index, so that the packages installed next are found at their current versions

  • Installs the essential build tools for Ubuntu (build-essential), skipping recommended and suggested dependencies thanks to the --no-install-recommends flag

  • Installs Python 3.9, the Python development headers, and pip, so that pip can be used to install the requirements

  • Creates a symlink so that python points to python3 inside the container

  • Sets the working directory for the container and copies all files into it, using the same structure as in the local machine

  • Installs necessary python libraries from the requirements.txt file and the necessary R packages from the file packages.R

  • Sets run.sh as the entrypoint, i.e., the command executed by default when the container starts

Running the container

Let’s first build the image from the current working directory where the Dockerfile is located (.) by typing the following into the terminal:

docker build -t myname/myimage .
  • Here, the -t (or --tag) argument assigns a name to the image (myname/myimage), so you can refer to it when running the container later
  • Remember to also type the . at the end of the command: it tells Docker to use the current directory as the build context, i.e., where it looks for the Dockerfile and the files to copy

Building the image can take a few minutes. Once built, we run the container based on the image we just created by typing in:

docker run -it --rm  -v "PATH on local computer":"container path" myname/myimage
  • The -it argument runs the container interactively with a terminal attached. For example, use it like: docker run -it --rm -v "$(pwd)/.:/docker_rpy" myname/myimage (recall we are using the working directory as specified in the Dockerfile, to ensure we can “take out” any of our generated files).

  • The --rm argument makes sure the container is automatically removed once we stop it.

  • The -v (or --volume) argument tells Docker which local folders to map to the folders created inside the container (/docker_rpy in this case). This example makes sure that the dataset generated by pyscript.py and the histogram created by r-script.R are saved into the data and gen folders on the local machine, respectively. Hence, the resulting directory structure should be the following (a quick way to check it is sketched after the tree):

    docker-demo
    β”‚
    β”œβ”€β”€ src
    β”‚
    β”œβ”€β”€ data
    β”‚    └── data.csv ......................... data generated from pyscript.py
    β”œβ”€β”€ gen
    β”‚    └── histogram.png .................... histogram obtained from r-script.R
    β”‚
    └── Dockerfile
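After the container finishes, a quick way to confirm on your local machine that both files were written is the following (shell sketch; the commented lines show the output you would expect if everything ran correctly):

ls data gen
# data:
# data.csv
#
# gen:
# histogram.png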
    
Warning

Every time you modify or move a file, the image will have to be rebuilt before you rerun the container. Otherwise, the modifications made on the local machine will not be reflected inside the container.
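In practice, that simply means repeating the two commands from above after every change, using the same image name and volume mapping as in the example:

# rebuild the image, then rerun the container
docker build -t myname/myimage .
docker run -it --rm -v "$(pwd)/.:/docker_rpy" myname/myimage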

Additional Resources

  1. Complete handbook on Docker: The Docker Handbook

  2. R script in Docker tutorial

  3. Tutorial on Docker for Data Science

  4. Information on containerisation

  5. Containerizing a Multi-Container JavaScript Application