Principles of Project Setup and Workflow Management

Motivation

When working on a project, most of us spend time thinking about what to create (e.g., a cleaned data set, a new algorithm, an analysis, and/or a paper and corresponding slides), but not about how to manage its creation. This results in two major issues, as outlined next.

Major issues in managing data-intensive projects

We may lose sight of the project (“directory and file chaos”)
- Gradually, we write code and edit data sets, and put those files somewhere on our computer.
- When we update files, we either overwrite them, or save new versions under different file names (e.g., including dates like 20191126_mycode.R or version numbers like mycode_v2.R).
- Even with the best intentions to keep everything tidy, months or years of working on a project will very likely result in chaos.
We may find it difficult to (re)execute the project (“lack of automation”)
- The way you have set up your project may make it cumbersome to execute. Which code files need to be run, and which do not? How long does each code file take to complete?
- For example, you may wish to redo your analysis on a small subset of the data (either for prototyping or as a robustness check), try out new variables, or test whether a new package provides speed gains.
- However, re-running your project takes a lot of time, if you remember at all how to “run” the various code files you have written.
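
One way to reduce this friction is to give the project a single entry point that runs every step in order and accepts a sampling fraction, so that a quick prototype run and a full run share the same code. The sketch below is a minimal illustration in Python, not a prescribed implementation: the function names (`load_data`, `take_subset`, `analyze`, `run_pipeline`) and the generated toy records are hypothetical stand-ins for your own data and analysis.

```python
import random


def load_data(n_rows=1000):
    """Stand-in for reading the raw data; returns hypothetical toy records."""
    return [{"id": i, "value": i % 7} for i in range(n_rows)]


def take_subset(rows, fraction, seed=42):
    """Return a reproducible random subset containing `fraction` of the rows."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    if fraction == 1:
        return list(rows)
    rng = random.Random(seed)  # fixed seed: the same subset on every run
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


def analyze(rows):
    """Stand-in for the actual analysis; here simply the mean of `value`."""
    return sum(r["value"] for r in rows) / len(rows)


def run_pipeline(fraction=1.0):
    """Run all steps in order so the whole project has one entry point."""
    rows = load_data()
    subset = take_subset(rows, fraction)
    return {"n_rows": len(subset), "mean_value": analyze(subset)}


if __name__ == "__main__":
    print(run_pipeline(fraction=0.1))  # quick prototype run
    print(run_pipeline(fraction=1.0))  # full run
```

Because the subset is drawn with a fixed seed, repeated prototype runs operate on the same rows, which makes it easier to compare results across code changes.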

The primary mission of managing data- and computation-intensive projects is to build a transparent project infrastructure that allows you to easily (re)execute your code, potentially many times.

Guiding Principles

The objectives of this tutorial are:

learn how to organize and track the evolution of your projects (e.g., by means of a proper directory structure, and code versioning)
learn how to automate your workflows and make them reproducible (e.g., by using build tools such as make)
learn how to work on projects with others (e.g., by means of Git/GitHub)
learn how to document datasets and workflows
learn how to write clean code (e.g., see our Building Blocks)

Gradual Implementation

Tip

Gradually implement our suggestions.

We may sometimes sound a bit dogmatic (e.g., you must do this or that). Some of our instructions will only make sense to you after a while. So, please stick with our recommendations during the course of your first few projects. Later on, take the liberty to re-design the workflows to your needs.
Consider adopting our suggestions gradually.
1. Start with a proper directory structure on your local computer, which may already be sufficient to start collaborating. For example, do you need feedback from your advisor? Just zip (the relevant parts of) your pipeline and use SURF’s filesending service (for researchers affiliated with Dutch institutions) to send it!
2. Start automating (parts of) your pipeline
3. Document your project and raw data
4. Start to track changes to your source code, and clean up your source code regularly (“do your housekeeping”)
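
The first step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the folder names `data`, `src`, and `gen`, the helper names `create_skeleton` and `zip_for_sharing`, and the archive name are hypothetical, and the directory structure you end up using may look different.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical folder names; choose whatever fits your project.
DEFAULT_FOLDERS = ("data", "src", "gen")


def create_skeleton(root, folders=DEFAULT_FOLDERS):
    """Create the project root and its subfolders (no error if they exist)."""
    root = Path(root)
    for name in folders:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def zip_for_sharing(root, subfolder="src"):
    """Zip one subfolder of the project and return the path of the archive."""
    root = Path(root)
    # The archive is written next to the subfolder, not inside it.
    archive_base = root / f"{subfolder}_to_share"
    archive = shutil.make_archive(
        str(archive_base), "zip", root_dir=str(root / subfolder)
    )
    return Path(archive)


if __name__ == "__main__":
    # Demonstrate in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        project = create_skeleton(Path(tmp) / "my_project")
        (project / "src" / "analysis.py").write_text("print('hello')\n")
        print(zip_for_sharing(project))
```

Zipping only the relevant subfolder keeps the archive small, which matters when the raw data are large.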

Warning

Uhh, you just suggested sending an email, really?!

Indeed, email is not what we want to advocate.
But then again, we want you to get started with managing your workflows right away, and adopting a proper directory structure already increases your efficiency.
So, before you proceed to the later chapters of this guide, sit back, relax, and keep on using good old email.

Configure your Computer

Think your machine is already configured well? Then proceed directly to the next page.

Tip

Configure your computer.

Note that to implement our workflow suggestions, your computer needs to be configured properly, so we suggest you do that first.
Of course, you need not install all software tools, but pick at least your statistical software package (e.g., we use R, but others prefer Stata), Python, and make.
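
To see quickly whether these tools are already available, you could check whether their executables can be found on your PATH. The following is a minimal sketch using Python's standard `shutil.which`; the executable names in `DEFAULT_TOOLS` and the helper name `find_tools` are assumptions for illustration and may differ on your system (for example, `python3` instead of `python`).

```python
import shutil

# Assumed executable names; adjust them to your own setup.
DEFAULT_TOOLS = ("Rscript", "python", "make")


def find_tools(names=DEFAULT_TOOLS):
    """Map each tool name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        status = path if path else "NOT FOUND"
        print(f"{name:10s} {status}")
```

A tool reported as not found is not necessarily missing; it may be installed but not yet added to the PATH.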

Of course, there are many ways to set up a machine, and we do not claim that our way is perfect. In fact, the setup instructions sometimes reflect a compromise, so that later instructions, which are given to Windows, Mac, and Linux users alike, can be followed as closely as possible.

If you are writing your Master's thesis at Tilburg University, please attempt to install the necessary software and packages prior to your first meeting.

Summary

If everything goes smoothly, you should be able to complete the installation in one sitting of 60-120 minutes.
Please follow the steps one by one in the order they appear in the sidebar, and do not deviate from them unless you really know what you are doing.
Where necessary, we have provided instructions for Mac, Windows and Linux machines.

Warning

We will use the terms command prompt (Windows) and terminal (Linux, Mac) interchangeably.



