Motivation

When working on a project, most of us spend time thinking about what to create (e.g., a cleaned data set, a new algorithm, an analysis, and/or a paper and corresponding slides), but not about how to manage its creation. This results in two major issues, as outlined next.

Major issues in managing data-intensive projects

  • we may lose sight of the project (“directory and file chaos”)

    • Gradually, we write code and edit data sets, and put those files somewhere on our computer.
    • When we update files, we either overwrite them, or save new versions under different file names (e.g., including dates like 20191126_mycode.R or version numbers like mycode_v2.R).
    • Even with the best intentions to keep everything tidy, months or years of working on a project will very likely result in chaos.
  • we may find it difficult to (re)execute the project (“lack of automation”)

    • The way you have set up your project may make it cumbersome to execute. Which code files need to be run, and which do not? How long does each code file take to complete?
    • For example, you wish to re-do your analysis on a small subset of the data (either for prototyping, or as a robustness check), or you would like to try out new variables or test whether a new package provides speed gains…
    • However, re-running your project takes a lot of time - if you remember at all how to “run” the various code files you have written.

The primary mission of managing data- and computation-intensive projects is to build a transparent project infrastructure that lets you easily (re)execute your code, potentially many times.
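To make the idea of (re)executing a project more concrete, here is a minimal Python sketch. It is not part of the workflow described in this guide: it only illustrates a single entry point that runs the steps of a project in a fixed order, optionally on a random subset of the data. The function names (`clean`, `analyze`, `run_pipeline`), the `sample_fraction` parameter, and the toy data are all assumptions made up for this illustration.

```python
import random


def clean(rows):
    """Hypothetical cleaning step: drop records with missing values."""
    return [r for r in rows if r is not None]


def analyze(rows):
    """Hypothetical analysis step: compute the mean of the cleaned values."""
    if not rows:
        raise ValueError("no rows left to analyze")
    return sum(rows) / len(rows)


def run_pipeline(rows, sample_fraction=1.0, seed=42):
    """Run all steps in a fixed order, optionally on a random subset.

    A sample_fraction below 1.0 is meant for prototyping or robustness checks.
    """
    if sample_fraction < 1.0:
        rng = random.Random(seed)  # fixed seed keeps the subset reproducible
        k = max(1, int(len(rows) * sample_fraction))
        rows = rng.sample(rows, k)
    cleaned = clean(rows)
    return analyze(cleaned)


if __name__ == "__main__":
    data = [1.0, 2.0, None, 4.0, 5.0]  # toy data, made up for this sketch
    print(run_pipeline(data))                       # full run
    print(run_pipeline(data, sample_fraction=0.6))  # quick run on a subset
```

With one entry point like this, re-running the whole project, or a cheaper version of it, is a single call rather than a sequence of steps you have to remember.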

Guiding Principles

The objectives of this tutorial are:

Gradual Implementation

Tip

Gradually implement our suggestions.

Warning

Uhh, you just suggested sending an email, really?!

  • Indeed, email is not what we want to advocate.
  • But then again, we want you to get started with managing your workflows right away, and adhering to the directory structure outlined above already increases your efficiency.
  • So, before you proceed to the later chapters of this guide, sit back, relax, and keep on using good old email.

Configure your Computer

Think your machine is already configured well? Then proceed directly to the next page.

Tip

Configure your computer.

  • To implement our workflow suggestions, your computer needs to be configured properly, so we suggest you do that first.
  • Of course, you need not install every software tool - but pick at least your statistical software package (e.g., we use R, but others prefer Stata), Python, and make.
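As a quick sanity check before or after installing, a small Python sketch like the following can report which tools are not yet reachable from your command line. The command names in `required` are assumptions: R is commonly started via `Rscript`, and on some systems Python is available as `python3` rather than `python`, so adjust the list to your machine. The helper name `find_missing_tools` is made up for this example.

```python
import shutil


def find_missing_tools(tools):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Assumed command names; adapt them to your own setup.
    required = ["Rscript", "python", "make"]
    missing = find_missing_tools(required)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

A tool reported as missing may still be installed but simply not on your PATH, which the setup instructions are meant to take care of.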

Of course, there are many ways to set up a machine, and we do not claim that ours is perfect. In fact, the setup instructions are sometimes a compromise, chosen so that later instructions - given to Windows, Mac, and Linux users alike - can be followed as closely as possible.

If you are writing your Master's thesis at Tilburg University, please try to install the necessary software and packages before your first meeting.

Summary

  • If everything goes smoothly, you should be able to complete the installation in one sitting of 60 to 120 minutes.
  • Please follow the steps one by one, in the order they appear in the sidebar, and do not deviate from them unless you really know what you are doing.
  • Where necessary, we have provided instructions for Mac, Windows, and Linux machines.

Warning

  • We will use the terms command prompt (Windows) and terminal (Linux, Mac) interchangeably.