[make, makefile, automation, recipes, workflow]


Use this checklist to improve your project's structure

Overview

Over time, projects tend to become messy, which hampers reproducibility. We therefore recommend revisiting this housekeeping checklist from time to time.

Checklist

Project level

  • Implement a consistent directory structure: data/src/gen
  • Include a readme with a project description and technical instructions on how to run/build the project
  • Store any authentication credentials outside of the repository (e.g., in a JSON file), NOT in clear text in the source code
  • Mirror your /data folder to a secure backup location; alternatively, store all raw data on a secure server and download relevant files to /data
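As an illustration of keeping credentials out of the source code, here is a minimal Python sketch that reads them from a JSON file stored outside the repository. The environment variable MYPROJECT_CREDENTIALS, the default location ~/.myproject/credentials.json, and the field names username and api_key are all assumptions made for this example, not part of the checklist.

```python
import json
import os
from pathlib import Path

# Hypothetical field names; adapt them to whatever your service requires.
REQUIRED_FIELDS = {"username", "api_key"}


def load_credentials(path=None):
    """Read credentials from a JSON file kept outside the repository.

    If no path is given, the location is taken from the (hypothetical)
    MYPROJECT_CREDENTIALS environment variable, falling back to a file
    in the user's home directory.
    """
    if path is None:
        default = Path.home() / ".myproject" / "credentials.json"
        path = os.environ.get("MYPROJECT_CREDENTIALS", str(default))
    with open(path, encoding="utf-8") as handle:
        credentials = json.load(handle)
    # Fail early, with a clear message, if an expected field is missing.
    missing = REQUIRED_FIELDS - set(credentials)
    if missing:
        raise KeyError(f"Missing credential fields: {sorted(missing)}")
    return credentials
```

A script would then call load_credentials() once at startup and pass the values on, so that no password ever appears in a versioned file.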

Throughout the Pipeline

File/directory structure

  • Create a subdirectory for source code: /src/[pipeline-stage-name]/
  • Create subdirectories for generated files in /gen/[pipeline-stage-name]/: temp, output, and audit.
  • Make all file names relative, and not absolute (i.e., never refer to C:/mydata/myproject, but only use relative paths, e.g., ../output)
  • Create directory structure from within your source code, or use .gitkeep
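Creating the directory structure from within your source code could look roughly like the Python sketch below. The function name make_stage_dirs and the stage name used in the usage note are hypothetical; only the temp, output, and audit subdirectories come from the checklist. The paths are built relative to a root you pass in, so nothing depends on an absolute location.

```python
from pathlib import Path

# Subdirectories named in the checklist for each pipeline stage.
SUBDIRS = ("temp", "output", "audit")


def make_stage_dirs(stage, gen_root=Path("gen")):
    """Create <gen_root>/<stage>/{temp,output,audit} and return the paths."""
    created = []
    for name in SUBDIRS:
        target = Path(gen_root) / stage / name
        # parents=True builds missing parent folders;
        # exist_ok=True makes the call safe to repeat on every run.
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created
```

For example, a script in /src/data-preparation/ might call make_stage_dirs("data-preparation", Path("../../gen")) before writing any files.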

Automation & documentation

  • Have a makefile
  • Alternatively, include a readme with running instructions
  • Make dependencies between source code and files-to-be-built explicit, so that make automatically recognizes when a rule does not need to be run (properly define targets and source files)
  • Include a rule in the makefile to delete temp, output, and audit files
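To see why explicit dependencies matter, it can help to look at the timestamp comparison that make performs. The Python sketch below is a simplified illustration of that idea, not a replacement for make; the function name needs_rebuild is made up for this example.

```python
import os


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any of its sources.

    This mirrors, in simplified form, the check make applies to a rule:
    a target is rebuilt when it does not exist or when at least one
    prerequisite has a newer modification time.
    """
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(source) > target_time for source in sources)
```

If a source file is not listed as a dependency, a change to it never triggers this check, and the target silently goes stale. That is the reason for declaring targets and source files properly.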

Versioning

  • Version all source code stored in /src (i.e., add to Git/GitHub)
  • Do not version any files in /data and /gen (i.e., do NOT add them to Git/GitHub)
  • Want to exclude additional files (e.g., files that (unintentionally) get written to /src)? Use .gitignore for files/directories that need not be versioned

Housekeeping

  • Have short and accessible variable names
  • Loop what can be looped
  • Break down “long” source code into subprograms/functions, or split a script into multiple smaller scripts
  • Delete what can be deleted (including unnecessary comments, legacy calls to packages/libraries, variables)
  • Use asserts (i.e., make your program crash if it encounters a condition that would otherwise not be recognized as an error)
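The point about asserts can be sketched with a small Python example. The function average_price and the rule that prices must be non-negative are assumptions chosen for illustration. A negative price would not raise an error by itself, so without the assert the mistake would pass silently into the results.

```python
def average_price(prices):
    """Compute the mean price, halting on input that would corrupt results."""
    # An empty list usually means an upstream step produced no data.
    assert len(prices) > 0, "no prices supplied"
    # Python would happily average negative numbers; we treat them as an error.
    assert all(price >= 0 for price in prices), "negative price found"
    return sum(prices) / len(prices)
```

Note that Python skips assert statements when run with the -O flag, so for checks that must always run you may prefer raising an exception explicitly.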

Testing for portability

  • Tested on own computer (entirely wipe /gen, re-build the entire project using make)
  • Tested on own computer (first clone to new directory, then re-build the entire project using make)
  • Tested on different computer (Windows)
  • Tested on different computer (Mac)
  • Tested on different computer (Linux)

Warning

Versioned any sensitive data?

Before making a GitHub repository public, we recommend checking that you have not stored any sensitive information in it (such as passwords). This tool has worked great for us: GitHub credentials scanner.

See Also

  • This tutorial covers the fundamental principles of project setup and workflows underlying this checklist.