Part 9: Checklist to Audit Data- and Computation-intensive Projects


There is a lot of material to cover before your workflows become efficient, reproducible, and well-structured.

Here's a checklist you can use to audit your progress.

- Create a subdirectory for source code: /src/[pipeline-stage-name]/
- Create subdirectories for generated files in /gen/[pipeline-stage-name]/: temp, output, and audit
- Make all file names relative, not absolute (i.e., never refer to C:/mydata/myproject; only use relative paths, e.g., ../output)
- Create the directory structure from within your source code, or use .gitkeep
- Have a makefile
- Alternatively, include a readme with running instructions
- Make dependencies between source code and files-to-be-built explicit, so that make automatically recognizes when a rule does not need to be run (properly define targets and source files)
- Include a rule in the makefile to delete temp, output, and audit files
- Version all source code stored in /src (i.e., add it to Git/GitHub)
- Do not version any files in /data and /gen (i.e., do NOT add them to Git/GitHub)
- Want to exclude additional files (e.g., files that unintentionally get written to /src)? Use .gitignore for files/directories that need not be versioned
- Have short and accessible variable names
- Loop what can be looped
- Break down "long" source code into subprograms/functions, or split a script into multiple smaller scripts
- Delete what can be deleted (including unnecessary comments, legacy calls to packages/libraries, and unused variables)
- Use asserts (i.e., make your program crash when it encounters a problem that would otherwise not be raised as an error)
- Tested on own computer (entirely wipe /gen, re-build the entire project using make)
- Tested on own computer (first clone to new directory, then re-build the entire project using make)
- Tested on different computer (Windows)
- Tested on different computer (Mac)
- Tested on different computer (Linux)
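
Several of the items above (creating the /gen subdirectories from within source code, using relative paths, deleting generated files, and using asserts) can be illustrated in a few lines. The following is only a minimal sketch in Python. The function names and the file name `result.csv` are hypothetical and chosen for illustration; they are not prescribed by the checklist.

```python
from pathlib import Path
import shutil


def stage_dirs(project_root: Path, stage: str) -> dict:
    """Return the temp, output, and audit directories of one pipeline stage."""
    gen = project_root / "gen" / stage
    return {name: gen / name for name in ("temp", "output", "audit")}


def create_dirs(project_root: Path, stage: str) -> dict:
    """Create the directory structure from within the source code."""
    dirs = stage_dirs(project_root, stage)
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs


def clean(project_root: Path, stage: str) -> None:
    """Delete temp, output, and audit files so the stage can be rebuilt."""
    for path in stage_dirs(project_root, stage).values():
        shutil.rmtree(path, ignore_errors=True)


def write_output(dirs: dict, rows: list) -> Path:
    """Write a result file (hypothetical name) and assert it looks as expected."""
    # Crash early instead of silently writing an empty file.
    assert len(rows) > 0, "no rows to write; an upstream step probably failed"
    target = dirs["output"] / "result.csv"
    target.write_text("\n".join(rows) + "\n")
    assert target.exists(), f"expected {target} to be written"
    return target
```

A script stored in /src/[pipeline-stage-name]/ and run from that directory could then call `create_dirs(Path("../.."), stage)` with its own stage name, so that no absolute path such as C:/mydata/myproject appears anywhere in the code.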

Warning

Versioned any sensitive data?

Before making a GitHub repository public, we recommend you check that you have not stored any sensitive information in it (such as passwords). This tool has worked great for us: GitHub credentials scanner.
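
To give a rough idea of what such a check does, here is a naive sketch in Python. It is not the scanner mentioned above and does not replace it. The regular expression and the function name `scan_for_secrets` are assumptions made for illustration, and the sketch only looks at files currently on disk, not at earlier commits in the Git history.

```python
import re
from pathlib import Path

# Illustrative pattern only; dedicated scanners use many more rules.
SUSPICIOUS = re.compile(
    r"(password|passwd|secret|api[_-]?key|token)\s*[:=]\s*\S+",
    re.IGNORECASE,
)


def scan_for_secrets(root: Path) -> list:
    """Return (file, line number, line) for each line that looks like a credential."""
    hits = []
    for path in sorted(root.rglob("*")):
        # Skip directories and Git's internal files.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if SUSPICIOUS.search(line):
                hits.append((path, number, line.strip()))
    return hits
```

Calling `scan_for_secrets(Path("."))` from the project root returns a list you can inspect by hand. An empty list does not prove the repository is clean, so treat it as a first pass only.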