Checklist to Audit Data- and Computation-intensive Projects
There is a lot to cover to make your workflows efficient, reproducible, and well-structured.
Here's a checklist you can use to audit your progress.
|At the project level|
|Implement a consistent directory structure: data/src/gen|
|Include a readme with a project description and technical instructions on how to run/build the project|
|Store any authentication credentials outside of the repository (e.g., in a JSON file), NOT as clear text in the source code|
|At the level of each stage of your pipeline|
|Create subdirectory for source code:
|Create subdirectories for generated files in
|Make all file paths relative, not absolute (i.e., never refer to C:\mydata\myproject; only use relative paths, e.g., ../output)||☐||☐||☐||☐|
|Create directory structure from within your source code, or use .gitkeep||☐||☐||☐||☐|
|Automation and Documentation|
|Alternatively, include a readme with running instructions||☐||☐|
|Make dependencies between source code and files-to-be-built explicit, so that
|Include a function in the makefile to delete temp, output, and audit files||☐||☐||☐||☐|
|Version all source code stored in
|Do not version any files in
|Want to exclude additional files (e.g., files that (unintentionally) get written to
|Have short and accessible variable names||☐||☐||☐||☐|
|Loop what can be looped||☐||☐||☐||☐|
|Break down "long" source code into subprograms/functions, or split the script into multiple smaller scripts||☐||☐||☐||☐|
|Delete what can be deleted (including unnecessary comments, legacy calls to packages/libraries, variables)||☐||☐||☐||☐|
|Use asserts (i.e., make your program crash when it encounters a problem that would otherwise not be flagged as an error)||☐||☐||☐||☐|
|Testing for portability|
|Tested on own computer (entirely wipe
|Tested on own computer (first clone to new directory, then re-build the entire project using
|Tested on different computer (Windows)||☐||☐||☐||☐|
|Tested on different computer (Mac)||☐||☐||☐||☐|
|Tested on different computer (Linux)||☐||☐||☐||☐|
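Two of the checklist items, creating the directory structure from within your source code and using asserts, can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the function names `make_stage_dirs` and `check_row_count` are hypothetical, and the folder names `temp`, `output`, and `audit` are assumed defaults that you should adapt to your own project.

```python
from pathlib import Path


def make_stage_dirs(stage_root, names=("temp", "output", "audit")):
    """Create the generated-file subdirectories below stage_root if missing.

    The folder names are assumed defaults; pass your own via `names`.
    """
    created = []
    for name in names:
        path = Path(stage_root) / name
        # exist_ok=True makes the call safe to repeat on every run.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


def check_row_count(rows, expected_minimum):
    """Stop the program early if the data has fewer rows than expected."""
    assert len(rows) >= expected_minimum, (
        f"expected at least {expected_minimum} rows, got {len(rows)}"
    )
    return rows
```

Calling `make_stage_dirs(Path("..") / "gen" / "my-stage")` from a script keeps the path relative to the working directory, so the same code runs unchanged on another computer. The stage name `my-stage` is a placeholder.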
Versioned any sensitive data?
Before making a GitHub repository public, we recommend you check that you have not stored any sensitive information in it (such as passwords). This tool has worked well for us: GitHub credentials scanner.
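One way to avoid versioning credentials in the first place is to keep them in a JSON file outside the repository, as the checklist suggests, and load them at run time. The sketch below assumes the file's location is passed through an environment variable. The variable name `MYPROJECT_CREDENTIALS` and the function name `load_credentials` are hypothetical and not part of any library.

```python
import json
import os
from pathlib import Path


def load_credentials(env_var="MYPROJECT_CREDENTIALS"):
    """Read credentials from a JSON file stored outside the repository.

    The path to the file is taken from an environment variable
    (assumed name), so no secret and no absolute path ends up in the code.
    """
    location = os.environ.get(env_var)
    if location is None:
        raise RuntimeError(f"set {env_var} to the path of your credentials file")
    with Path(location).open(encoding="utf-8") as handle:
        return json.load(handle)
```

Because the source code only contains the name of the environment variable, each collaborator can keep their own credentials file in a different location without changing the code.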