There is quite a bit of material to cover to make your workflows efficient, reproducible, and well-structured.
Here’s a checklist you can use to audit your progress.
| | data-preparation | analysis | paper | … |
|---|---|---|---|---|
| **At the project level** | | | | |
| Implement a consistent directory structure: `/data`, `/src`, `/gen` | | | | |
| Include a readme with a project description and technical instructions on how to run/build the project | | | | |
| Store any authentication credentials outside of the repository (e.g., in a JSON file), NOT as clear text in the source code | | | | |
| Mirror your `/data` folder to a secure backup location; alternatively, store all raw data on a secure server and download relevant files to `/data` | | | | |
| **At the level of each stage of your pipeline** | | | | |
| *File/directory structure* | | | | |
| Create a subdirectory for source code: `/src/[pipeline-stage-name]/` | ☐ | ☐ | ☐ | ☐ |
| Create subdirectories for generated files in `/gen/[pipeline-stage-name]/`: `temp`, `output`, and `audit` | ☐ | ☐ | ☐ | ☐ |
| Make all file paths relative, not absolute (i.e., never refer to `C:/mydata/myproject`; use relative paths only, e.g., `../output`) | ☐ | ☐ | ☐ | ☐ |
| Create the directory structure from within your source code, or use `.gitkeep` | ☐ | ☐ | ☐ | ☐ |
| *Automation and Documentation* | | | | |
| Have a makefile | ☐ | ☐ | ☐ | ☐ |
| Alternatively, include a readme with running instructions | ☐ | ☐ | | |
| Make dependencies between source code and files-to-be-built explicit, so that `make` automatically recognizes when a rule does not need to be run (properly define targets and source files) | ☐ | ☐ | ☐ | ☐ |
| Include a rule in the makefile that deletes temp, output, and audit files | ☐ | ☐ | ☐ | ☐ |
| *Versioning* | | | | |
| Version all source code stored in `/src` (i.e., add it to Git/GitHub) | ☐ | ☐ | ☐ | ☐ |
| Do not version any files in `/data` and `/gen` (i.e., do NOT add them to Git/GitHub) | ☐ | ☐ | ☐ | ☐ |
| Want to exclude additional files (e.g., files that unintentionally get written to `/src`)? Use `.gitignore` for files/directories that should not be versioned | ☐ | ☐ | ☐ | ☐ |
| *Housekeeping* | | | | |
| Use short and accessible variable names | ☐ | ☐ | ☐ | ☐ |
| Loop what can be looped | ☐ | ☐ | ☐ | ☐ |
| Break down “long” source code into subprograms/functions, or split a script into multiple smaller scripts | ☐ | ☐ | ☐ | ☐ |
| Delete what can be deleted (including unnecessary comments, legacy calls to packages/libraries, and unused variables) | ☐ | ☐ | ☐ | ☐ |
| Use asserts (i.e., make your program crash when it encounters a problem that would otherwise go unnoticed) | ☐ | ☐ | ☐ | ☐ |
| *Testing for portability* | | | | |
| Tested on own computer (entirely wipe `/gen`, then re-build the entire project using `make`) | ☐ | ☐ | ☐ | ☐ |
| Tested on own computer (first clone to a new directory, then re-build the entire project using `make`) | ☐ | ☐ | ☐ | ☐ |
| Tested on different computer (Windows) | ☐ | ☐ | ☐ | ☐ |
| Tested on different computer (Mac) | ☐ | ☐ | ☐ | ☐ |
| Tested on different computer (Linux) | ☐ | ☐ | ☐ | ☐ |
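
The file/directory items in the checklist (creating the `temp`, `output`, and `audit` subdirectories from within source code, and using only relative paths) can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the function names `make_stage_dirs` and `clean_stage` and the `project_root` parameter are hypothetical and chosen for this example only.

```python
from pathlib import Path
import shutil

# Subdirectories the checklist asks for inside gen/<pipeline-stage-name>/.
GEN_SUBDIRS = ("temp", "output", "audit")


def make_stage_dirs(project_root: Path, stage: str) -> list[Path]:
    """Create gen/<stage>/{temp,output,audit} under project_root and return the paths."""
    created = []
    for name in GEN_SUBDIRS:
        path = project_root / "gen" / stage / name
        # parents=True builds missing parent folders; exist_ok=True makes re-runs harmless.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


def clean_stage(project_root: Path, stage: str) -> None:
    """Delete all generated files of one stage, e.g. before testing a full re-build."""
    shutil.rmtree(project_root / "gen" / stage, ignore_errors=True)
```

Calling `make_stage_dirs(Path("."), "data-preparation")` from the project root keeps every path relative to the working directory, so the same script should run unchanged after the repository is cloned to another location or machine.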
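
For the housekeeping item on asserts, a small sketch may help. It assumes a hypothetical dataset of rows with `id` and `price` fields; the function name `check_prices` and the specific checks are illustrative only. The idea is to stop the pipeline at the point where an assumption is violated, rather than letting bad data flow silently into later stages.

```python
def check_prices(rows: list[dict]) -> None:
    """Stop the pipeline if the data violates assumptions that later steps rely on."""
    # An empty input usually signals a failed download or a wrong path.
    assert len(rows) > 0, "input is empty"

    # Duplicate ids would silently inflate counts after a merge.
    ids = [row["id"] for row in rows]
    assert len(ids) == len(set(ids)), "duplicate ids found"

    # Negative prices are not meaningful in this hypothetical dataset.
    assert all(row["price"] >= 0 for row in rows), "negative price found"
```

Note that Python skips `assert` statements when run with the `-O` flag, so checks that must always run are better written as explicit `if ...: raise ...` statements.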
Warning
Versioned any sensitive data?
Before making a GitHub repository public, we recommend you check that you have not stored any sensitive information in it (such as passwords). This tool has worked well for us: GitHub credentials scanner.
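
One way to avoid versioning credentials in the first place is the checklist item on keeping them in a JSON file outside the repository. The following is a minimal sketch under stated assumptions: the function name `load_credentials` and the file layout are hypothetical, and the JSON file is assumed to hold a simple key-value object.

```python
import json
from pathlib import Path


def load_credentials(path: Path) -> dict:
    """Read credentials from a JSON file that lives outside the repository."""
    if not path.is_file():
        # Fail early with a clear message instead of falling back to hard-coded values.
        raise FileNotFoundError(
            f"credentials file not found: {path}; create it locally and never commit it"
        )
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

A script could then call, for example, `load_credentials(Path.home() / ".myproject" / "credentials.json")`, where the directory `.myproject` is a placeholder for a location of your choice outside the Git working tree.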