Overview
Using publicly available data from Airbnb (available via Kaggle.com), we illustrate what a reproducible workflow may look like in practice.
We’ve crafted this project to:
- run platform-independently (Mac, Linux, Windows),
- work across a diverse set of software programs (Stata, Python, R),
- produce an entire (mock) paper, including modules that
  - download data from Kaggle,
  - prepare data for analysis,
  - run a simple analysis,
  - produce a paper with output tables and figures.
How to run it
Dependencies
- Install Python.
  - Anaconda is recommended. Download Anaconda.
  - Check availability: type `anaconda --version` in the command line.
- Install the Kaggle package.
  - See the Kaggle API instructions for installation and setup.
- Install automation tools.
  - GNU Make: already installed on Mac and Linux. On Windows, download Make for Windows and install it.
  - Windows users only: make Make available via the command line.
    - Right-click on Computer.
    - Go to Properties, and click Advanced System Settings.
    - Choose Environment Variables, select Path under the system variables, and click Edit.
    - Add the bin directory of Make.
  - Check availability: type `make --version` in the command line.
- Install Stata.
  - Make Stata available via the command line. See the instructions for adding Stata to the path.
  - Check availability: type `$STATA_BIN --version` in the command line.
- Install Perl.
  - Perl is already installed on Mac and Linux. On Windows, download Perl for Windows.
  - Make sure Perl is available via the command line.
  - Check availability: type `perl -v` in the command line.
- Install LyX.
  - LyX is an open source document processor based on LaTeX. Download LyX.
  - Make sure LyX is available via the command line.
  - Check availability: type `$LYX_BIN` in the command line.
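The availability checks above can also be scripted. The following is a minimal sketch, not part of the project itself: the executable names `kaggle`, `stata`, and `lyx` are assumptions, and `STATA_BIN` and `LYX_BIN` are read as environment variables, mirroring the commands listed above.

```python
import os
import shutil


def resolve_tool(name, env_var=None):
    """Return the path of a tool, or None if it cannot be found.

    If env_var is given and set, its value is used instead of name.
    """
    candidate = os.environ.get(env_var) if env_var else None
    return shutil.which(candidate or name)


def missing_tools(tools):
    """Return the labels of all tools that cannot be found."""
    return [label for label, (name, env_var) in tools.items()
            if resolve_tool(name, env_var) is None]


# Assumed executable names; adjust them to your installation.
TOOLS = {
    "Anaconda": ("anaconda", None),
    "Kaggle": ("kaggle", None),
    "Make": ("make", None),
    "Perl": ("perl", None),
    "Stata": ("stata", "STATA_BIN"),
    "LyX": ("lyx", "LYX_BIN"),
}

if __name__ == "__main__":
    for label in missing_tools(TOOLS):
        print(f"Missing: {label}")
```

Running the script prints one line per tool that is not reachable from the command line, and nothing if all of them are found.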
Run it
Open your command line tool:
- Check whether your present working directory is `airbnb-workflow` by typing `pwd` in the terminal.
  - If not, type `cd yourpath/airbnb-workflow` to change your directory to `airbnb-workflow`.
- Type `make` in the command line.
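The two steps above can be wrapped in a small helper that refuses to start when no makefile is present. This is only a sketch under assumptions: `find_makefile` and `run_workflow` are hypothetical names, and the file names checked are the ones GNU Make looks for by default.

```python
import subprocess
from pathlib import Path


def find_makefile(directory):
    """Return the makefile in directory, or None if there is none."""
    # GNU Make searches for these names by default.
    for name in ("GNUmakefile", "makefile", "Makefile"):
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


def run_workflow(directory="."):
    """Run make in directory after checking that a makefile is present."""
    if find_makefile(directory) is None:
        raise FileNotFoundError(
            f"No makefile found in {Path(directory).resolve()}"
        )
    # Equivalent to `cd directory` followed by `make`.
    return subprocess.run(["make"], cwd=directory, check=True)
```

Calling `run_workflow("yourpath/airbnb-workflow")` then corresponds to changing into the project directory and typing `make`.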
Directory structure
Make sure `makefile` is in the present working directory. The directory structure for the Airbnb project is shown below.
├── data
├── gen
│  ├── analysis
│  │  ├── input
│  │  ├── output
│  │  │  ├── figure
│  │  │  ├── log
│  │  │  └── table
│  │  └── temp
│  ├── data_preparation
│  │  ├── audit
│  │  │  ├── figure
│  │  │  ├── log
│  │  │  └── table
│  │  ├── input
│  │  ├── output
│  │  │  ├── figure
│  │  │  ├── log
│  │  │  └── table
│  │  └── temp
│  └── paper
│     ├── input
│     ├── output
│     └── temp
└── src
   ├── analysis
   ├── data_preparation
   └── paper
- gen: all generated files, such as tables, figures, and logs.
  - Three parts: data_preparation, analysis, and paper.
  - audit: holds the resulting logs, tables, and figures of the audit program. It has three sub-folders: figure, log, and table.
  - temp: holds temporary files, such as intermediate datasets. These files may be deleted at the end.
  - output: holds results, with generated figures in the sub-folder figure, log files in the sub-folder log, and tables in the sub-folder table.
  - input: holds all temporary input files.
- data: all raw data.
- src: all source code.
  - Three parts: data_preparation, analysis, and paper (including TeX files).
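The layout described above could be created with a short script. This is a sketch rather than the project's own setup code: `project_directories` and `create_skeleton` are hypothetical helper names, and whether the project's makefile already creates these folders is not stated here.

```python
from pathlib import Path

# Sub-folders used inside every output and audit folder that holds results.
RESULT_FOLDERS = ("figure", "log", "table")


def project_directories():
    """Return the relative paths of all leaf directories in the tree."""
    dirs = [Path("data")]
    for part in ("analysis", "data_preparation", "paper"):
        dirs.append(Path("src") / part)
        dirs.append(Path("gen") / part / "input")
        dirs.append(Path("gen") / part / "temp")
    for sub in RESULT_FOLDERS:
        dirs.append(Path("gen/analysis/output") / sub)
        dirs.append(Path("gen/data_preparation/output") / sub)
        # Only data_preparation has an audit folder.
        dirs.append(Path("gen/data_preparation/audit") / sub)
    # The paper output folder has no sub-folders.
    dirs.append(Path("gen/paper/output"))
    return dirs


def create_skeleton(root):
    """Create the directory tree under root; existing folders are kept."""
    for rel in project_directories():
        (Path(root) / rel).mkdir(parents=True, exist_ok=True)
```

Because `mkdir` is called with `exist_ok=True`, running `create_skeleton(".")` repeatedly is safe and never overwrites existing files.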