Automating Workflows
Up to this point, you should have created the following five R scripts:
| File | Description | Phase |
|---|---|---|
| download.R | Downloads the data from Inside Airbnb and stores it as csv files | Input |
| clean.R | Preprocesses the raw data into an aggregated format ready for analysis and visualisation | Transformation |
| pivot_table.R | Creates a pivot table for the number of reviews by region across time | Transformation |
| plot_all.R | Creates a line chart for the total number of reviews in a city across time | Output |
| plot_Amsterdam.R | Creates a line chart for the number of reviews for the top 3 neighborhoods in Amsterdam | Output |
As you've worked through the set of exercises, you've repeatedly built on preliminary results. For example, the plot for the top 3 neighborhoods in Amsterdam (plot_Amsterdam.pdf) could only be created once the pivot_table.csv file had been generated. Similarly, the preprocessing step (clean.R) could only run once the data (listings.csv & reviews.csv) had been downloaded. These dependencies are depicted in the figure below.

Revisit the study notes on "Automating your Pipeline" and write a makefile that captures the end-to-end process (from download.R to plot_all.pdf & plot_Amsterdam.pdf). Also, add two phony targets: all and clean.
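If you get stuck, the dependency chain described above could be sketched roughly as follows. This is one possible layout, not the reference solution. It assumes that clean.R writes a file called aggregated_df.csv (a hypothetical name, so use whatever your script actually saves), that plot_all.R reads that aggregated file, that all files live in the same directory as the makefile, and that Rscript and rm are available on your command line. Recipe lines must start with a tab character, not spaces.

```make
# Sketch only. aggregated_df.csv is an ASSUMED output name for clean.R,
# and the input of plot_all.R is ASSUMED; adjust both to match your scripts.

.PHONY: all clean

# Default target: build both plots
all: plot_all.pdf plot_Amsterdam.pdf

# Input: download the raw data
listings.csv reviews.csv: download.R
	Rscript download.R

# Transformation: preprocess the raw data (output name is an assumption)
aggregated_df.csv: clean.R listings.csv reviews.csv
	Rscript clean.R

# Transformation: reviews by region across time
pivot_table.csv: pivot_table.R aggregated_df.csv
	Rscript pivot_table.R

# Output: total number of reviews across time (input is an assumption)
plot_all.pdf: plot_all.R aggregated_df.csv
	Rscript plot_all.R

# Output: top 3 neighborhoods, which needs pivot_table.csv
plot_Amsterdam.pdf: plot_Amsterdam.R pivot_table.csv
	Rscript plot_Amsterdam.R

# Remove all generated files so the pipeline can be rebuilt from scratch
clean:
	rm -f listings.csv reviews.csv aggregated_df.csv pivot_table.csv plot_all.pdf plot_Amsterdam.pdf
```

Because each script is listed as a prerequisite of its own output, editing a script (for example, changing a URL in download.R) makes that step and everything downstream of it out of date, so the next make run rebuilds exactly those files.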
Exercise
Replace the values of url_listings and url_reviews with the links to a historical Amsterdam dataset from the previous year (gather the links from the "show archived page"). Then run make again in the root directory.
Do the same for a recent Airbnb dataset from New York. If done correctly, it should not take more than a minute (power to automation!). Do your workflows still run as expected? How about the plot_Amsterdam.R file? Why is that?