Automating Workflows
Up to this point, you should have created the following five R scripts:
| File | Description | Phase |
|---|---|---|
| download.R | Downloads the data from Inside Airbnb and stores it in csv format | Input |
| clean.R | Preprocesses the raw data into an aggregated format ready for analysis and visualisation | Transformation |
| pivot_table.R | Creates a pivot table of the number of reviews by region across time | Transformation |
| plot_all.R | Creates a line chart of the total number of reviews in a city across time | Output |
| plot_Amsterdam.R | Creates a line chart of the number of reviews for the top 3 neighborhoods in Amsterdam | Output |
As you've worked through the set of exercises, you've repeatedly built on preliminary results. For example, the plot for the top 3 neighborhoods in Amsterdam (`plot_Amsterdam.pdf`) could only be created once the `pivot_table.csv` file had been generated. Similarly, the preprocessing step (`clean.R`) could only run once the data (`listings.csv` and `reviews.csv`) had been downloaded. These dependencies are depicted in the figure below.
Revisit the study notes on "Automating your Pipeline" and write a `makefile` that captures the end-to-end process (from `download.R` to `plot_all.pdf` and `plot_Amsterdam.pdf`). Also add `all` and `clean` phony targets.
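A minimal sketch of such a makefile is shown below. It assumes that the scripts are run with `Rscript`, that `clean.R` writes its aggregated output to a file called `cleaned_data.csv`, and that both plotting scripts read `pivot_table.csv`; the `cleaned_data.csv` name and these prerequisites are placeholders, so adjust the targets and dependencies to match the file names your own scripts actually read and write.

```make
.PHONY: all clean

all: plot_all.pdf plot_Amsterdam.pdf

# download.R is assumed to write listings.csv and reviews.csv
# to the project's root directory
listings.csv reviews.csv: download.R
	Rscript download.R

# "cleaned_data.csv" is a hypothetical name for clean.R's aggregated output
cleaned_data.csv: clean.R listings.csv reviews.csv
	Rscript clean.R

pivot_table.csv: pivot_table.R cleaned_data.csv
	Rscript pivot_table.R

plot_all.pdf: plot_all.R pivot_table.csv
	Rscript plot_all.R

plot_Amsterdam.pdf: plot_Amsterdam.R pivot_table.csv
	Rscript plot_Amsterdam.R

# remove all generated files so the pipeline can be rebuilt from scratch
clean:
	rm -f listings.csv reviews.csv cleaned_data.csv pivot_table.csv plot_all.pdf plot_Amsterdam.pdf
```

With a makefile like this in place, running `make` from the root directory rebuilds only the files whose prerequisites have changed, while `make clean` removes every generated file so the entire pipeline can be rerun from scratch.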
Exercise
Swap the `url_listings` and `url_reviews` for a historical dataset of Amsterdam from the previous year (gather the links from the "show archived page"). Run `make` again in the root directory.
Do the same for a recent Airbnb dataset from New York. If done correctly, it should not take more than a minute (power to automation!). Do your workflows still run as expected? How about the `plot_Amsterdam.R` file? Why is that?