Overview
Many R users rely on the dplyr package or the base R read.table() function to import their datasets as a data frame. Although this works well for relatively small datasets, we recommend using the data.table R package instead because it is significantly faster. This building block provides you with some practical tips for dealing with large datasets in R.
Code
As a starting point, make sure to clean your working environment in RStudio. Oftentimes, there are datasets stored in memory that you worked with earlier but are no longer using. Click on the broom icon in the top-right pane to remove all objects from the environment.
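If you prefer to do this from the console rather than through the RStudio interface, the following minimal sketch clears all objects and asks R to release unused memory (calling gc() explicitly is optional, as R normally handles this by itself):

# remove all objects from the current environment
rm(list = ls())

# optionally trigger garbage collection to free up memory
gc()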
In addition, switching from the read.csv() function to fread() can greatly improve the performance of your program in our experience. Below we illustrate how you can import (a subset of) the data, determine the object size, and store the derivative version of the file for future use.
# import package
library(data.table)
# import data with data.table package
df <- fread("<YOUR_DATASET.csv>")
# only import the first couple of rows for exploratory analysis
df <- fread("<YOUR_DATASET.csv>", nrows = 500)
# only import the columns you actually use
df <- fread("<YOUR_DATASET.csv>", select = c(1, 2, 5)) # column indices
df <- fread("<YOUR_DATASET.csv>", select = c("date", "country", "revenue")) # column names
# print object size in bytes (for a quick comparison)
object.size(df)
# store the derivative file for future use
fwrite(df, "<YOUR_CLEANED_DATASET.csv>")
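To see the speed difference for yourself, you can time both import functions on the same file. The sketch below is a minimal comparison that assumes the placeholder file name is replaced by your own dataset; the actual gains depend on the size and structure of your data.

# compare the import time of base R's read.csv() with data.table's fread()
system.time(df_base <- read.csv("<YOUR_DATASET.csv>"))
system.time(df_dt <- fread("<YOUR_DATASET.csv>"))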