Many R users rely on base R's read.table() family of functions, such as read.csv(), to import their datasets as a data frame. Although this works well for relatively small datasets, we recommend using the data.table R package instead because it is significantly faster. This building block provides you with some practical tips for dealing with large datasets in R.
As a starting point, make sure to clean your working environment in RStudio. Often, datasets that you worked with earlier but no longer use are still stored in memory. Click on the broom icon in the top right window to remove all objects from the environment.
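The same cleanup can also be done from the console or at the top of a script. The snippet below is a minimal sketch using base R only; the objects `old_data` and `old_model` are hypothetical stand-ins for leftovers from an earlier session.

```r
# Hypothetical leftover objects from an earlier session
old_data <- data.frame(x = 1:10)
old_model <- lm(x ~ 1, data = old_data)

# List what is currently stored in the environment
print(ls())

# Remove a specific object you no longer need ...
rm(old_model)

# ... or remove all (non-hidden) objects in the environment
rm(list = ls())

# Run garbage collection so R can release the freed memory
invisible(gc())
```

Removing individual objects with `rm()` is the safer choice in the middle of an analysis, while `rm(list = ls())` is best kept for the moment you want a clean slate.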
In addition, in our experience, switching from the read.csv() function to fread() can greatly improve the performance of your programme. Below we illustrate how you can import (a subset of) the data, determine the object size, and store the derived version of the file for future use.
    # import package
    library(data.table)

    # import data with data.table package
    df <- fread("<YOUR_DATASET.csv>")

    # only import the first couple of rows for exploratory analysis
    df <- fread("<YOUR_DATASET.csv>", nrows = 500)

    # only import the data you actually use
    df <- fread("<YOUR_DATASET.csv>", select = c(1, 2, 5))  # column indices
    df <- fread("<YOUR_DATASET.csv>", select = c("date", "country", "revenue"))  # column names

    # print object size in bytes (for a quick comparison)
    object.size(df)

    # store the derived file for future use
    fwrite(df, "<YOUR_CLEANED_DATASET.csv>")
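To try these steps without a dataset of your own, the following sketch generates a hypothetical sales table, writes it to a temporary CSV file, and then compares read.csv() with fread() on that file. The column names, the number of rows, and the file location are all assumptions made for illustration. Timings depend on your machine and on the size of the file, so no expected values are shown.

```r
library(data.table)

# Build a hypothetical example dataset and write it to a temporary CSV file
set.seed(1)
n <- 100000
sales <- data.table(
  date    = as.character(as.Date("2020-01-01") + sample(0:364, n, replace = TRUE)),
  country = sample(c("NL", "BE", "DE"), n, replace = TRUE),
  units   = sample(1:50, n, replace = TRUE),
  revenue = round(runif(n, 10, 1000), 2)
)
csv_path <- tempfile(fileext = ".csv")
fwrite(sales, csv_path)

# Time both import functions on the same file
time_base  <- system.time(df_base  <- read.csv(csv_path))
time_fread <- system.time(df_fread <- fread(csv_path))
cat("read.csv elapsed:", time_base[["elapsed"]], "seconds\n")
cat("fread elapsed:   ", time_fread[["elapsed"]], "seconds\n")

# Import only the columns you need and compare the in-memory size
df_small <- fread(csv_path, select = c("date", "revenue"))
cat("All columns:", format(object.size(df_fread), units = "MB"), "\n")
cat("Two columns:", format(object.size(df_small), units = "MB"), "\n")

# Remove the temporary file
unlink(csv_path)
```

Passing `units = "MB"` to `format()` makes the output of `object.size()` easier to read than the raw byte count when you compare several objects.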