Join the community!
Visit our GitHub or LinkedIn page to join the Tilburg Science Hub community, or check out our contributors' Hall of Fame!
Want to change something or add new content? Click the Contribute button!
Discover Dask, a valuable solution to handle large datasets in Python that provides parallel computing functionalities to popular libraries such as Pandas and Numpy. In this building block, you will be introduced to the package Dask. Moreover, you will learn some of Dask’s fundamental operations that will allow you to handle and work with large datasets in Python in a much more effective and efficient manner.
Memory Errors when working with large datasets
When trying to import a large dataset to dataframe format with
pandas (for example, using the
read_csv function), you are likely to run into
MemoryError. This error indicates that you have run out of memory in your RAM.
Pandas uses in-memory analytics, so larger-than-memory datasets won’t load. Additionally, any operations performed on the dataframe require memory as well.
Wes McKinney - the creator of the Python
pandas project, noted in his 2017 blog post:
pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset
You can check the memory usage of each column of the
pandas DataFrame (including the dataframe’s index) with the following line of code:
pandas only uses a single CPU core to perform computations, so it is relatively slow, especially when working with larger datasets.
One of the solutions to memory errors is to use another library. Here
Dask comes in handy.
Dask is a Python library for parallel computing, which can perform computations on large datasets while scaling well-known Python libraries such as
Dask splits the dataset into a number of partitions. Unlike
Dask partition is sent to a separate CPU core. This feature allows us to work on a larger-than-memory dataset but also speeds up the computations on that dataset.
Dask is included by default in the Anaconda distribution. Otherwise, you can also use pip to install everything required for the most common uses of
Dask or choose to only install the
python -m pip install "dask[complete]" # Install everything python -m pip install dask # Install only core parts of Dask
Alternatively, see other installing options here.
Dask DataFrame is a collection of smaller
pandas DataFrames, split along the index.
The following are the fundamental operations on the
# import the Dask DataFrame module import dask.dataframe as dd # read a single CSV file df = dd.read_csv('/path/example-01.csv') # read multiple CSV files at once df = dd.read_csv('/path/example-*.csv') # check the number of partitions df.npartitions # change the number of partitions df = df.repartition(npartitions=10) # save the Dask DataFrame to CSV files (one file per partition) df.to_csv('/path/example-*.csv') # save the Dask DataFrame to a single CSV file (by converting it to pandas DataFrame first) df.compute().to_csv('/path/example.csv')
Dask DataFrame utilizes a great portion of the
pandas API; therefore, there are a lot of similarities in use. However,
Dask DataFrame doesn’t support all
For the full list of operations, see the Dask Dataframe documentation.
Dask vs pandas
The main difference between
pandas is that in
Dask, each computation requires you to call the
compute() function. This is because
Dask uses so-called lazy evaluation, meaning that the evaluation of an expression will not be executed unless explicitly requested to do so.
For example, this is how you would get column mean in
# get column mean in pandas df['column1'].mean() # get column mean in Dask df['column1'].mean().compute()
To convert a
Dask DataFrame to a
pandas DataFrame, you call the
compute() function on that dataframe:
After that, you continue working with
pandas. You usually want to do this after reducing the large dataset with
Dask (for example, by selecting a subsection) to a manageable level.
The entire dataset must fit in the memory before applying the
Dask Array implements a large subset of
NumPy API, breaking up the large array into many small arrays. You can use
Dask Array instead of
NumPy if you are out of RAM or experiencing performance issues.
You can see the list of functionalities here.
Daskyou can handle and manage efficiently large datasets in Python.
Daskdivides datasets into partitions and distributes the work across multiple CPU cores.
Dask DataFrames is user-friendly for
pandas users due to extensive use of the
DaskDataFrame doesn’t support all pandas functionalities. This makes it an interesting option to Reduce datasets with Dask first, and then switch to pandas for detailed analysis.