## Using Dask to Describe Data

We learned how to Handle Large Datasets in Python in a general way, but now let's dive deeper by working through a practical example. To illustrate how to use Dask, we perform simple descriptive and analytical operations on a large dataset. We use a flight dataset, available on Kaggle.com, containing over 5 million flight delays and cancellations from 2015. When loaded with `pandas`, the dataframe occupies more than 1.3GB of memory, which would make operations rather slow. Let's load the data with `dask` instead. For all operations we also provide the equivalent `pandas` code for comparison.

### Loading the data

The first thing we need to do is import the `dask` library and load the data. Make sure that the downloaded data is in the same working directory as your Python script. In Dask, we do this as follows:

```
# import the Dask DataFrame module
import dask.dataframe as dd
#read the downloaded csv file
flights = dd.read_csv('flights.csv', assume_missing=True, dtype={'CANCELLATION_REASON': 'object'})
```

The equivalent `pandas` code would be:

```
# import the pandas module
import pandas as pd
#read the downloaded csv file
flights = pd.read_csv('flights.csv')
```

The arguments `assume_missing` and `dtype` are added because `dask` is likely to raise a *ValueError*: it infers the type of each column by reading a sample of data from the start of the file, and this inference is sometimes incorrect. The fix is to manually specify the type each column should have (with `dtype`) or to assume all integer columns have missing values and request their conversion to float (with `assume_missing`).

If you leave out these arguments and the error is raised, the error message will suggest a possible fix.

### Inspecting the data

Now that we have loaded the data, let's see what it looks like and find some basic information about it. We can check how many partitions the data has been divided into and display the first rows with the following commands:

Since `pandas` does not partition a dataset, there are no equivalent commands.

We can see the column names and shape of the dataframe too. In `dask`:

```
#display dataframe columns
flights.columns
#check the shape of dataframe
flights.shape #outputs delayed object
flights.shape[0].compute() #calculates the number of rows
```

Note that the `.shape` attribute in `dask` doesn't immediately return output. Instead, it gives something like `(Delayed('int-4b01ce40-f552-432c-b591-da8955b3ea9c'), 31)`. This is because `dask` uses lazy evaluation: it delays the evaluation of an expression until its value is needed. We use `.compute()` to evaluate the expression. Why the delay? Because to count the number of rows, `dask` needs to work through each partition and sum the number of rows in each one. More information on lazy evaluation is available in our building block Handle Large Datasets in Python.

If we wanted the column names and shape via `pandas`:
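A sketch of the `pandas` equivalent (run on a tiny stand-in dataframe so it is self-contained):

```
import pandas as pd

# tiny stand-in for the flights dataframe
flights = pd.DataFrame({"MONTH": [1, 2, 3],
                        "ARRIVAL_DELAY": [5.0, -3.0, 12.0]})

# display dataframe columns
print(flights.columns)
# check the shape; pandas returns the tuple immediately
print(flights.shape)
```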

In `dask`, the `info()` function doesn't give us the output we know from `pandas`; it only returns the number of columns, the first and last column names, and dtype counts. Alternatively, we can do the following to get each column's dtype and non-null/null counts:

```
#show types of columns
flights.dtypes
#count non-null
flights.notnull().sum().compute()
#alternatively, count null
flights.isna().sum().compute()
```

To achieve the same thing in `pandas`, we'd write the following:
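A minimal sketch of the `pandas` version (on a tiny stand-in dataframe with a missing value, so both counts are visible):

```
import pandas as pd

# tiny stand-in for the flights dataframe, with one missing delay
flights = pd.DataFrame({"ARRIVAL_DELAY": [5.0, None, 12.0]})

# show types of columns
print(flights.dtypes)
# count non-null values per column
print(flights.notnull().sum())
# alternatively, count null values per column
print(flights.isna().sum())
```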

### Descriptive statistics, Grouping, Filtering

We want to investigate the delays. More specifically: what is the biggest delay? Which airline has the largest total delay? Is there a difference in average arrival delay between certain periods of the year?

Let's start with descriptive statistics to get a grasp of the data.
Starting with `dask`:

```
#compute descriptive statistics for the whole dataframe
flights.describe().compute()
#alternatively, for just one column
flights['ARRIVAL_DELAY'].describe().compute()
```

Which is similar to how we'd work in `pandas`:

```
#compute descriptive statistics for the whole dataframe
flights.describe()
#alternatively, for just one column
flights['ARRIVAL_DELAY'].describe()
```

If we wanted to find the largest departure delay, we'd use the `max()` method on the `DEPARTURE_DELAY` column. In `dask`:

Or, equivalently, in `pandas`:
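The `pandas` version is the same call without `.compute()` (again sketched on a stand-in dataframe):

```
import pandas as pd

# tiny stand-in for the flights dataframe
flights = pd.DataFrame({"DEPARTURE_DELAY": [5.0, 42.0, -3.0, 10.0]})

# find the largest departure delay
print(flights['DEPARTURE_DELAY'].max())
```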

Next, we may be interested in which airline has accumulated the most minutes of delay in the data. We can find this by summing the airline delays grouped by airline.

```
#sum the total delay per airline with groupby
flights.groupby(by = 'AIRLINE')['AIRLINE_DELAY'].sum().compute()
```

Or in `pandas`:

```
#sum the total delay per airline with groupby
flights.groupby(by = 'AIRLINE')['AIRLINE_DELAY'].sum()
```

We might also want to know if the average delay differs across months. For example, delays could be higher in winter due to snowstorms and inclement weather. We start by checking that the data contains delays for all months of the year. Then, we compute the mean arrival delay grouped per month.

```
#count the unique values for the "month" column to check we get 12
flights['MONTH'].nunique().compute()
#group the average arrival delay per month
flights.groupby(by = 'MONTH')['ARRIVAL_DELAY'].mean().compute()
#take the maximum average delay
flights.groupby(by = 'MONTH')['ARRIVAL_DELAY'].mean().max().compute()
```

The same operations would be executed in `pandas` as follows:

```
#count the unique values for month column to check we get 12
flights['MONTH'].nunique()
#group the average arrival delay per month
flights.groupby(by = 'MONTH')['ARRIVAL_DELAY'].mean()
#take the maximum average delay
flights.groupby(by = 'MONTH')['ARRIVAL_DELAY'].mean().max()
```

### Variable creation and plotting

When plotting large datasets, we can either use packages that can handle many data points, or reduce the data we want to plot to a smaller shape that fits in memory. For this simple example we want to plot the average arrival delay per month, which is a small dataframe, so we can just use the classic `matplotlib` to make a graph.

Thus, in our answer to the question "Is there a difference in average arrival delay between certain periods of the year?", we can visualize the average delays for each month with a plot:

```
# create new dataframe with the average arrival delay per month
monthly_delay = flights.groupby(by = 'MONTH')['ARRIVAL_DELAY'].mean().reset_index()
# convert the dataframe from dask to pandas
monthly_delay = monthly_delay.compute()
# convert month column to int type
monthly_delay['MONTH'] = monthly_delay['MONTH'].astype(int)
# create bar plot
import matplotlib.pyplot as plt
plt.rcParams.update({'text.color': "white",
                     'axes.labelcolor': "white",
                     'xtick.color': "white",
                     'ytick.color': "white",
                     'font.size': 13}) # set text color and size before plotting for better readability
fig = plt.figure(figsize = (10, 7))
plt.bar(monthly_delay["MONTH"], monthly_delay["ARRIVAL_DELAY"],
        color = 'orange', width = 0.4)
plt.xlabel("Month")
plt.xticks(range(1, monthly_delay['MONTH'].nunique() + 1)) # display all months on the x axis
plt.ylabel("Average arrival delay")
plt.title("Average arrival delay per month")
plt.grid() # display grid
plt.show()
```

*Output*:

We can see that the highest average arrival delay is registered in June, with the second highest in February, while September and October register negative values, meaning that, on average, arrivals were ahead of schedule.

`dask` is a Python library suited to datasets that don't fit in memory. It uses lazy evaluation, which means it doesn't execute operations or commands until actually necessary. Most of the functions are the same as in `pandas`, but remember to add `.compute()` after them if you want `dask` to compute the result at that point in your code.

## See also

For more resources on `dask`, check out these links: