Overview
Pandas, a Python library built on top of the NumPy library, contains essential tools for anyone working with structured data in Python. In just 10 minutes, get started with pandas basics!
- Import pandas
- Data structures
- Inspect data
- Sort & subset data
- Manipulate data
- Group and reshape data
Import pandas
First, install pandas and numpy using the Install Python Packages topic. Then import them with the following code:
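A minimal sketch, using the conventional aliases pd and np:

```python
# Import pandas and NumPy with their conventional aliases
import pandas as pd
import numpy as np

print(pd.__version__)
```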
Data structures in pandas
Pandas has two primary data structures: Series and DataFrame.
Series
The Series is a one-dimensional data structure, consisting of values and an index. You can create a Series using various data types, such as a list or a dictionary.
# List
grades = [5.5, 8.4, 9.6, 7.0, 7.5]
# Create Series from list
series = pd.Series(grades)
series
python
0 5.5
1 8.4
2 9.6
3 7.0
4 7.5
dtype: float64
- Index
The default index is from 0 to n-1, where n is the length of the data. You can change this: for example, if you want the index from 1 to n, or want to have different labels.
# List
grades = [5.5, 8.4, 9.6, 7.0, 7.5]
series = pd.Series(grades, index = [1, 2, "3rd student", 4, 5])
series
python
1 5.5
2 8.4
3rd student 9.6
4 7.0
5 7.5
dtype: float64
- Handling different data types
Series can handle different data types within the same structure. For example, the grades
list now includes a string.
# Including a string in the list
grades = [5.5, 8.4, "highest grade", 7.0, 7.5]
# Create Series
series = pd.Series(grades, dtype = object)
python
0 5.5
1 8.4
2 highest grade
3 7.0
4 7.5
dtype: object
Here, the Series stores the elements as objects to accommodate the different data types (the data includes both floating-point numbers and a string). You can also specify the data type yourself with the dtype parameter, as in the example above.
Data can also be a one-dimensional ndarray, a Python dictionary, or a scalar value. When creating a Series from a dictionary, the keys become the index and values become the data.
# Dictionary
grade_dict = {"Ellis": 9.6, "John": 8.5, "Casper": 5.3}
# Create Series
series = pd.Series(grade_dict)
series
python
Ellis 9.6
John 8.5
Casper 5.3
dtype: float64
Call values or index to see the individual components: series.values returns array([9.6, 8.5, 5.3]), and series.index returns Index(['Ellis', 'John', 'Casper'], dtype='object').
DataFrame
A DataFrame is a two-dimensional data structure in pandas, organized into rows and columns. Each column is essentially a pandas Series. While columns should have the same length, they can hold different data types.
You can create a DataFrame from a dictionary where keys become column names, and values form the data in those columns.
By default, pandas automatically generates index and column names. However, you can customize these by explicitly specifying them during DataFrame creation with the index and columns arguments.
# Dictionary
grades = {
'Name': ['Ellis', 'John', 'Casper'],
'Midterm Grade': [9.6, 8.5, 5.3],
'Final Grade': [9, 7, 6]
}
# Create DataFrame from Dictionary
df = pd.DataFrame(grades,
index= ['Student 1', 'Student 2', 'Student 3'],
columns= ['Name', 'Midterm Grade', 'Final Grade'])
df
python
Name Midterm Grade Final Grade
Student 1 Ellis 9.6 9
Student 2 John 8.5 7
Student 3 Casper 5.3 6
Import data from another file
Instead of manually creating a DataFrame in Python, pandas provides convenient functions to import data from various file formats.
- read_csv(): for importing data from CSV files
- read_excel(): for reading data from Excel files
- read_html(): for reading data from HTML tables
- read_sql(): for importing data from SQL queries or database tables
To use these functions, provide the file path as an argument. For instance:
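For example, a self-contained sketch with read_csv() (the file name grades.csv is hypothetical, and the file is written first so the example runs on its own):

```python
import pandas as pd

# Create a small CSV file so the example is self-contained
# (the file name grades.csv is hypothetical)
with open("grades.csv", "w") as f:
    f.write("Name,Grade\nEllis,9.6\nJohn,8.5\nCasper,5.3\n")

# Read the CSV file into a DataFrame
df = pd.read_csv("grades.csv")
print(df)
```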
Inspect data
To inspect your data, you can use the following methods and attributes:
- .head(): displays the first few rows (default is 5).
- .info(): provides information on each column, such as the data type and the number of missing values.
- .shape: returns the number of rows and columns.
- .describe(): computes summary statistics for each numerical column, such as count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum.
For instance, to examine the DataFrame df that was created before: df.shape returns (3, 3), indicating that df has 3 rows and 3 columns.
To explore specific aspects of the DataFrame, you can use df.index, df.columns, and df.values. For example, df.index returns Index(['Student 1', 'Student 2', 'Student 3'], dtype='object').
Sort & subset data
Sorting data is the process of arranging it in a specific order.
- sort_index(): sorts by index values along a specified axis. By default, it sorts along the index axis (rows).
- sort_values(): sorts by the values of the column specified with the by argument.
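For example, sorting by the values in the Final Grade column; a self-contained sketch that recreates the df from earlier:

```python
import pandas as pd

# Recreate the example DataFrame
df = pd.DataFrame(
    {'Name': ['Ellis', 'John', 'Casper'],
     'Midterm Grade': [9.6, 8.5, 5.3],
     'Final Grade': [9, 7, 6]},
    index=['Student 1', 'Student 2', 'Student 3'])

# Sort rows by the Final Grade column (ascending by default)
sorted_df = df.sort_values(by='Final Grade')
print(sorted_df)
```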
python
Name Midterm Grade Final Grade
Student 3 Casper 5.3 6
Student 2 John 8.5 7
Student 1 Ellis 9.6 9
The data is now sorted based on the values in the Final Grade column, with the lowest grade at the top and the highest at the bottom. To sort in descending order and get the highest grade at the top of the DataFrame, set ascending to False.
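A self-contained sketch of the descending sort, recreating the df from earlier:

```python
import pandas as pd

# Recreate the example DataFrame
df = pd.DataFrame(
    {'Name': ['Ellis', 'John', 'Casper'],
     'Midterm Grade': [9.6, 8.5, 5.3],
     'Final Grade': [9, 7, 6]},
    index=['Student 1', 'Student 2', 'Student 3'])

# Sort rows by Final Grade, highest first
sorted_desc = df.sort_values(by='Final Grade', ascending=False)
print(sorted_desc)
```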
python
Name Midterm Grade Final Grade
Student 1 Ellis 9.6 9
Student 2 John 8.5 7
Student 3 Casper 5.3 6
Subset data
Subsetting data involves selecting specific columns or rows from a DataFrame.
Select specific columns
To look at just one column, you can use square brackets []
and specify the column name. For example, let's say we want to look at the "Final Grade" column:
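A self-contained sketch, recreating the df from earlier:

```python
import pandas as pd

# Recreate the example DataFrame
df = pd.DataFrame(
    {'Name': ['Ellis', 'John', 'Casper'],
     'Midterm Grade': [9.6, 8.5, 5.3],
     'Final Grade': [9, 7, 6]},
    index=['Student 1', 'Student 2', 'Student 3'])

# Selecting a single column returns a Series
final_grades = df["Final Grade"]
print(final_grades)
```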
python
Student 1 9
Student 2 7
Student 3 6
Name: Final Grade, dtype: int64
To select multiple columns, pass a list of column names within square brackets:
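A self-contained sketch, recreating the df from earlier:

```python
import pandas as pd

# Recreate the example DataFrame
df = pd.DataFrame(
    {'Name': ['Ellis', 'John', 'Casper'],
     'Midterm Grade': [9.6, 8.5, 5.3],
     'Final Grade': [9, 7, 6]},
    index=['Student 1', 'Student 2', 'Student 3'])

# Passing a list of column names returns a DataFrame
subset = df[["Final Grade", "Name"]]
print(subset)
```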
python
Final Grade Name
Student 1 9 Ellis
Student 2 7 John
Student 3 6 Casper
Select specific rows
To select specific rows, you can use .loc and specify the index label between square brackets:
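A self-contained sketch, recreating the df from earlier:

```python
import pandas as pd

# Recreate the example DataFrame
df = pd.DataFrame(
    {'Name': ['Ellis', 'John', 'Casper'],
     'Midterm Grade': [9.6, 8.5, 5.3],
     'Final Grade': [9, 7, 6]},
    index=['Student 1', 'Student 2', 'Student 3'])

# Select one row by its index label -> returns a Series
row = df.loc["Student 1"]
print(row)
```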
python
Name Ellis
Midterm Grade 9.6
Final Grade 9
Name: Student 1, dtype: object
Boolean indexing
To subset rows/columns based on specific conditions, you can use Boolean indexing. For example, select only the rows where the final grade is greater than or equal to 7.0:
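A self-contained sketch of this condition, recreating the df from earlier:

```python
import pandas as pd

# Recreate the example DataFrame
df = pd.DataFrame(
    {'Name': ['Ellis', 'John', 'Casper'],
     'Midterm Grade': [9.6, 8.5, 5.3],
     'Final Grade': [9, 7, 6]},
    index=['Student 1', 'Student 2', 'Student 3'])

# Keep only rows where the Final Grade is at least 7.0
passed = df[df["Final Grade"] >= 7.0]
print(passed)
```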
python
Name Midterm Grade Final Grade
Student 1 Ellis 9.6 9
Student 2 John 8.5 7
When subsetting based on multiple conditions, you can use logical operators to combine conditions:
- & (AND): only rows where both conditions hold true are included in the result.
# Midterm grade higher than 7 AND Final grade is 9
df[(df["Midterm Grade"] > 7) & (df["Final Grade"] == 9)]
python
Name Midterm Grade Final Grade
Student 1 Ellis 9.6 9
- | (OR): rows are included in the result if either one or the other condition is true.
# Midterm grade higher than 7 OR Final grade is 9
df[(df["Midterm Grade"] > 7) | (df["Final Grade"] == 9)]
python
Name Midterm Grade Final Grade
Student 1 Ellis 9.6 9
Student 2 John 8.5 7
Manipulate data
Add and delete columns
To add a new column, assign values to a new column name. These values can be new, or calculated from the values of other columns in the DataFrame. For example, say you want a new column with the age of each student, and another with the mean of the midterm and final grade of each student.
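A self-contained sketch, recreating the df from earlier (the ages are example values):

```python
import pandas as pd

# Recreate the example DataFrame
df = pd.DataFrame(
    {'Name': ['Ellis', 'John', 'Casper'],
     'Midterm Grade': [9.6, 8.5, 5.3],
     'Final Grade': [9, 7, 6]},
    index=['Student 1', 'Student 2', 'Student 3'])

# New column with entirely new values
df['Age'] = [19, 22, 21]
# New column calculated from existing columns
df['Mean Grade'] = (df['Midterm Grade'] + df['Final Grade']) / 2
print(df)
```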
The two new columns are added:
python
Name Midterm Grade Final Grade Age Mean Grade
Student 1 Ellis 9.6 9 19 9.30
Student 2 John 8.5 7 22 7.75
Student 3 Casper 5.3 6 21 5.65
To delete the Age column again, you can use .drop() with axis=1, which indicates that it operates along columns.
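A self-contained sketch, recreating the df with the two extra columns:

```python
import pandas as pd

# Recreate the DataFrame including the Age and Mean Grade columns
df = pd.DataFrame(
    {'Name': ['Ellis', 'John', 'Casper'],
     'Midterm Grade': [9.6, 8.5, 5.3],
     'Final Grade': [9, 7, 6],
     'Age': [19, 22, 21],
     'Mean Grade': [9.30, 7.75, 5.65]},
    index=['Student 1', 'Student 2', 'Student 3'])

# Drop the Age column; this returns a new DataFrame by default
without_age = df.drop('Age', axis=1)
print(without_age)
```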
python
Name Midterm Grade Final Grade Mean Grade
Student 1 Ellis 9.6 9 9.30
Student 2 John 8.5 7 7.75
Student 3 Casper 5.3 6 5.65
Modifying the original DataFrame
To modify the original DataFrame, you can set inplace=True. This directly modifies the original DataFrame. Alternatively, you can assign the modified DataFrame back to the same variable with df = df.drop('Age', axis=1). Both approaches achieve the same result.
Transpose data
Appending .T to the DataFrame name transposes it: the columns and the index are switched.
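A self-contained sketch, recreating the df with all five columns:

```python
import pandas as pd

# Recreate the DataFrame with all columns added so far
df = pd.DataFrame(
    {'Name': ['Ellis', 'John', 'Casper'],
     'Midterm Grade': [9.6, 8.5, 5.3],
     'Final Grade': [9, 7, 6],
     'Age': [19, 22, 21],
     'Mean Grade': [9.30, 7.75, 5.65]},
    index=['Student 1', 'Student 2', 'Student 3'])

# Transpose: columns become the index and vice versa
transposed = df.T
print(transposed)
```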
python
Student 1 Student 2 Student 3
Name Ellis John Casper
Midterm Grade 9.6 8.5 5.3
Final Grade 9 7 6
Age 19 22 21
Mean Grade 9.3 7.75 5.65
Handle missing data
Handling missing data is a crucial aspect of data analysis. In any real-world dataset, missing values are common and can impact the accuracy of your analysis. First, we create an example DataFrame with missing values.
data = {'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Eva'],
'Age': [25, np.nan, 30, 22, 35],
'Score': [90, 85, np.nan, 78, np.nan]}
df = pd.DataFrame(data)
python
Name Age Score
0 Alice 25.0 90.0
1 Bob NaN 85.0
2 Charlie 30.0 NaN
3 NaN 22.0 78.0
4 Eva 35.0 NaN
- Detect missing data with .isna().sum()
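A self-contained sketch, recreating the example DataFrame with missing values:

```python
import pandas as pd
import numpy as np

# Recreate the example DataFrame with missing values
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Eva'],
                   'Age': [25, np.nan, 30, 22, 35],
                   'Score': [90, 85, np.nan, 78, np.nan]})

# Count the missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)
```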
python
Name 1
Age 1
Score 2
dtype: int64
This displays the number of missing values per column.
- Remove missing data with .dropna()
This function drops any rows or columns that have missing data: df.dropna(axis=0) removes rows with any missing values, and df.dropna(axis=1) removes columns with any missing values.
python
Name Age Score
0 Alice 25.0 90.0
Only the first row does not contain missing values, so it is the only one left in the new DataFrame.
python
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
All columns contained missing values, so none are left.
- Filling missing data
Use .fillna() to fill missing values with a specific value that you specify between the brackets. For example, fill in the missing values with the mean of each column:
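A self-contained sketch: passing the column means to .fillna() fills each numeric column with its own mean (the Name column has no mean, so its missing value stays).

```python
import pandas as pd
import numpy as np

# Recreate the example DataFrame with missing values
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Eva'],
                   'Age': [25, np.nan, 30, 22, 35],
                   'Score': [90, 85, np.nan, 78, np.nan]})

# Fill missing values per column with that column's mean
filled = df.fillna(df.mean(numeric_only=True))
print(filled)
```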
python
Name Age Score
0 Alice 25.0 90.000000
1 Bob 28.0 85.000000
2 Charlie 30.0 84.333333
3 NaN 22.0 78.000000
4 Eva 35.0 84.333333
Or you can use .ffill() or .bfill() to fill missing values with the previous or next valid value, respectively. For example:
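A self-contained sketch of forward filling, recreating the example DataFrame:

```python
import pandas as pd
import numpy as np

# Recreate the example DataFrame with missing values
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Eva'],
                   'Age': [25, np.nan, 30, 22, 35],
                   'Score': [90, 85, np.nan, 78, np.nan]})

# Forward fill: each missing value takes the previous valid value
filled_ffill = df.ffill()
print(filled_ffill)
```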
python
Name Age Score
0 Alice 25.0 90.0
1 Bob 25.0 85.0
2 Charlie 30.0 85.0
3 Charlie 22.0 78.0
4 Eva 35.0 78.0
Group and reshape data
Grouping
Grouping data allows you to split your dataset into groups based on specific criteria and perform operations on each group. Let's consider the following example DataFrame for student exam scores:
data = {
'name': ['Alice', 'Eva', 'Charlie', 'Jack', 'Zara'],
'gender': ['F', 'F', 'M', 'M', 'F'],
'age': ['20', '20', '21', '19', '19'],
'exam grade': [8.4, 9.6, 9.1, 5.8, 6.3]
}
students_df = pd.DataFrame(data)
print(students_df)
python
name gender age exam grade
0 Alice F 20 8.4
1 Eva F 20 9.6
2 Charlie M 21 9.1
3 Jack M 19 5.8
4 Zara F 19 6.3
To group the data based on gender, use the groupby()
function.
# Group by gender
data_gender = students_df.groupby('gender')
# View all entries of group 'F'
data_gender.get_group('F')
python
name gender age exam grade
0 Alice F 20 8.4
1 Eva F 20 9.6
4 Zara F 19 6.3
You can also group based on more than one category, for example, gender and age:
# Group by gender and age
data_gender_age = students_df.groupby(['gender', 'age'])
# View all entries of group 'F' & age '20'
data_gender_age.get_group(('F', '20'))
python
name gender age exam grade
0 Alice F 20 8.4
1 Eva F 20 9.6
Once you have your data grouped, you can perform different operations within each group to extract meaningful information from your dataset. Three common operations are aggregation, transformation, and filtration.
- Aggregation
Aggregation involves combining data within each group to obtain a single value. For example, calculating the average exam grade for each gender:
avg_grade_by_gender = students_df.groupby("gender")["exam grade"].agg("mean")
print(avg_grade_by_gender)
python
gender
F 8.10
M 7.45
Pandas provides various aggregation functions to summarize data within groups. Common aggregation functions are sum, mean, min, max, median, count, std, first, and last.
- Transformation
Transformation applies a function to each group independently. Let's transform the exam grades to represent the difference from the mean grade within each gender group:
grade_difference = students_df.groupby("gender")["exam grade"].transform(lambda x: x - x.mean())
print(grade_difference)
python
0 0.30
1 1.50
2 1.65
3 -1.65
4 -1.80
- Filtration
Filtration allows you to filter groups based on some condition. For example, keeping only groups with a mean exam grade greater than a certain threshold:
ages_above8 = students_df.groupby("age").filter(lambda x: x["exam grade"].mean() > 8.0)
print(ages_above8)
The age groups with a mean exam grade above 8 are '20' and '21'.
python
name gender age exam grade
0 Alice F 20 8.4
1 Eva F 20 9.6
2 Charlie M 21 9.1
Pivot tables
Pivot tables allow you to reshape and summarize data and are particularly useful for aggregating and analyzing data based on one or more criteria. Let's create a pivot table for average exam grades considering both gender and age:
pivot_table = students_df.pivot_table(values='exam grade', index=['gender', 'age'], aggfunc='mean')
pivot_table
python
exam grade
gender age
F 19 6.3
20 9.0
M 19 5.8
21 9.1
Multi-level indexing
Multi-level indexing is a feature in pandas that lets your DataFrame's index span more than one column, so multiple columns together act as the row identifier.
Let's create an example using the student exam scores DataFrame by moving gender and age into the index. The DataFrame now contains two levels of indexing: gender and age.
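A self-contained sketch using set_index(), recreating the students_df from earlier:

```python
import pandas as pd

# Recreate the student exam scores DataFrame
students_df = pd.DataFrame({
    'name': ['Alice', 'Eva', 'Charlie', 'Jack', 'Zara'],
    'gender': ['F', 'F', 'M', 'M', 'F'],
    'age': ['20', '20', '21', '19', '19'],
    'exam grade': [8.4, 9.6, 9.1, 5.8, 6.3]})

# Move gender and age into a two-level row index
multi_df = students_df.set_index(['gender', 'age'])
print(multi_df)
```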
python
name exam grade
gender age
F 20 Alice 8.4
20 Eva 9.6
M 21 Charlie 9.1
19 Jack 5.8
F 19 Zara 6.3
You can now perform operations using this multi-level index. For example, let's select data for females:
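Selecting the 'F' label on the first index level can be sketched as follows (self-contained, recreating the multi-indexed DataFrame):

```python
import pandas as pd

# Recreate the multi-indexed student DataFrame
students_df = pd.DataFrame({
    'name': ['Alice', 'Eva', 'Charlie', 'Jack', 'Zara'],
    'gender': ['F', 'F', 'M', 'M', 'F'],
    'age': ['20', '20', '21', '19', '19'],
    'exam grade': [8.4, 9.6, 9.1, 5.8, 6.3]})
multi_df = students_df.set_index(['gender', 'age'])

# Select all rows with 'F' in the first index level;
# the gender level is dropped, leaving age as the index
females = multi_df.loc['F']
print(females)
```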
python
name exam grade
age
20 Alice 8.4
20 Eva 9.6
19 Zara 6.3
This is a simple example, but multi-level indexing becomes extremely useful when dealing with more complex datasets where you want to organize and analyze data hierarchically.
Combine data
pandas provides various methods for combining data.
Concatenate
The concat()
function is for combining two or more DataFrames along a particular axis. Consider the following example:
df1 = pd.DataFrame({'Name': ['Ellis', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['John', 'Zara'], 'Age': [22, 28]})
# Concatenate along the rows (axis=0)
total = pd.concat([df1, df2], ignore_index=True)
total
python
Name Age
0 Ellis 25
1 Bob 30
2 John 22
3 Zara 28
The output is a DataFrame where df2 is appended below df1. With ignore_index=True, the index is reset to a new continuous range.
Merge
If you are combining datasets based on a common column, use the merge()
function.
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Ellis', 'Bob', 'John']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 22]})
# Merge based on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how= 'inner')
print(merged_df)
python
ID Name Age
0 1 Ellis 25
1 2 Bob 30
Information is combined from both df1 and df2 based on the common ID column. The on parameter specifies the column to merge on, and the how parameter defines the type of merge (here, inner).
Different types of merges
- inner: keeps only the rows where the key(s) are present in both DataFrames.
- outer: keeps all rows from both DataFrames and fills in missing values with NaN for columns that don't have a matching key.
- left: keeps all rows from the left DataFrame (df1), and fills in NaN for columns from the right (df2) that don't have a matching key.
- right: keeps all rows from the right DataFrame (df2), and fills in NaN for columns from the left (df1) that don't have a matching key.
Join
Joining is similar to merging but is performed on the index rather than a specific column. Here's a simple example using the join()
function:
df1 = pd.DataFrame({'Name': ['Ellis', 'Bob', 'John']}, index=[1, 2, 3])
df2 = pd.DataFrame({'Age': [25, 30, 22]}, index=[1, 2, 4])
joined_df = df1.join(df2, how='inner')
print(joined_df)
python
Name Age
1 Ellis 25
2 Bob 30
Data visualization in Python
Matplotlib and Seaborn are Python libraries that work seamlessly with pandas for creating various types of plots. Explore this topic to delve deeper into data visualization with Matplotlib and Seaborn!