[python, data science, data, programming, manipulation, analysis, tutorial]


Overview

Pandas, a Python library built on top of the NumPy library, contains essential tools for anyone working with structured data in Python. In just 10 minutes, get started with pandas basics!

  • Import pandas
  • Data structures
  • Inspect data
  • Sort & subset data
  • Manipulate data
  • Group and reshape data

Import pandas

First, install pandas and numpy, using the Install Python Packages topic. Then import them with the following code:


import numpy as np
import pandas as pd

Data structures in pandas

Pandas has two primary data structures: Series and DataFrame.

Series

The Series is a one-dimensional data structure, consisting of values and an index. You can create a Series using various data types, such as a list or a dictionary.


# List
grades = [5.5, 8.4, 9.6, 7.0, 7.5]

# Create Series from list
series = pd.Series(grades)
series


0    5.5
1    8.4
2    9.6
3    7.0
4    7.5
dtype: float64

  • Index

The default index is from 0 to n-1, where n is the length of the data. You can change this: for example, if you want the index from 1 to n, or want to have different labels.


# List

grades = [5.5, 8.4, 9.6, 7.0, 7.5]
series = pd.Series(grades, index = [1, 2, "3rd student", 4, 5])
series


1              5.5
2              8.4
3rd student    9.6
4              7.0
5              7.5
dtype: float64
  • Handling different data types

Series can handle different data types within the same structure. For example, the grades list now includes a string.


# Including a string in the list
grades = [5.5, 8.4, "highest grade", 7.0, 7.5]

# Create Series
series = pd.Series(grades, dtype = object)
series


0              5.5
1              8.4
2    highest grade
3              7.0
4              7.5
dtype: object

Here, the Series stores the elements as objects to accommodate the different data types (the data includes both floating-point numbers and a string). Pandas infers this automatically for mixed data, but you can also specify the data type explicitly with the dtype parameter, as done here.

Data can also be a one-dimensional ndarray, a Python dictionary, or a scalar value. When creating a Series from a dictionary, the keys become the index and values become the data.


# Dictionary
grade_dict = {"Ellis": 9.6, "John": 8.5, "Casper": 5.3} 

# Create Series
series = pd.Series(grade_dict)
series


Ellis     9.6
John      8.5
Casper    5.3
dtype: float64

Call values or index to see the individual components.


series.values

returns array([9.6, 8.5, 5.3]), and


series.index

returns Index(['Ellis', 'John', 'Casper'], dtype='object')

DataFrame

A DataFrame is a two-dimensional data structure in pandas, organized into rows and columns. Each column is essentially a pandas Series. While columns should have the same length, they can hold different data types.

You can create a DataFrame from a dictionary where keys become column names, and values form the data in those columns.

By default, pandas automatically generates index and column names. However, you can customize these by explicitly specifying them during DataFrame creation with the index and columns arguments.

# Dictionary
grades = {
    'Name': ['Ellis', 'John', 'Casper'],
    'Midterm Grade': [9.6, 8.5, 5.3],
    'Final Grade': [9, 7, 6]
}

# Create DataFrame from Dictionary
df = pd.DataFrame(grades, 
                  index= ['Student 1', 'Student 2', 'Student 3'], 
                  columns= ['Name', 'Midterm Grade', 'Final Grade'])
df

             Name  Midterm Grade  Final Grade
Student 1   Ellis            9.6            9
Student 2    John            8.5            7
Student 3  Casper            5.3            6

Import data from another file

Instead of manually creating a DataFrame in Python, pandas provides convenient functions to import data from various file formats.

  • read_csv(): for importing data from CSV files
  • read_excel(): for reading data from Excel files
  • read_html(): to read data from HTML tables
  • read_sql(): for importing data from SQL queries or database tables

To use these functions, provide the file path as an argument. For instance:


df2 = pd.read_excel("example_path/file_name.xlsx")
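
read_csv() works the same way. Below is a minimal sketch; grades.csv is a hypothetical file name, and the example writes the file first so it runs end to end:

```python
import pandas as pd

# "grades.csv" is a hypothetical file; write it first so the example runs
df = pd.DataFrame({"Name": ["Ellis", "John"], "Grade": [9.6, 8.5]})
df.to_csv("grades.csv", index=False)

# Read the CSV back into a new DataFrame; column types are inferred
df2 = pd.read_csv("grades.csv")
print(df2)
```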

Inspect data

To inspect your data, you can use the following functions:

  • .head(): Displays the first few rows (default value is 5)
  • .info(): Provides information on each column, such as the data type and the number of non-null values.
  • .shape: Returns the number of rows and columns.
  • .describe(): Computes summary statistics for each numerical column, such as count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum.

For instance, to examine the DataFrame df that was created before:


df.shape

returns (3, 3), indicating that df has 3 rows and 3 columns.

To explore specific aspects of the DataFrame, you can use df.index, df.columns, and df.values. For example:


df.index

returns Index(['Student 1', 'Student 2', 'Student 3'], dtype='object')
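
The other inspection tools can be sketched on the same df:

```python
import pandas as pd

grades = {
    'Name': ['Ellis', 'John', 'Casper'],
    'Midterm Grade': [9.6, 8.5, 5.3],
    'Final Grade': [9, 7, 6]
}
df = pd.DataFrame(grades, index=['Student 1', 'Student 2', 'Student 3'])

print(df.head())      # first rows (all 3 here, since the default of 5 exceeds the length)
df.info()             # data types and non-null counts per column
print(df.describe())  # summary statistics for the numeric columns
```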

Sort & subset data

Sorting data is the process of arranging it in a specific order.

  • sort_index(): sorts by the index labels along a specified axis (the rows by default).
  • sort_values(): sorts by the values of the column specified with the by argument.

df.sort_values(by = "Final Grade")


             Name  Midterm Grade  Final Grade
Student 3  Casper            5.3            6
Student 2    John            8.5            7
Student 1   Ellis            9.6            9

The data is now sorted based on the values in the Final Grade column, with the lowest grade at the top and the highest at the bottom. To sort in descending order and get the highest grade at the top of the DataFrame, set ascending to False.


df.sort_values(by = "Final Grade", ascending = False)

             Name  Midterm Grade  Final Grade
Student 1   Ellis            9.6            9
Student 2    John            8.5            7
Student 3  Casper            5.3            6
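
sort_index(), mentioned above, sorts by the row labels instead of the values. A sketch on the same df:

```python
import pandas as pd

grades = {
    'Name': ['Ellis', 'John', 'Casper'],
    'Midterm Grade': [9.6, 8.5, 5.3],
    'Final Grade': [9, 7, 6]
}
df = pd.DataFrame(grades, index=['Student 1', 'Student 2', 'Student 3'])

# Sort rows by their index labels, in reverse order
sorted_df = df.sort_index(ascending=False)
print(sorted_df)
```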

Subset data

Subsetting data involves selecting specific columns or rows from a DataFrame.

Select specific columns

To look at just one column, you can use square brackets [] and specify the column name. For example, let’s say we want to look at the “Final Grade” column:


df["Final Grade"]


Student 1    9
Student 2    7
Student 3    6
Name: Final Grade, dtype: int64

To select multiple columns, pass a list of column names within the square brackets:

df[["Final Grade","Name"]]

           Final Grade    Name
Student 1            9   Ellis
Student 2            7    John
Student 3            6  Casper

Select specific rows

To select specific rows, you can use .loc, and specify the index label between square brackets:


df.loc['Student 1']


Name             Ellis
Midterm Grade      9.6
Final Grade          9
Name: Student 1, dtype: object
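
.loc also takes an optional second argument for the column label(s), and .iloc does the same with integer positions. A sketch on the same df:

```python
import pandas as pd

grades = {
    'Name': ['Ellis', 'John', 'Casper'],
    'Midterm Grade': [9.6, 8.5, 5.3],
    'Final Grade': [9, 7, 6]
}
df = pd.DataFrame(grades, index=['Student 1', 'Student 2', 'Student 3'])

# Single cell: row label first, then column label
print(df.loc['Student 1', 'Name'])

# Sub-table: a list of row labels and a list of column labels
print(df.loc[['Student 1', 'Student 3'], ['Name', 'Final Grade']])

# Position-based equivalent with .iloc: first row, first column
print(df.iloc[0, 0])
```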

Boolean indexing

To subset rows/columns based on specific conditions, you can use Boolean indexing. For example, select only the rows where the final grade is greater than or equal to 7.0:


df_high = df[df['Final Grade'] >= 7.0]
df_high


            Name  Midterm Grade  Final Grade
Student 1  Ellis            9.6            9
Student 2   John            8.5            7

When subsetting based on multiple conditions, you can use logical operators to combine conditions:

& (AND): Only rows where both conditions hold true are included in the result.


# Midterm grade higher than 7 AND Final grade is 9
df[(df["Midterm Grade"] > 7) & (df["Final Grade"] == 9)]


            Name  Midterm Grade  Final Grade
Student 1  Ellis            9.6            9

| (OR): Rows are included in the result if at least one of the conditions is true.


# Midterm grade higher than 7 OR Final grade is 9
df[(df["Midterm Grade"] > 7) | (df["Final Grade"] == 9)]

            Name  Midterm Grade  Final Grade
Student 1  Ellis            9.6            9
Student 2   John            8.5            7

Manipulate data

Add and delete columns

To add a new column, assign values to a new column name. These values can be new, or calculated from the values of other columns in the DataFrame. For example, suppose you want a new column with the age of each student, and one with the mean of each student's Midterm and Final grade.


df["Age"] = [19, 22, 21]
df["Mean Grade"] = (df["Midterm Grade"] + df["Final Grade"]) / 2
df

The two new columns are added:


             Name  Midterm Grade  Final Grade  Age  Mean Grade
Student 1   Ellis            9.6            9   19        9.30
Student 2    John            8.5            7   22        7.75
Student 3  Casper            5.3            6   21        5.65

To delete the Age column again, you can use .drop(). axis=1 indicates that it’s operating along columns.


df.drop('Age', axis = 1, inplace = True)
df


             Name  Midterm Grade  Final Grade  Mean Grade
Student 1   Ellis            9.6            9        9.30
Student 2    John            8.5            7        7.75
Student 3  Casper            5.3            6        5.65

Tip

Modifying the original DataFrame

To modify the original DataFrame, you can set inplace = True. This directly modifies the original DataFrame. Alternatively, you can assign the modified DataFrame to the same variable with df = df.drop('Age', axis = 1). Both approaches achieve the same result.

Transpose data

Appending .T to the DataFrame name transposes it: the columns and the index are switched.


df.T


              Student 1 Student 2 Student 3
Name              Ellis      John    Casper
Midterm Grade       9.6       8.5       5.3
Final Grade           9         7         6
Mean Grade          9.3      7.75      5.65

Handle missing data

Handling missing data is a crucial aspect of data analysis. In any real-world dataset, missing values are common and can impact the accuracy of your analysis. First, we create an example DataFrame with missing values.


data = {'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Eva'],
        'Age': [25, np.nan, 30, 22, 35],
        'Score': [90, 85, np.nan, 78, np.nan]}

df = pd.DataFrame(data)


      Name   Age  Score
0    Alice  25.0   90.0
1      Bob   NaN   85.0
2  Charlie  30.0    NaN
3      NaN  22.0   78.0
4      Eva  35.0    NaN

  • Detect missing data with .isna().sum()

df.isna().sum()


Name     1
Age      1
Score    2
dtype: int64

This displays the number of missing values per column.

  • Remove missing data with .dropna()

.dropna() drops rows or columns that contain missing data: df.dropna(axis=0) removes rows with any missing values, and df.dropna(axis=1) removes columns.


df_droprow = df.dropna(axis = 0)
df_droprow


  Name   Age  Score
0  Alice  25.0   90.0

Only the first row does not contain missing values, so it is the only one left in the new DataFrame.


df_dropcolumn =  df.dropna(axis = 1)
df_dropcolumn


Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

All columns contained missing values, so none are left.

  • Filling missing data

Use .fillna() to fill missing values with a specific value that you specify between the brackets. For example, fill in the missing values with the mean of each numeric column (numeric_only=True excludes the string column Name):


df.fillna(df.mean(numeric_only = True))


      Name   Age      Score
0    Alice  25.0  90.000000
1      Bob  28.0  85.000000
2  Charlie  30.0  84.333333
3      NaN  22.0  78.000000
4      Eva  35.0  84.333333

Or you can use .ffill() or .bfill() to fill missing values with the previous or next valid value (the older method = 'ffill' argument of fillna() is deprecated in recent pandas versions). For example:


df.ffill()


      Name   Age  Score
0    Alice  25.0   90.0
1      Bob  25.0   85.0
2  Charlie  30.0   85.0
3  Charlie  22.0   78.0
4      Eva  35.0   78.0

Group and reshape data

Grouping

Grouping data allows you to split your dataset into groups based on specific criteria and perform operations on each group. Let’s consider the following example DataFrame for student exam scores:

data = {
    'name': ['Alice', 'Eva', 'Charlie', 'Jack', 'Zara'],
    'gender': ['F', 'F', 'M', 'M', 'F'],
    'age': ['20', '20', '21', '19', '19'],
    'exam grade': [8.4, 9.6, 9.1, 5.8, 6.3]
}

students_df = pd.DataFrame(data)
print(students_df)

    name gender age  exam grade
0    Alice      F  20         8.4
1      Eva      F  20         9.6
2  Charlie      M  21         9.1
3     Jack      M  19         5.8
4     Zara      F  19         6.3

To group the data based on gender, use the groupby() function.


# Group by gender
data_gender = students_df.groupby('gender')

# View all entries of group 'F' 
data_gender.get_group('F')
    name gender age  exam grade
0  Alice      F  20         8.4
1    Eva      F  20         9.6
4   Zara      F  19         6.3

You can also group based on more than one category, for example, gender and age:


# Group by gender and age
data_gender_age = students_df.groupby(['gender', 'age'])

# View all entries of group 'F' & age '20'
data_gender_age.get_group(('F', '20'))

    name gender age  exam grade
0  Alice      F  20         8.4
1    Eva      F  20         9.6
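
Beyond get_group(), a quick way to see how many rows each group contains is .size(). A minimal sketch on the same students_df:

```python
import pandas as pd

data = {
    'name': ['Alice', 'Eva', 'Charlie', 'Jack', 'Zara'],
    'gender': ['F', 'F', 'M', 'M', 'F'],
    'age': ['20', '20', '21', '19', '19'],
    'exam grade': [8.4, 9.6, 9.1, 5.8, 6.3]
}
students_df = pd.DataFrame(data)

# Number of rows per gender group
sizes = students_df.groupby('gender').size()
print(sizes)
```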

Once you have your data grouped, you can perform different operations within each group to extract meaningful information from your dataset. Three common operations are aggregation, transformation, and filtration.

  • Aggregation

Aggregation involves combining data within each group to obtain a single value. For example, calculating the average exam grade for each gender:


avg_grade_by_gender = students_df.groupby("gender")["exam grade"].agg("mean")
print(avg_grade_by_gender)


gender
F    8.10
M    7.45

Tip

Pandas provides various aggregation functions to summarize data within groups. Common aggregation functions are sum, mean, min, max, median, count, std, first, and last.
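
You can also pass several of these functions at once as a list to .agg(); the result then has one column per function. A sketch reusing the students_df from above:

```python
import pandas as pd

data = {
    'name': ['Alice', 'Eva', 'Charlie', 'Jack', 'Zara'],
    'gender': ['F', 'F', 'M', 'M', 'F'],
    'age': ['20', '20', '21', '19', '19'],
    'exam grade': [8.4, 9.6, 9.1, 5.8, 6.3]
}
students_df = pd.DataFrame(data)

# Mean, minimum, and maximum exam grade per gender in one call
summary = students_df.groupby('gender')['exam grade'].agg(['mean', 'min', 'max'])
print(summary)
```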

  • Transformation

Transformation applies a function to each group independently. Let’s transform the exam grades to represent the difference from the mean grade within each gender group:


grade_difference = students_df.groupby("gender")["exam grade"].transform(lambda x: x - x.mean())
print(grade_difference)

0    0.30
1    1.50
2    1.65
3   -1.65
4   -1.80

  • Filtration

Filtration allows you to filter groups based on some condition. For example, keeping only groups with a mean exam grade greater than a certain threshold:


ages_above8 = students_df.groupby("age").filter(lambda x: x["exam grade"].mean() > 8.0)
print(ages_above8)

The age groups with a mean exam grade above 8 are '20' and '21'.


      name gender age  exam grade
0    Alice      F  20         8.4
1      Eva      F  20         9.6
2  Charlie      M  21         9.1

Pivot tables

Pivot tables allow you to reshape and summarize data and are particularly useful for aggregating and analyzing data based on one or more criteria. Let’s create a pivot table for average exam grades considering both gender and age:


pivot_table = students_df.pivot_table(values='exam grade', index=['gender', 'age'], aggfunc='mean')
pivot_table


            exam grade
gender age            
F      19          6.3
       20          9.0
M      19          5.8
       21          9.1
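
With the columns argument, pivot_table() can spread one of the grouping variables across the columns instead; combinations that don't occur in the data become NaN. A sketch on the same students_df:

```python
import pandas as pd

data = {
    'name': ['Alice', 'Eva', 'Charlie', 'Jack', 'Zara'],
    'gender': ['F', 'F', 'M', 'M', 'F'],
    'age': ['20', '20', '21', '19', '19'],
    'exam grade': [8.4, 9.6, 9.1, 5.8, 6.3]
}
students_df = pd.DataFrame(data)

# One row per gender, one column per age; cells hold the mean exam grade
pivot = students_df.pivot_table(values='exam grade', index='gender',
                                columns='age', aggfunc='mean')
print(pivot)
```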

Multi-level indexing

Multi-level indexing is a feature in pandas that lets your DataFrame's index consist of more than one column, so that multiple columns together act as the row identifier.

Let’s create an example using the student exam scores DataFrame. The df now contains multiple levels of indexing: in this case gender and age.


multi_index_df = students_df.set_index(['gender', 'age'])
multi_index_df


             name  exam grade
gender age                    
F      20   Alice         8.4
       20     Eva         9.6
M      21  Charlie         9.1
       19    Jack         5.8
F      19    Zara         6.3

You can now perform operations using this multi-level index. For example, let’s select data for females:


female_data = multi_index_df.loc['F']
female_data


      name  exam grade
age                    
20   Alice         8.4
20     Eva         9.6
19    Zara         6.3

This is a simple example, but multi-level indexing becomes extremely useful when dealing with more complex datasets where you want to organize and analyze data hierarchically.

Combine data

pandas provides various methods for combining data.

Concatenate

The concat() function combines two or more DataFrames along a particular axis. Consider the following example:


df1 = pd.DataFrame({'Name': ['Ellis', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['John', 'Zara'], 'Age': [22, 28]})

# Concatenate along the rows (axis=0)
total = pd.concat([df1, df2], ignore_index=True)
total


    Name  Age
0  Ellis   25
1    Bob   30
2   John   22
3   Zara   28

The output is a DataFrame where df2 is appended below df1. With ignore_index = True the index is reset to a new continuous range.
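
Setting axis=1 instead concatenates side by side, aligning rows on the index. A minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Ellis', 'Bob']})
df2 = pd.DataFrame({'Age': [25, 30]})

# Concatenate along the columns (axis=1); rows are matched by index
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)
```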

Merge

If you are combining datasets based on a common column, use the merge() function.


df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Ellis', 'Bob', 'John']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 22]})

# Merge based on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how= 'inner')
print(merged_df)


   ID   Name  Age
0   1  Ellis   25
1   2    Bob   30

Information is combined from both df1 and df2 based on the common ID column. The on parameter specifies the column to merge on, and the how parameter defines the type of merge (here, inner).

Tip

Different types of merges

  • inner: Keeps only the rows where the key(s) are present in both DataFrames.
  • outer: Keeps all rows from both DataFrames and fills in missing values with NaN for columns that don’t have a matching key.
  • left: Keeps all rows from the left DataFrame (df1), and fills in NaN for columns on the right (df2) that don’t have a matching key.
  • right: Keeps all rows from the right DataFrame (df2), and fills in NaN for columns on the left (df1) that don’t have a matching key.
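
As a sketch, an outer merge of the same df1 and df2 keeps IDs 3 and 4 as well, filling the columns without a match with NaN:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Ellis', 'Bob', 'John']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 22]})

# Outer merge: keep all IDs from both DataFrames
outer_df = pd.merge(df1, df2, on='ID', how='outer')
print(outer_df)
```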

Join

Joining is similar to merging but is performed on the index rather than a specific column. Here’s a simple example using the join() function:


df1 = pd.DataFrame({'Name': ['Ellis', 'Bob', 'John']}, index=[1, 2, 3])
df2 = pd.DataFrame({'Age': [25, 30, 22]}, index=[1, 2, 4])

joined_df = df1.join(df2, how='inner')
print(joined_df)


    Name  Age
1  Ellis   25
2    Bob   30

Tip

Data visualization in Python

Matplotlib and Seaborn are Python libraries that work seamlessly with pandas for creating various types of plots. Explore this topic to delve deeper into data visualization with Matplotlib and Seaborn!

Contributed by Valerie Vossen