[XGBoost, Extreme Gradient Boosting, Machine Learning, Python]


Overview

This building block explains XGBoost, a leading machine learning algorithm renowned for its efficiency and effectiveness. As you progress through this guide, you will acquire practical experience and enhance your comprehension of:

  • The inner workings of XGBoost.

  • Using XGBoost in Python, understanding its hyperparameters, and learning how to fine-tune them.

  • Visualizing XGBoost results and feature importance.

What Is XGBoost?

XGBoost, short for Extreme Gradient Boosting, is an open-source library that implements optimized, distributed gradient-boosted decision tree (GBDT) algorithms within the gradient boosting framework. It offers parallel tree boosting and is one of the most widely used machine learning libraries for regression, classification, and ranking problems.

Installing XGBoost in Python

To use XGBoost for classification or regression tasks in Python, you’ll need to install and import the xgboost package.

To install the package, use pip:

# install XGBoost 
pip install xgboost 

Importing the Package into the Workspace

Import the package into your Python script or notebook using the following convention:

# import XGBoost
import xgboost as xgb
# we'll also need pandas
import pandas as pd

Working Example: Predicting Diamond Prices

Loading the Data

To see XGBoost in action, we will build a model that predicts the price of diamonds based on their characteristics. Let’s load the diamonds dataset from the Seaborn package as our example dataset:

# import Seaborn 
import seaborn as sns

# loading dataset
diamonds = sns.load_dataset("diamonds")
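
To get a quick sense of the data before defining the model inputs, you can look at the first rows and the column types. This is a minimal check; the column names come from the Seaborn diamonds dataset:

# inspect the first rows and the column data types
print(diamonds.head())
print(diamonds.dtypes)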

Configuring the Data

After loading and inspecting your data, you should define your X and y variables. In this example we want to predict the price of a diamond based on its characteristics, so we set price as the y variable and the diamond’s characteristics as the predictors, X.

# select all columns except 'price' as predictors
X = diamonds.loc[:, diamonds.columns != 'price']

# assign the 'price' column to y
y = diamonds[['price']]

The Train-Test Split

Now, let’s divide the data into training and testing sets using the sklearn.model_selection module and then check the shape of the data.

# importing train_test_split 
from sklearn.model_selection import train_test_split
# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# check the shapes
# check the shapes
print('X_train shape:', X_train.shape,
      'X_test shape:', X_test.shape,
      'y_train shape:', y_train.shape,
      'y_test shape:', y_test.shape, sep='\n')

Before training the model with the XGBoost algorithm, we need to complete one more essential step: storing the data in a DMatrix. XGBoost uses the DMatrix class to store the dataset efficiently, which lets it run optimally. Because the diamonds dataset contains categorical columns (cut, color, and clarity), we also pass enable_categorical=True.

# import xgboost package
import xgboost as xgb

# create DMatrix objects for the train and test data
train = xgb.DMatrix(X_train, y_train, enable_categorical=True)
test = xgb.DMatrix(X_test, y_test, enable_categorical=True)

Specifying Hyperparameters

Next, we set the algorithm’s hyperparameters and specify which data stored in a DMatrix to use for training and evaluation. The hyperparameters go in a dictionary, and the evaluation data is given as a list of tuples:

# Define Parameters
param = {"objective": "reg:squarederror"}

# set evaluation 
evallist = [(train, 'train'), (test, 'eval')]
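
As an illustrative sketch, the same dictionary can also carry other common hyperparameters, such as the maximum tree depth, the learning rate, and the row subsampling ratio. The values below are example defaults rather than tuned settings:

# an illustrative, richer parameter dictionary (example values, not tuned)
param = {"objective": "reg:squarederror",  # regression with squared loss
         "max_depth": 6,                   # maximum depth of each tree
         "eta": 0.3,                       # learning rate (shrinkage)
         "subsample": 0.8}                 # fraction of rows sampled per tree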

Training the Model

You also need to decide on the number of boosting rounds, which determines how many rounds the XGBoost algorithm will use to minimize the loss function. For now, let’s set it to 10, but it’s important to note that this is one of the parameters that should be tuned.

# train the model 
num_round = 10
model = xgb.train(param, train, num_round, evallist)

The output displays the model’s performance (RMSE) on both the training and validation sets. As mentioned earlier, the number of boosting rounds is a crucial tuning parameter: more rounds mean more attempts to minimize the loss, but too many can lead to overfitting, and the model may stop improving after a certain number of rounds. To address this, we increase the number of rounds to 1000 and add the early_stopping_rounds parameter to the code above, so that training stops if the evaluation metric has not improved for 50 consecutive rounds.

# increase number of boosting rounds 
num_round = 1000
model = xgb.train(param, train, num_round, evallist,
   # enabling early stopping
   early_stopping_rounds=50)
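
After training, we can check how the model performs on unseen data by predicting on the test DMatrix and computing the RMSE ourselves. This is a minimal sketch using NumPy:

import numpy as np

# predict on the test set and compute the RMSE manually
preds = model.predict(test)
rmse = np.sqrt(np.mean((y_test.to_numpy().ravel() - preds) ** 2))
print('Test RMSE:', rmse)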

XGBoost offers a variety of regularization techniques for refining the model. For detailed information on tuning parameters, please refer to the provided resource.
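
For instance, the native API exposes L1 and L2 penalties on the leaf weights and a minimum loss-reduction threshold for splits through the alpha, lambda, and gamma parameters. Here is a rough sketch of where they go; the values are placeholders, not recommendations:

# illustrative regularization settings (placeholder values, tune for your data)
param = {"objective": "reg:squarederror",
         "lambda": 1.0,   # L2 regularization on leaf weights
         "alpha": 0.0,    # L1 regularization on leaf weights
         "gamma": 0.0}    # minimum loss reduction required to make a split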

Visualizing the Results

Once the model is trained, we can visualize the feature importance, i.e. which characteristics have the largest effect on price. We can also see the trees that have been created.

import matplotlib.pyplot as plt

# visualize the importance of the predictors
xgb.plot_importance(model)
plt.show()

To visualize the trees:


xgb.plot_tree(model)
plt.show()
Tip

Note that to visualize the trees you need the graphviz package installed.

Tip

You can also use XGBoost through its scikit-learn-compatible API (for example, XGBRegressor), but the native XGBoost API offers more capabilities.
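
As a rough sketch of that scikit-learn-style interface (assuming the same train/test split as above and an xgboost version recent enough to support categorical features):

from xgboost import XGBRegressor

# scikit-learn-style estimator; hyperparameters mirror the native API
sk_model = XGBRegressor(objective="reg:squarederror",
                        n_estimators=100,
                        tree_method="hist",
                        enable_categorical=True)
sk_model.fit(X_train, y_train)
sk_preds = sk_model.predict(X_test)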

Summary

We covered how to use XGBoost in Python. In particular, we looked at:

  • Implementing XGBoost in Python
  • Training and evaluating the model
  • Hyperparameter specification
  • Visualizing the trained model output
Contributed by Kheiry Sohooli