# Linear Regression for Prediction in Python


## Overview

This article introduces how to use linear regression to predict a continuous outcome variable and the steps to implement it in Python.

Regression is a technique used in supervised machine learning when the goal is to predict a continuous dependent variable (target) based on one or more independent variables (predictors). Think of it as trying to fit a line (or a curve in some cases) through data points to establish a relationship between the independent and dependent variables. By understanding this relationship, we can make predictions about the target variable for new data points.

For instance, consider the example of predicting a house’s price (dependent variable) based on predictors like its size in square feet (independent variable). Using regression, we can fit a line through historical data on houses sold and their properties. Once this line is established, if we have a new house with specific characteristics, we can predict its approximate price by looking at where it falls on the line.

There are several regression techniques, such as linear regression, Lasso, and Ridge, each designed to handle specific data characteristics and relationships between variables.

For this tutorial, a housing prices dataset from Kaggle will be used.

## Machine Learning Linear Regression Pipeline

### Get the Data

To import CSV data into pandas, use the `read_csv()` function, which accepts file paths or URLs and offers many customization options via its parameters.
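As a minimal sketch, `housing.csv` is a placeholder name for whatever your Kaggle download is called; a tiny sample file is written first only so the snippet runs on its own:

```python
import pandas as pd

# "housing.csv" is a hypothetical file name; point read_csv at your Kaggle download.
# A small sample file is created here so this snippet is self-contained.
with open("housing.csv", "w") as f:
    f.write("Price,Rooms,YearBuilt,Type\n650000,3,1990,h\n482000,2,2005,u\n")

housing = pd.read_csv("housing.csv")
print(housing.shape)  # (2, 4): two rows, four columns
```

`read_csv()` also accepts a URL directly, along with parameters such as `sep`, `usecols`, and `nrows` for finer control over how the file is parsed.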


### Take a Quick Look at the Data

The `head()` method shows you the first 5 rows of your data along with the column names. When you use it, you can see what features are in your dataset and get a quick overview of the values in each column. It’s a handy way to get a glimpse of your data.

Additionally, the `info()` method shows useful information about the total number of samples (rows), the type of each feature (float64, object, etc.), and the number of non-null values in each column.

Another helpful method is `describe()`, which provides a statistical summary of the numerical features in your data, including the count, mean, standard deviation, minimum, maximum, and quartiles of each column.
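All three methods can be tried on a small illustrative frame (the column names here are made up, not the exact Kaggle schema):

```python
import pandas as pd

# Illustrative columns; your Kaggle housing data will have its own schema.
housing = pd.DataFrame({
    "Price": [650000, 482000, 721000, 598000],
    "Rooms": [3, 2, 4, 3],
    "YearBuilt": [1990.0, 2005.0, 1978.0, None],
})

print(housing.head())      # first 5 rows with column names
housing.info()             # row count, dtype, and non-null count per column
print(housing.describe())  # count, mean, std, min, quartiles, max
```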

### Preprocessing the Data for Machine Learning Algorithms

#### Data Cleaning

Cleaning data involves different steps, and what you do depends on the dataset.

1. Removing duplicates: drop repeated rows so each sample appears only once.
2. Handling missing data: for this particular dataset, we choose to drop the rows with missing values.
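Both steps map to one-line pandas calls; a minimal sketch on toy data:

```python
import pandas as pd

# Toy frame: one exact duplicate row and one row with a missing value.
housing = pd.DataFrame({
    "Price": [650000, 650000, 482000, 721000],
    "Rooms": [3.0, 3.0, 2.0, None],
})

housing = housing.drop_duplicates()  # step 1: remove exact duplicate rows
housing = housing.dropna()           # step 2: drop rows with missing values

print(len(housing))  # 2 rows remain of the original 4
```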

#### Handling Categorical Attributes

Use `LabelEncoder` from scikit-learn to transform categorical values into numerical ones.
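A small sketch with made-up category values (note that scikit-learn documents `LabelEncoder` as intended for target labels; `OrdinalEncoder` or `OneHotEncoder` are the usual choices for input features):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column, e.g. house / unit / townhouse.
types = ["h", "u", "t", "h"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(types)

print(list(encoder.classes_))  # ['h', 't', 'u'] — categories in sorted order
print(list(encoded))           # [0, 2, 1, 0]
```

The fitted `classes_` attribute records the mapping, and `inverse_transform()` recovers the original labels from the codes.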

#### Splitting data into train and test sets

Before you start training your model on the data, set aside a part of it to test and evaluate the model afterwards. This process is known as a train-test split. The easiest way to do it is with the `train_test_split()` function from scikit-learn.
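For example, on toy data standing in for the housing features and prices:

```python
from sklearn.model_selection import train_test_split

# X holds the predictors, y the target; toy values for illustration.
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [10, 20, 30, 40, 50, 60, 70, 80]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(len(X_train), len(X_test))  # 6 2
```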

In the code above, pay attention to the `test_size` parameter: it decides how much of the dataset is kept for testing. In our case, we’ve reserved 25% of the dataset for testing. We also set a random seed (`random_state`) so the split stays the same every time we run the code.

#### Feature Scaling

This transformation is important when variables have different ranges. For instance, if you compare “rooms” (ranging from 1 to 7) with “yearbuilt” (ranging from 1196 to 2106), you’ll notice a big difference. Many machine learning algorithms don’t work well when features are on different scales.

Tip

Note that scaling the target variable is generally not necessary.

Tip

It is very important to fit the scaler on the training data only, then use it to transform both the training and test data, to avoid information leakage.
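A sketch using `StandardScaler`, one common scaler (the “rooms” and “yearbuilt” values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative "rooms" and "yearbuilt" columns.
X_train = np.array([[2, 1990], [3, 2005], [5, 1978]], dtype=float)
X_test = np.array([[4, 2000]], dtype=float)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # each scaled training column has mean ~0
```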

### Select and Train a Model

Finally! You’ve defined the problem, explored the data, selected samples for training and testing, and cleaned up the data for ML. Now it’s time to choose and train an ML model.
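Training scikit-learn’s `LinearRegression` follows the usual fit/predict pattern; a sketch on toy data (the sizes and prices are made up and lie exactly on a line):

```python
from sklearn.linear_model import LinearRegression

# Toy data: price is exactly 3000 * size, so the fit is a perfect line.
X_train = [[50], [80], [120], [200]]    # size in square metres
y_train = [150000, 240000, 360000, 600000]

model = LinearRegression()
model.fit(X_train, y_train)

predicted = model.predict([[100]])
print(predicted[0])  # ~300000, read off the fitted line
```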

Note that you can train your model using any other regressor, but here we work with just one. Feel free to explore other regressors and compare the results.

### Evaluate the Model

The most common evaluation metrics for regression are R-squared (R²), mean squared error (MSE), and mean absolute error (MAE).
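All three are available in `sklearn.metrics`; a quick sketch with made-up true and predicted prices:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_test = [300000, 450000, 500000]  # illustrative true prices
y_pred = [310000, 440000, 520000]  # illustrative model predictions

print(r2_score(y_test, y_pred))             # closer to 1 is better
print(mean_squared_error(y_test, y_pred))   # penalizes large errors heavily
print(mean_absolute_error(y_test, y_pred))  # average absolute error in price units
```

MSE is in squared units of the target, which is why its square root (RMSE) is often reported alongside it.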

## Summary

In this article, you have explored the fundamentals of regression in Python. After an introduction to regression, we outlined the essential steps for prediction:

• Reading Data in Python: Understanding how to import and access data within Python.

• Exploring the Dataset: Describing the data via summary statistics to understand its structure.

• Data Preprocessing: Tasks include handling duplicates, addressing missing values, managing categorical variables, splitting data into training and testing sets, and feature scaling.

• Selecting and Training a Model: Choosing an appropriate regression model and training it using the prepared data.

• Evaluating the Model: Assessing the performance of the trained model using suitable evaluation metrics.