Polynomial Regression in Python

A Detailed Tutorial on Polynomial Regression in Python, Overview, Implementation, and Overfitting

I have been writing tutorials on machine learning, deep learning, data visualization, analysis, and statistics for some time, but I realized I hadn't written much about some of the simple machine learning pipelines that can be very useful. Though there are many more advanced machine learning tools and packages out there nowadays, these simple tools are still relevant and useful.

This article will be about Polynomial Regression basics and implementation using the scikit-learn library in Python. We will also work on an overfitting experiment for machine learning beginners.

Polynomial Regression Overview

Polynomial regression is one of the basic machine learning algorithms and is still useful in some business problems today. It is built on linear regression and addresses one of its main limitations: plain linear regression only works well when the relationship between the input variable and the output variable is linear, that is, a straight line.


As a reminder, linear regression follows the very basic formula that we all learned in school:

Y = C + BX

Here, Y is the output variable, X is the input variable, C is the intercept, and B is the slope.

And in machine learning, we just use different terms:

h = theta0 + theta1X

Here, h is the hypothesis or the predicted output variable, X is the input variable, theta1 is the coefficient, and theta0 is the bias term.

But the relationship between input and output variables is not linear like this most of the time.

The relationship between the input and output variables can be of any shape. When it is not a straight line, plain linear regression cannot capture it, so polynomial regression comes in handy.

For simplicity, say we have only one input variable. For polynomial regression, the hypothesis h becomes:

h = theta0 + theta1X + theta2X^2 + theta3X^3 + ... + thetanX^n

Look, we make several variables out of one input variable X just by raising it to different powers. With more than one input variable, the polynomial terms also include the products of the different variables (interaction terms).
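To make this concrete, here is a minimal sketch of how scikit-learn's PolynomialFeatures turns a single input column into several power terms (the numbers are made up purely for illustration):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0], [3.0]])   # one input variable, two samples
poly = PolynomialFeatures(degree=3)
print(poly.fit_transform(X))
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]   -> bias term, X, X^2, X^3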

That is all for the overview of polynomial regression. I am not going any further than this because we are going to use the scikit-learn library to implement it. If you are interested in learning more details about polynomial regression, please check out this link:

Implementation of Polynomial Regression

Here we will implement a polynomial regression using Python’s scikit-learn library. I will use the insurance.csv dataset from Kaggle. Here is the link to the dataset:

First, let’s make a Pandas DataFrame using this dataset. I downloaded the dataset and read it from my working directory:

import pandas as pd 
df = pd.read_csv("insurance.csv")
df.head()

Data Preparation

It is a good idea to check for null values in the dataset at the beginning, because the machine learning model will not work if null values are present. The following line of code returns the number of null values in each column of the DataFrame:

df.isna().sum()

So, we have zero null values in every column of the DataFrame. Very nice!
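If the dataset did have missing values, a minimal sketch of how they could be handled before modeling is shown below (the 'bmi' column is used purely as an example here):

# Only needed if df.isna().sum() reported missing values (not the case here).
df = df.dropna()   # drop rows with any missing value
# or fill a numeric column with its median instead of dropping rows:
# df['bmi'] = df['bmi'].fillna(df['bmi'].median())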

The dataset has three columns with string values. Because machine learning models can take only numeric values, we need to replace those categorical string values with numeric values:

df['sex'] = df['sex'].replace({'female': 1, 'male': 2})
df['smoker'] = df['smoker'].replace({'yes': 1, 'no': 2})
df['region'] = df['region'].replace({'southwest': 1, 'southeast': 2, 'northwest': 3, 'northeast': 4})
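Note that mapping the categories to integers like this imposes an artificial order on them (for example, 'southeast' = 2 is not "twice" 'southwest' = 1). A common alternative, sketched below only as an option and not as part of this tutorial's pipeline, is one-hot encoding with pandas; we will stick with the simple integer mapping above for the rest of the article:

# Alternative: one-hot encode the categorical columns instead of integer codes.
# drop_first=True drops one redundant dummy column per category.
df_encoded = pd.get_dummies(df, columns=['sex', 'smoker', 'region'], drop_first=True)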

With that, the data preparation is done.

Defining Input Features and Target Variable

In this exercise, we will try to predict the ‘charges’ based on the other variables in the dataset. So, ‘charges’ will be our output variable and all the other variables will be our input variables. In that case, if we just drop the ‘charges’ from the ‘df’ we will get our input variables, and if we just separate ‘charges’ from the ‘df’ that will be our output variable or target variable:

X = df.drop(columns = ['charges'])
y = df['charges']

Here, X is the input variable and y is the output variable.

Split the Data for Training and Testing

This is important. We need to train the model and also test its performance. For this reason, we will split the dataset into training and testing data. Training data will be used for training the model only and the testing data will be used to test the model performance.

The scikit-learn library has a train_test_split method for that. We will pass the X and y that we previously defined, a test_size of 0.25, which means 25% of the data will be kept for testing, and a random_state of 1 (you can use any other integer as the random_state).

from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 1)

Scaling the Data

It is good practice to scale the data and bring the features to the same range for polynomial regression. We will use the StandardScaler class from the scikit-learn library for that:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit the scaler on the training data only, then apply the same
# transformation to the test data to avoid data leakage.
X_train_scaler = scaler.fit_transform(X_train)
X_test_scaler = scaler.transform(X_test)

Model Development

As polynomial regression is based on linear regression, we need to import both the LinearRegression and PolynomialFeatures classes. Model development has a few extra steps compared to plain linear regression:

1. The first step is to call the PolynomialFeatures method with the degree of the polynomial.

2. Fit the PolynomialFeatures transformer on the training input features and transform both the training and test input features.

3. Create a LinearRegression model.

4. Finally, when the data is ready, fit the LinearRegression model on the transformed training features and the training output variable.

Here is the code:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=6)
X_poly_train = poly.fit_transform(X_train_scaler)
X_test_poly = poly.transform(X_test_scaler)

lin = LinearRegression()
lin.fit(X_poly_train, y_train)

The model development and training part is done!

Model Evaluation

We will evaluate the model using both training and testing data here. Starting with testing data, here is the predicted ‘charges’ for the test data:

y_pred = lin.predict(X_test_poly)

Calculating the mean absolute error, which is the average of the absolute differences between the actual and predicted ‘charges’:

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

Output:

285296796712.7246

The mean absolute error looks pretty big! Now see how the model performs on training data:

y_pred_train = lin.predict(X_poly_train)
mean_absolute_error(y_train, y_pred_train)

Output:

1970.8913236804585

On the training data, the mean absolute error is much smaller than on the test data. That means the model learned the training data so well that it only performs well on the training data; when we try it on new data, it performs poorly. This is called overfitting.

We do not want to overfit. We want the model to perform well on new data as well.

Solving Overfitting Problem

Solving the overfitting problem starts with hyperparameter tuning. We should try different hyperparameter values to find the ones where the model performs well on both training and testing data. This is a simple model, so we have only one hyperparameter here: the degree in the PolynomialFeatures method. Earlier we used a degree of 6; now we will try a degree of 3.

poly = PolynomialFeatures(degree=3)
X_poly_train = poly.fit_transform(X_train_scaler)
X_test_poly = poly.transform(X_test_scaler)

lin = LinearRegression()
lin.fit(X_poly_train, y_train)

Let’s check the mean absolute error on the training and test data now, starting with the test data:

y_pred = lin.predict(X_test_poly)
mean_absolute_error(y_test, y_pred)

Output:

2819.746326567164

Look! The mean absolute error is much lower than before.

We should still check how it works on the training data:

y_pred_train = lin.predict(X_poly_train)
mean_absolute_error(y_train, y_pred_train)

Output:

2818.997792755733

Wow! The mean absolute error on the training data and the test data are very close: about 2819.0 and 2819.7, respectively. This is great!

They may not always be that close in your case, but they should be close enough.
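If you want to tune the degree more systematically, a minimal sketch is to loop over a few candidate degrees and compare the training and test errors side by side, reusing the X_train_scaler, X_test_scaler, y_train, and y_test variables defined earlier:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

for degree in range(1, 7):
    poly = PolynomialFeatures(degree=degree)
    X_tr = poly.fit_transform(X_train_scaler)
    X_te = poly.transform(X_test_scaler)

    lin = LinearRegression()
    lin.fit(X_tr, y_train)

    train_mae = mean_absolute_error(y_train, lin.predict(X_tr))
    test_mae = mean_absolute_error(y_test, lin.predict(X_te))
    print(f"degree={degree}: train MAE={train_mae:.1f}, test MAE={test_mae:.1f}")

A training error that keeps dropping while the test error blows up as the degree grows is exactly the overfitting pattern we saw with degree 6.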

Conclusion

This tutorial explained the basic concept of one of the fundamental machine learning models and its implementation with the scikit-learn library. It also gave a taste of what overfitting is and a glance at how to solve it. Hope it was helpful!
