Linear regression is the most basic type of machine learning. It is based on the simple straight-line formula most of us learned in middle school. Although there are many more complicated and powerful machine learning algorithms out there, it is still a good idea to learn linear regression thoroughly first, because many other popular machine learning and deep learning algorithms build on it.
In this article:
1. We will talk about how linear regression works.
2. We will work through a simple linear regression problem using Python's scikit-learn library.
3. We will work through a multiple linear regression problem using a real-world dataset and the scikit-learn library.
If you want to see how to develop a linear regression from scratch in plain Python without any library, look at the links at the end of this page.
Prerequisites
You are expected to know at least beginner-level Python. It is also necessary to know the basics of the Pandas and Matplotlib libraries to get started with machine learning.
What is Linear Regression?
In simple language, linear regression describes the relationship between dependent and independent variables. It describes that relationship by fitting a straight line, which is why it is called linear regression. Let's understand it using a dataset. Look at this dataset where we have two variables: Weight (lbs) and Height (inches).

Let's plot them by putting Weight on the x-axis and Height on the y-axis.

We get a scatter of dots. If we then draw the straight line that best fits those dots, we can predict the Height whenever we have the Weight. For example, if the weight is 160 lbs, we can draw a vertical line up from 160 to the fitted line and then a horizontal line from that point to the y-axis, as shown in the picture below. That way we can read off the Height, which is about 66 inches.

Here Weight is the independent variable and Height is the dependent variable.
Here is the straight-line formula as a refresher:
Y = mX + C
Where,
Y is the dependent variable (in the picture above Height is the Y)
X is the independent variable (in the picture above Weight is the X)
m is the slope of the straight line
C is the Y-intercept
The task of linear regression is to find the 'm' and 'C' such that the straight line best represents all the dots. So, m and C are called the training parameters of linear regression. Different machine learning algorithms have different training parameters.
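To make the formula concrete, here is a tiny sketch with made-up values for m and C for the Weight-Height example above (the numbers are purely illustrative, not fitted to any real data):
# Made-up training parameters for the Weight-Height illustration above.
m = 0.12   # slope: extra inches of height per extra pound of weight
C = 47.0   # y-intercept
weight = 160                # independent variable, in lbs
height = m * weight + C     # dependent variable, in inches
print(height)               # 66.2, close to the ~66 inches read off the plot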
Here is the video version of the basics of linear regression:
Assumptions in Linear Regression
- Observations are independent: The observations in the dataset are expected to come from a valid sampling method and to be independent of each other.
- Linear correlation: The dependent and independent variables must be linearly correlated, as shown in the picture above (a quick way to check this is sketched below). If they are not, we need to try other machine learning methods.
- Normality: The data follows a normal distribution. With a big enough dataset, however, we do not need to worry about normality too much.
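If you want a quick way to sanity-check these assumptions before fitting a model, here is a minimal sketch using a small made-up DataFrame (the column names x and y and the numbers are hypothetical):
import pandas as pd
from scipy import stats
# A tiny made-up dataset with one independent (x) and one dependent (y) variable
toy_df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                       'y': [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]})
# Linear correlation: a Pearson coefficient close to +1 or -1 suggests a strong
# linear relationship between x and y
print(toy_df['x'].corr(toy_df['y']))
# Normality: the Shapiro-Wilk test gives a p-value; a large p-value means we
# cannot reject normality (this check is more meaningful on larger samples)
stat, p = stats.shapiro(toy_df['y'])
print(p)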
Simple Linear Regression Example
Let's work on an example. For this example, I will use the famous 'iris' dataset that ships with Seaborn. For that, I need to import the seaborn library first and then use the load_dataset function from seaborn to load the iris dataset.
import seaborn as sns
iris = sns.load_dataset('iris')
iris

As you can see, there are several variables in the iris dataset. For simple linear regression we need only two variables, so I will keep only petal_length and petal_width.
iris = iris[['petal_length', 'petal_width']]
Here is what the iris dataset looks like now:

Let's consider petal_length as the independent variable and petal_width as the dependent variable. Comparing that with the straight-line formula we discussed above, petal_length is the X and petal_width is the 'y'.
X = iris['petal_length']
y = iris['petal_width']
Before diving into the linear regression, we should check if our X and y are linearly correlated. I will use the Matplotlib library to make a scatter plot:
import matplotlib.pyplot as plt
plt.scatter(X, y)
plt.xlabel("petal length")
plt.ylabel("petal width")

It shows that the relationship between the variables is linear.
In machine learning, we do not use the whole dataset to develop the model. We split the dataset into two portions: one portion, called the training set, is for building the model, and the other portion, called the test set, is for evaluating it.
It is important to keep a portion of the dataset aside for testing, because our goal is to build a model that generalizes and works well on data the model hasn't seen before.
We will use the train_test_split method from the scikit-learn library for that.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 23)
Here test_size 0.4 means it will keep 40% of the data for testing purposes.
The random_state = 23 means that if we use this same dataset with random_state 23 again, we will get exactly the same train-test split. random_state can be any integer.
Let’s have a look at the X_train:
X_train
Output:
77 5.0
29 1.6
92 4.0
23 1.7
128 5.6
...
39 1.5
91 4.6
31 1.5
40 1.3
83 5.1
Name: petal_length, Length: 90, dtype: float64
As you can see, X_train is a one-dimensional Series. The machine learning models in the sklearn library expect the training features to be two-dimensional. So X always needs to be two-dimensional, while y can stay one-dimensional.
So, we need to make our X two-dimensional. That can be done with this simple code:
import numpy as np
X_train = np.array(X_train).reshape(-1, 1)
X_train
Here is a part of the X_train now:
array([[5. ],
[1.6],
[4. ],
[1.7],
[5.6],
[4. ],
[4.8],
[5.6],
[5.1],
[4.9],
[1.4],
[1.6],
[5.6],
[1.4],
[1.6],
[5.5],
[5.1],
[4. ],
[1.4],
We need to change the X_test in the same way:
X_test = np.array(X_test).reshape(-1, 1)
X_test
Output:
array([[5.4],
[6. ],
[4.1],
[1.5],
[5. ],
[4.9],
[1.7],
[5.5],
[1.7],
[3.6],
[4.7],
[1.6],
[5.9],
[1.5],
[1.5],
[5.1],
[4.5],
[4.7],
[6.1],
[1.4],
[5.3],
[1.4],
[1.6],
[1.3],
[5.6],
Our data is ready. We need to import the Linear Regression model from the scikit-learn library first.
from sklearn.linear_model import LinearRegression
I will save an instance of the model in a variable:
lr = LinearRegression()
The next step is to fit the training data to the linear regression model.
lr.fit(X_train, y_train)
Model fitting is done. We should have the training parameters m and C. Here is the intercept C:
c = lr.intercept_
c
Output:
-0.3511327422143744
In machine learning language we usually do not use the term slope. Instead, it is known as the coefficient of X. Here is the coefficient for this model:
m = lr.coef_
m
Output:
array([0.41684538])
Now we can use the straight-line formula to predict 'y' from X. Because our X is two-dimensional, the output will also be two-dimensional. I will flatten it to one dimension for display:
Y_pred_train = m*X_train + c
Y_pred_train.flatten()
Output:
array([1.73309416, 0.31581987, 1.31624878, 0.3575044 , 1.98320139,
1.31624878, 1.64972508, 1.98320139, 1.7747787 , 1.69140962,
0.23245079, 0.31581987, 1.98320139, 0.23245079, 0.31581987,
1.94151685, 1.7747787 , 1.31624878, 0.23245079, 1.35793332,
1.85814777, 1.52467147, 2.06657046, 2.40004677, 1.44130239,
0.19076625, 1.31624878, 1.69140962, 1.69140962, 1.31624878,
0.27413533, 1.52467147, 1.52467147, 1.27456424, 1.73309416,
1.64972508, 1.2328797 , 1.7747787 , 2.27499315, 2.19162408,
0.14908171, 2.02488593, 0.8994034 , 0.27413533, 2.108255 ,
1.64972508, 0.23245079, 1.52467147, 1.39961786, 1.81646324,
0.19076625, 0.06571264, 1.10782609, 0.10739718, 1.60804055,
1.39961786, 0.14908171, 2.06657046, 1.44130239, 1.52467147,
0.31581987, 2.52510038, 1.56635601, 1.7747787 , 1.98320139,
1.60804055, 0.27413533, 0.31581987, 1.94151685, 2.06657046,
1.48298693, 0.19076625, 1.81646324, 1.02445701, 2.02488593,
1.10782609, 0.19076625, 0.27413533, 0.27413533, 1.7747787 ,
0.23245079, 0.23245079, 1.69140962, 0.23245079, 1.48298693,
0.27413533, 1.56635601, 0.27413533, 0.19076625, 1.7747787 ])
Here is the predicted output using the training parameters we found from the model.
But we do not have to take the training parameters and calculate the output with the formula ourselves. We can simply use the predict method:
y_pred_train1 = lr.predict(X_train)
y_pred_train1
Output:
array([1.73309416, 0.31581987, 1.31624878, 0.3575044 , 1.98320139,
1.31624878, 1.64972508, 1.98320139, 1.7747787 , 1.69140962,
0.23245079, 0.31581987, 1.98320139, 0.23245079, 0.31581987,
1.94151685, 1.7747787 , 1.31624878, 0.23245079, 1.35793332,
1.85814777, 1.52467147, 2.06657046, 2.40004677, 1.44130239,
0.19076625, 1.31624878, 1.69140962, 1.69140962, 1.31624878,
0.27413533, 1.52467147, 1.52467147, 1.27456424, 1.73309416,
1.64972508, 1.2328797 , 1.7747787 , 2.27499315, 2.19162408,
0.14908171, 2.02488593, 0.8994034 , 0.27413533, 2.108255 ,
1.64972508, 0.23245079, 1.52467147, 1.39961786, 1.81646324,
0.19076625, 0.06571264, 1.10782609, 0.10739718, 1.60804055,
1.39961786, 0.14908171, 2.06657046, 1.44130239, 1.52467147,
0.31581987, 2.52510038, 1.56635601, 1.7747787 , 1.98320139,
1.60804055, 0.27413533, 0.31581987, 1.94151685, 2.06657046,
1.48298693, 0.19076625, 1.81646324, 1.02445701, 2.02488593,
1.10782609, 0.19076625, 0.27413533, 0.27413533, 1.7747787 ,
0.23245079, 0.23245079, 1.69140962, 0.23245079, 1.48298693,
0.27413533, 1.56635601, 0.27413533, 0.19076625, 1.7747787 ])
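By the way, we can quickly verify that the manual formula and the predict method produce the same numbers (an optional sanity check, not part of the main workflow):
import numpy as np
# The manually computed predictions (m*X + c) should match predict exactly,
# up to floating-point precision.
print(np.allclose(Y_pred_train.flatten(), y_pred_train1))  # expected: True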
Let's check whether the prediction is actually right by visualizing it. I will add a line plot of X and the predicted 'y' on top of the scatter plot of petal_length vs petal_width from above.
import matplotlib.pyplot as plt
plt.scatter(X_train, y_train)
plt.plot(X_train, y_pred_train1, color ='red')
plt.xlabel("petal length")
plt.ylabel("petal width")

As this picture shows, the predicted 'y' fits the dots pretty well! But we made this prediction with the training data, and the model was trained on that same data.
Our goal was to train a model that works on other data as well, not only on this training data. That's why we kept the test data: to check whether the model works on unseen data too.
So, here I will use the X_test to predict the y_test.
y_pred_test1 = lr.predict(X_test)
y_pred_test1
Output:
array([1.89983231, 2.14993954, 1.35793332, 0.27413533, 1.73309416,
1.69140962, 0.3575044 , 1.94151685, 0.3575044 , 1.14951063,
1.60804055, 0.31581987, 2.108255 , 0.27413533, 0.27413533,
1.7747787 , 1.52467147, 1.60804055, 2.19162408, 0.23245079,
1.85814777, 0.23245079, 0.31581987, 0.19076625, 1.98320139,
0.23245079, 0.44087348, 1.64972508, 1.48298693, 1.27456424,
0.27413533, 1.27456424, 0.19076625, 2.44173131, 0.27413533,
0.3575044 , 1.56635601, 1.02445701, 1.39961786, 2.14993954,
2.02488593, 0.44087348, 1.19119517, 0.23245079, 1.48298693,
1.73309416, 1.52467147, 2.31667769, 0.27413533, 1.35793332,
2.19162408, 1.89983231, 0.23245079, 1.98320139, 1.52467147,
1.60804055, 2.44173131, 1.39961786, 0.23245079, 1.7747787 ])
We should check if this prediction also fits the dots as well as the training data:
import matplotlib.pyplot as plt
plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred_test1, color ='red')
plt.xlabel("petal length")
plt.ylabel("petal width")

As you can see, the predicted 'y' fits the data pretty well. Now if we have a petal_length, we will be able to predict the petal_width using this model. The predicted petal_width may not be exactly the same as the true value, but it should be close.
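For example, if we had a new flower with a petal_length of, say, 4.2 (an arbitrary value, just for illustration), we could predict its petal_width like this:
# Predict petal_width for a hypothetical new petal_length of 4.2.
# The input must be two-dimensional, hence the nested list.
new_petal_length = [[4.2]]
print(lr.predict(new_petal_length))  # roughly 1.4 with the coefficients found above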
This is the video version of the Simple Linear Regression tutorial:
Multiple Linear Regression Example
In the last example, we had only one variable to predict petal width. But in the real world, we usually work with several training features. Here we will work on a project that more closely resembles a real-world project.
We will use the insurance dataset from Kaggle for this demonstration. Please feel free to download the dataset from this link.
In this dataset, we have a total of 7 columns. Let’s see the dataset first:
import pandas as pd
df = pd.read_csv('insurance.csv')
df

Here is the insurance dataset, where the last column is 'charges' and we have 6 other columns. The task will be to use those 6 columns to predict the charges. We call these 6 columns features or variables.
How does the linear regression formula work for this many features?
When we have multiple features, the straight-line formula Y = mX + C becomes this:
Y = m1X1 + m2X2 + m3X3 + … + mnXn + C
As you can see in the formula, each feature has its own coefficient. So, since we have 6 features in this project, we will also get 6 coefficients.
We cannot plot 'y' against X as we did for simple linear regression, because we now have several variables. The problem is not two-dimensional anymore; it has many dimensions.
If you look at the dataset, it has several categorical variables with string values. The machine learning algorithms in the sklearn library cannot handle string values; they need numeric values. So, I will convert those categorical values to numeric values.
df['sex'] = df['sex'].astype('category')
df['sex'] = df['sex'].cat.codes
df['smoker'] = df['smoker'].astype('category')
df['smoker'] = df['smoker'].cat.codes
df['region'] = df['region'].astype('category')
df['region'] = df['region'].cat.codes

Look at the ‘df’ now. The ‘sex’ feature was a categorical variable where the values were ‘male’ or ‘female’. They became 1 or 0 now. The other categorical variables were also changed in the same way.
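As a side note, another common way to handle categorical columns is one-hot encoding with pd.get_dummies. This is not what we did above (we used category codes), but a minimal sketch, applied to the original string-valued DataFrame, would look like this:
import pandas as pd
# Hypothetical alternative: one-hot encode the string-valued categorical columns
# instead of replacing them with integer codes. drop_first avoids a redundant
# column per categorical variable.
df_raw = pd.read_csv('insurance.csv')
df_onehot = pd.get_dummies(df_raw, columns=['sex', 'smoker', 'region'], drop_first=True)
print(df_onehot.head())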
It is also a good idea to check whether we have null values, because null values in the data will cause errors while training the model.
df.isnull().sum()
Output:
age 0
sex 0
bmi 0
children 0
smoker 0
region 0
charges 0
dtype: int64
We have no null values in any of the columns.
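This dataset is clean, but if there were null values, we would have to deal with them before training. Here is a minimal sketch of two common options (not needed for this dataset):
# Two common ways to handle missing values (only needed if df had nulls):
df_dropped = df.dropna()                             # drop every row that contains a null
df_filled = df.fillna(df.median(numeric_only=True))  # or fill numeric nulls with column medians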
Now let's separate the X and y. As we will use the 6 other variables to predict 'charges', our X will be all of those 6 variables. If we drop the 'charges' column from the dataset, we get our X:
X = df.drop(columns = 'charges')
X

As we will predict the 'charges', our 'y' will be the 'charges' column.
y = df['charges']
Using the train_test_split to get training and testing data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 23)
We already imported the LinearRegression model before, so we can simply use it now. I will save the model in a different variable called lr_multiple and fit the training data to it.
lr_multiple = LinearRegression()
lr_multiple.fit(X_train, y_train)
Model training is done. Let’s see the training parameters now. Here is the intercept:
c = lr_multiple.intercept_
c
Output:
-11827.733141795668
The slopes or the coefficients:
m = lr_multiple.coef_
m
Output:
array([ 256.5772619 , -49.39232379, 329.02381564, 479.08499828, 23400.28378787, -276.31576201])
We got our 6 coefficients as I mentioned before.
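Each prediction is simply the dot product of a row of features with these coefficients, plus the intercept. As an optional check (not part of the original workflow), we can reproduce predict manually:
import numpy as np
# Manual prediction: features . coefficients + intercept.
# This should match lr_multiple.predict(X_train) up to floating-point precision.
manual_pred = X_train.values @ m + c
print(np.allclose(manual_pred, lr_multiple.predict(X_train)))  # expected: True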
Let's predict the charges for both the training data and the testing data:
y_pred_train = lr_multiple.predict(X_train)
y_pred_test = lr_multiple.predict(X_test)
This time we are not going to plot the results, as we have 6 features. Instead, I will introduce another evaluation method: the R2 score.
The R2 score indicates the goodness of fit. It tells you how well your training features explain the variance in the label. The value usually lies between 0 and 1 (it can even be negative for a very poor model), and the closer the R2 score is to 1, the better the model performance.
We need to import r2_score first and then calculate it. It takes the original 'y' values and the predicted 'y' values. As we are predicting for the test data, the original values will be y_test:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred_test)
Output:
0.7911113876316933
Our r2_score is 0.79. So, we can say that the model is pretty strong.
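For reference, r2_score follows the usual definition R2 = 1 - SS_res / SS_tot. Here is a minimal sketch of computing it by hand, just to demystify the metric:
import numpy as np
# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - y_pred_test) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
print(1 - ss_res / ss_tot)  # should match r2_score(y_test, y_pred_test)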
Here is the link to the video version of the multiple linear regression tutorial:
Conclusion
I hope this was a good starting point for you. If you are more interested in learning how to develop linear regression and other popular machine learning algorithms from scratch, please feel free to check out this link.
I hope it was helpful.
Feel free to follow me on Twitter and like my Facebook page.