Linear regression is one of the most basic machine learning algorithms. It is based on the simple straight-line formula that most of us learned in middle school. Although many more complicated and more powerful machine learning algorithms exist, it is still a good idea to learn linear regression well first, because many other popular machine learning and deep learning algorithms build on its ideas.
In this article, we will:
1. Talk about how linear regression works
2. Work on a simple linear regression problem using Python’s scikit-learn library
3. Work on a multiple linear regression problem using a real dataset in the scikit-learn library
If you want to see how to develop a linear regression from scratch in plain Python without any library, look at the links at the end of this page.
You are expected to know at least beginner-level Python. Beginner-level familiarity with the Pandas and Matplotlib libraries is also necessary to get started with machine learning.
What is Linear Regression?
In simple language, linear regression describes the relationship between a dependent variable and one or more independent variables. It describes the relationship by fitting a straight line, which is why it is called linear regression. Let’s understand it using a dataset. Look at this dataset where we have two variables: Weight (lbs) and Height (inches).
Let’s plot them with Weight on the x-axis and Height on the y-axis.
We get the dots. If we then draw the straight line that best fits the dots, we can predict the Height when we know the Weight. For example, if the weight is 160 lbs, we can draw a vertical line from 160 on the x-axis up to the fitted line, and then a horizontal line from that point to the y-axis, as shown in the picture below. That way we find the Height, which is about 66 inches.
Here Weight is the independent variable and Height is the dependent variable.
Here is the straight-line formula as a refresher:
Y = mX + C
Y is the dependent variable (in the picture above Height is the Y)
X is the independent variable (in the picture above Weight is the X)
m is the slope of the straight line
C is the Y-intercept
The task of linear regression is to find the ‘m’ and ‘C’ for which the straight line best represents all the dots. So, m and C are the parameters that linear regression learns during training. Different machine learning algorithms learn different parameters.
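To make the search for ‘m’ and ‘C’ concrete, here is a minimal sketch that uses the standard least-squares formulas in plain Python. The weight and height values are made up for illustration and are not the dataset from the picture, and fit_line is a hypothetical helper name, not a library function.

```python
# Illustrative (made-up) weight/height pairs; NOT the dataset from the picture.
weights = [120, 135, 150, 160, 175, 190]   # lbs
heights = [60, 62, 65, 66, 69, 71]         # inches


def fit_line(xs, ys):
    """Return (m, C) of the least-squares line Y = m*X + C (hypothetical helper)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: how X and Y vary together, divided by how much X varies on its own.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the least-squares line always passes through (x_mean, y_mean).
    C = y_mean - m * x_mean
    return m, C


m, C = fit_line(weights, heights)
predicted_height = m * 160 + C   # prediction for a weight of 160 lbs

print(f"m = {m:.3f}, C = {C:.3f}")
print(f"Predicted height at 160 lbs: {predicted_height:.1f} inches")
```

Libraries such as scikit-learn solve the same least-squares problem for us, so in practice we rarely write these formulas by hand.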
Assumptions in Linear Regression
- Observations are independent: It is expected that the observations in the dataset are taken using a valid sampling method and they are independent of each other.
- Linear Correlation: For linear regression to work, the dependent variable and the independent variable need to be linearly correlated, as shown in the picture above. If they are not, we need to try other machine learning methods.
- Normality: The residuals (the differences between the observed and the predicted values) follow a normal distribution. With a big enough dataset, though, moderate departures from normality are usually not a concern.
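One quick way to check the linear-correlation assumption is the Pearson correlation coefficient. The sketch below uses NumPy and synthetic data (both are assumptions made for illustration, not part of the example that follows) to contrast a perfectly linear relationship with a curved one.

```python
import numpy as np

# Synthetic x values, symmetric around zero.
x = np.linspace(-3, 3, 61)

y_linear = 2 * x + 1   # exactly linear in x
y_curved = x ** 2      # clearly depends on x, but not linearly

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is Pearson's r.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

print(f"Pearson r, linear relationship: {r_linear:.3f}")
print(f"Pearson r, curved relationship: {r_curved:.3f}")
```

The coefficient is close to 1 for the linear case and close to 0 for the curved case, even though y_curved is fully determined by x. A low coefficient therefore means a straight line is a poor fit, not that the variables are unrelated.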
Simple Linear Regression Example
Let’s work on an example. For this example, I will use this famous ‘iris’ dataset from Seaborn. For that, I need to import the seaborn library first. Then use the load_dataset function from the seaborn library to load the iris dataset.
import seaborn as sns
iris = sns.load_dataset('iris')
iris
As you can see, there are several variables in the iris dataset. For simple linear regression we need only two variables, so I will keep only petal_length and petal_width.
iris = iris[['petal_length', 'petal_width']]
Here is what the iris dataset looks like now:
Let’s consider petal_length as the independent variable and petal_width as the dependent variable. Compared with the straight-line formula that we discussed above, petal_length is the X and petal_width is the Y (named y in the code).
X = iris['petal_length']
y = iris['petal_width']
Before diving into the linear regression, we should check if our X and y are linearly correlated. I will use the Matplotlib library to make a scatter plot:
import matplotlib.pyplot as plt
plt.scatter(X, y)
plt.xlabel("petal length")
plt.ylabel("petal width")
It shows that the relationship between the variables is linear.
In machine learning, we do not use the whole dataset to develop the model. We split the dataset into two portions. One portion, called the training set, is used for building the model. The other portion, called the test set, is used for evaluating the model.
It is important to keep a portion of the dataset for testing purposes, because our goal is to build a model that generalizes, meaning it also works well on data that the model hasn’t seen before.
We will use the train_test_split function from the scikit-learn library for that.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 23)
Here test_size = 0.4 means that 40% of the data is kept for testing purposes.
The random_state = 23 makes the split reproducible: if we use this same dataset with random_state = 23 again, it recreates the same train-test split. Any other integer works for random_state as well.
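The effect of test_size and random_state can be checked with a small sketch. It uses a stand-in NumPy array with 150 rows (the same number of rows as the iris dataset) instead of the iris columns, and the variable names with _a, _b, and _c suffixes are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 150 rows; NOT the iris columns used in the article.
X_demo = np.arange(150)
y_demo = X_demo * 2   # each y value is tied to its x value

# Two splits with the same random_state, and one with a different value.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=23)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=23)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=7)

# 60% of the 150 rows go to training, 40% to testing.
print(len(X_train_a), len(X_test_a))

# Same random_state: identical split. Different random_state: different split.
same_seed_matches = np.array_equal(X_train_a, X_train_b)
different_seed_matches = np.array_equal(X_train_a, X_train_c)
print(same_seed_matches, different_seed_matches)

# X and y are shuffled together, so every row keeps its own target value.
pairs_preserved = np.array_equal(y_train_a, X_train_a * 2)
print(pairs_preserved)
```

Fixing random_state matters mostly for reproducibility: anyone who reruns the code gets the same training and test rows, so the results can be compared.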
Let’s have a look at the X_train:
77     5.0
29     1.6
92     4.0
23     1.7
128    5.6
      ...
39     1.5
91     4.6
31     1.5
40     1.3
83     5.1
Name: petal_length, Length: 90, dtype: float64