A linear relationship between two variables is very common, so many mathematical and statistical models have been developed to exploit it and extract more information from the data. This article explains one of the most popular methods in statistics: Simple Linear Regression (SLR).
This Article Covers:
Development of a Simple Linear Regression model
Assessment of how good the model fits
Hypothesis test using ANOVA table
That’s a lot of material to learn in one sitting. All the topics will be covered with a working example. Please work through the example yourself to understand it well.
Developing the SLR model is pretty straightforward. You can either apply the formulas by hand or let the software do it.
The assessment and the hypothesis testing parts may be confusing if you are totally new to them. You may have to go over them slowly a few times. I will try to be precise and to the point.
Simple Linear Regression (SLR)
When a linear relationship is observed between two quantitative variables, Simple Linear Regression can be used to take the explanation and assessment of that data further. Here is an example of a linear relationship between two variables:
The dots in this graph show a positive upward trend. That means as the hours of study increase, exam scores also increase. In other words, there is a positive correlation between the hours of study and the exam scores. From a graph like that, the strength and direction of the correlation between two variables can be roughly judged. But a graph alone cannot quantify the correlation, or tell us how much the exam score changes with each additional hour of study. If you can quantify that, it becomes possible to forecast the exam score from the hours of study. That will be very useful, right?
Simple Linear Regression (SLR) does just that. It uses the old-school straight-line formula that we all learned in school. Here is the formula:
y = c + mx
y is the dependent variable,
x is the independent variable,
m is the slope and
c is the intercept
In the graph above, the exam Score is the ‘y’ and the Hours of Study is the ‘x’. Exam score depends on the hours of study. So, Exam Score is the dependent variable, and Hours of Study is the independent variable.
The slope and intercept are determined using Simple Linear Regression.
Linear regression is all about fitting the best-fit line through the points and finding its intercept and slope. If you can do that, you will be able to estimate the exam score whenever the hours of study are available. How accurate that estimate is depends on some more information. We will get there slowly.
In statistics, beta0 and beta1 are the terms commonly used instead of c and m. So, the equation above looks like this:
y_hat = beta0 + beta1*x
The red dotted line in the graph above is called the Least Squares Regression line. It should be as close as possible to the dots, and the most common way of achieving that is the least squares regression method.
Here y_hat is the estimated or predicted value of the dependent variable (exam scores in the example above).
Remember, predicted values can be different from the original values of the dependent variables. In the graph above, the original data points are scattered. But the predicted or expected values from the equation above will be right on the red dotted line. So, there will be a difference between the original y and the predicted values y_hat.
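The beta0 and beta1 can be calculated using the least squares regression formulas as follows:
beta1 = r * Sy / Sx
beta0 = y_bar - beta1 * x_bar
r is the sample correlation coefficient between the ‘x’ and ‘y’ variables.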
y_bar is the sample mean of the ‘y’ variable.
x_bar is the sample mean of the ‘x’ variable.
Sx is the sample standard deviation of the ‘x’ variable
Sy is the sample standard deviation of the ‘y’ variable
Example of Developing a Linear Regression Model
I hope the discussion above was clear. If not, that’s ok. Now, we will work on an example that will make everything clear.
Here is the dataset to be used for this example:
This dataset contains arm lengths and leg lengths of 30 people. The scatter plot looks like this:
Please feel free to download this dataset and follow along.
There is a linear trend here. Let’s see if we can develop a linear regression equation using data that may reasonably predict the leg length using the arm length.
Arm length is the x-variable
Leg length is the y-variable
Let’s have a look at the formulas above. If we want to find the calculated values of y based on the arm length, we need to calculate the beta0 and beta1.
Required parameters to calculate the beta1: correlation coefficient, the standard deviation of arm lengths, and the standard deviation of the leg lengths.
Required parameters to calculate the beta0: mean of leg lengths, beta1, and the mean of arm lengths.
All the parameters can be calculated very easily from the dataset. I used R to calculate them. You can use any other language you are comfortable with.
First, read the dataset into RStudio:
al = read.csv('arm_leg.csv')
I already showed the whole dataset before. It has two columns: ‘arm’ and ‘leg’ which represent the length of the arms and the length of the legs of people respectively.
For the convenience of calculation, I will save the length of the arms and the length of the legs in separate variables:
arm = al$arm
leg = al$leg
Here is how to find the mean and the standard deviation of the ‘arm’ and ‘leg’ columns:
arm_bar = mean(arm)
leg_bar = mean(leg)
s_arm = sd(arm)
s_leg = sd(leg)
R also has a ‘cor’ function to calculate the correlation between two columns:
r = cor(arm, leg)
Now, we have all the information we need to calculate beta0 and beta1. Let’s use the formulas for beta0 and beta1 described before:
beta1 = r*s_leg/s_arm
beta0 = leg_bar - beta1*arm_bar
The beta1 and beta0 are 0.9721 and 1.9877 respectively.
I wanted to explain the process of working on a linear regression problem from scratch.
Otherwise, R has the ‘lm’ function, where you can simply pass the two variables and it outputs the slope (beta1) and the intercept (beta0).
m = lm(leg~arm)
Call:
lm(formula = leg ~ arm)

Coefficients:
(Intercept)          arm
     1.9877       0.9721
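As a quick sanity check, the hand formulas and ‘lm’ should give the same numbers. Here is a minimal sketch of that comparison. Please note that the x and y vectors below are simulated stand-ins I made up for illustration, not the arm/leg dataset, so the coefficients will not match the ones above.

```r
# Simulated stand-in data (an assumption for illustration, NOT the arm/leg dataset)
set.seed(1)
x <- runif(30, min = 31, max = 44)       # hypothetical "arm" values
y <- 2 + 0.97 * x + rnorm(30, sd = 3)    # hypothetical "leg" values

# Least squares estimates from the formulas
b1 <- cor(x, y) * sd(y) / sd(x)          # slope
b0 <- mean(y) - b1 * mean(x)             # intercept

# The same estimates from lm()
fit <- lm(y ~ x)

print(c(intercept = b0, slope = b1))
print(coef(fit))

# Both approaches should agree up to floating-point error
print(isTRUE(all.equal(unname(coef(fit)), c(b0, b1))))
```

If the last line prints TRUE, the manual calculation and the ‘lm’ function agree.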
Plugging in the values of slope and intercept, the linear regression equation for this dataset is:
y = 1.9877 + 0.9721x
If you know a person’s arm length, you can now estimate the length of his or her legs using this equation. For example, if the length of the arms of a person is 40.1, the length of that person’s leg is estimated to be:
y = 1.9877 + 0.9721*40.1
It is 40.97. In the same way, you can estimate the leg lengths of other people from their arm lengths.
But remember, this is just an estimate or a calculated value of the length of that person’s legs.
One caution, though. When you use arm length to calculate leg length, remember not to extrapolate. That means staying within the range of the data used to build the model. For example, this model was built on arm lengths between 31 and 44.1 cm. Do not calculate the leg length for an arm length of 20 cm. That may not give you a correct estimate.
Interpreting the slope and estimate in plain language:
The slope of 0.9721 means that if the length of arms increases by one unit, the length of legs increases by 0.9721 units on average. Please focus on the word ‘average’.
Not every person who has an arm length of 40.1 will have a leg length of 40.97. It could be a little different. But our model suggests that on average, it is 40.97. As you can see, not all the dots are on the red line. The red dotted line is nothing but the line of all the averages.
The intercept of 1.9877 means that if the length of the arms is zero, the length of legs will still be 1.9877 on average. An arm length of zero is not possible, so in this case the intercept is only theoretical. But in other cases it is meaningful. For example, think of a linear relationship between the hours of study and the exam score. There might be a linear relationship such that the exam score increases with the hours of study. But even if a student did not study at all, s/he may still obtain some score.
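The prediction step and the caution about extrapolation can be put together in a small helper. This is only a sketch: the function name ‘predict_leg’ is hypothetical (my own choice for illustration), and the coefficients and the 31 to 44.1 range are simply the values reported in this article.

```r
# Coefficients and arm-length range as reported in this article
beta0 <- 1.9877
beta1 <- 0.9721
arm_range <- c(31, 44.1)

# Hypothetical helper: estimate leg length and warn when extrapolating
predict_leg <- function(arm_length) {
  outside <- arm_length < arm_range[1] | arm_length > arm_range[2]
  if (any(outside)) {
    warning("Some arm lengths are outside the range used to build the model")
  }
  beta0 + beta1 * arm_length
}

print(predict_leg(40.1))          # inside the observed range
print(predict_leg(c(35, 42.5)))   # works on several values at once
```

Calling predict_leg(20) still returns a number, but it raises a warning because 20 is outside the observed range. With the fitted model ‘m’ from before, the built-in predict(m, newdata = data.frame(arm = 40.1)) should give the same kind of estimate, without any range check.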
How good is this estimate?
This is a good question, right? We can estimate. But how close is this estimate to the real length of that person’s leg?
To explain that we need to see the regression line first.
Using the ‘abline’ function a regression line can be drawn in R:
plot(arm, leg, main="Arm Length vs Leg Length", xlab="Length of Arms", ylab = "Length of Legs")
abline(m, lty = 8, col="red")
Look at this picture. The original points (black dots) are scattered around. The estimated points will fall straight on the red dotted line. In that case, a lot of times the estimated length of legs will be different than the real length of legs for this dataset.
So, it is important to check how well the regression line fits the data.
To find that out, we need to really understand the y-values. For any given data point, there are three y-values to consider.
- The real or observed y-value (the one we get from the dataset; in this example, the length of the legs). Let’s call each of these observed values ‘y_i’.
- The predicted y-value (the leg length that we can calculate from the linear regression equation; remember, it might be different from the original data point y_i). We will call it ‘y_ihat’ for this demonstration.
- The sample average of the y-variable. We already calculated it and saved it in the variable ‘leg_bar’; here we will call it ‘y_bar’.
For assessing how well the regression model fits the dataset, all three of y_i, y_ihat, and y_bar will be very important.
The distance between y_ihat and y_bar is called the regression component.
regression component = y_ihat - y_bar
The distance between the original y point y_i and the calculated y point y_ihat is called the residual component.
residual component = y_i - y_ihat
As a rule of thumb, a regression line that fits the data well will have a regression component bigger than the residual component across all data points. In contrast, a regression line that does not fit the data well will have a residual component larger than the regression component across all data points.
Makes sense, right? If the observed data points are too different from the calculated data points, then the regression line did not fit well. If all the data points fell on the regression line, the residual component would be zero or close to zero.
If we add the regression component and the residual component:
Total = (y_ihat - y_bar) + (y_i - y_ihat) = y_i - y_bar
How do we quantify this? You could simply subtract the mean ‘y’ (y_bar) from each observed y value (y_i) and add up the results. But that gives some positive and some negative values, and they cancel each other out. In fact, the deviations from the mean always add up to zero. So that sum will not represent the real differences between the mean ‘y’ and the observed y values.
One popular way to quantify this is to take the sum of squares. That way, there won’t be any negatives.
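The total sum of squares or ‘Total SS’ is:
Total SS = Σ (y_i - y_bar)^2
The regression sum of squares or ‘Reg SS’ is:
Reg SS = Σ (y_ihat - y_bar)^2
The residual sum of squares or ‘Res SS’ is:
Res SS = Σ (y_i - y_ihat)^2
In each case, the sum runs over all the data points.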
Total SS can also be calculated as the sum of ‘Reg SS’ and ‘Res SS’.
Total SS = Reg SS + Res SS
Everything is ready! Now it’s time to calculate the R-squared value. R-squared is the measure that represents how well the regression line fits the data. Here is the formula for R-squared:
R-squared = Reg SS / Total SS
If the R-squared value is 1, that means, all the variation in the response variable (y-variable) can be explained by the explanatory variable (x-variable).
In contrast, if the R-squared value is 0, none of the variation in the response variable can be explained by the explanatory variable.
This is one of the most popular ways of assessing the fit of the model to the data.
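The three sums of squares and R-squared can be computed in a few lines of R. The sketch below uses simulated stand-in data (my own assumption for illustration, not the arm/leg dataset) and checks two things: that Total SS equals Reg SS plus Res SS, and that Reg SS / Total SS matches the R-squared that R reports.

```r
# Simulated stand-in data (an assumption for illustration, NOT the arm/leg dataset)
set.seed(2)
x <- runif(30, min = 31, max = 44)
y <- 2 + 0.97 * x + rnorm(30, sd = 3)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)   # predicted values, all on the regression line
y_bar <- mean(y)

total_ss <- sum((y - y_bar)^2)       # Total SS
reg_ss   <- sum((y_hat - y_bar)^2)   # Reg SS
res_ss   <- sum((y - y_hat)^2)       # Res SS
r_squared <- reg_ss / total_ss

print(c(total = total_ss, reg = reg_ss, res = res_ss, r_squared = r_squared))

# Total SS = Reg SS + Res SS
print(isTRUE(all.equal(total_ss, reg_ss + res_ss)))
# Same value as the R-squared reported by summary()
print(isTRUE(all.equal(r_squared, summary(fit)$r.squared)))
# In simple linear regression, R-squared is also the squared correlation
print(isTRUE(all.equal(r_squared, cor(x, y)^2)))
```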
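Here is the general form of the ANOVA table. You already know some of the parameters used in the table. We will discuss the rest after the table.
Source | SS | df | MS | F | p-value
Regression | Reg SS | Reg df = k | Reg MS = Reg SS / Reg df | F = Reg MS / Res MS | P(F with Reg df and Res df > F)
Residual | Res SS | Res df = n - k - 1 | Res MS = Res SS / Res df | |
Total | Total SS | n - 1 | | |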
The relationships and parameters in this table are very important in regression analysis. They are what we use to assess the model. We already learned the terms Reg SS, Res SS, and Total SS and how to calculate them.
‘Reg df’ in the table above is the degrees of freedom of the regression sum of squares. It is equal to the number of estimated parameters excluding the intercept; let’s call that number k. In Simple Linear Regression (SLR), k = 1. For multiple regression, k > 1.
‘Res df’ is the degrees of freedom of the residual sum of squares. It is calculated as the number of data points (n) minus k minus 1, or (n-k-1). As mentioned above, for SLR k is always 1. So, Res df for SLR is n-2.
The p-value is the probability of observing a test statistic at least as extreme as the one actually observed, assuming the null hypothesis is true.
One more term needs to be mentioned here. If you calculate R-squared in R, it will give you two R-squared values. We already discussed one of them and how to calculate it. The other one is the adjusted R-squared. Here is the formula:
R-squared-adj = 1 - Res MS / Sy^2
Here, Res MS is the residual mean square (Res SS divided by Res df) and Sy is the standard deviation of the y-variable. The adjusted R-squared represents the proportion of the variance of the y-variable that can be explained by the model.
For large n (n = the number of data points):
R-squared-adj ≈ R-squared
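To see the adjusted R-squared formula in action, here is a minimal sketch. The data are simulated stand-ins (an assumption for illustration, not the arm/leg dataset). It computes the adjusted R-squared as 1 minus the residual mean square over the variance of y, and compares it with the value R reports.

```r
# Simulated stand-in data (an assumption for illustration, NOT the arm/leg dataset)
set.seed(3)
x <- runif(30, min = 31, max = 44)
y <- 2 + 0.97 * x + rnorm(30, sd = 3)
fit <- lm(y ~ x)

n <- length(y)
k <- 1                                          # one predictor in SLR
res_ms <- sum(residuals(fit)^2) / (n - k - 1)   # Res MS = Res SS / Res df

adj_r_squared <- 1 - res_ms / var(y)            # var(y) is Sy^2

print(c(r_squared = summary(fit)$r.squared, adj_r_squared = adj_r_squared))

# Should match the adjusted R-squared reported by summary()
print(isTRUE(all.equal(adj_r_squared, summary(fit)$adj.r.squared)))
```

The adjusted value comes out slightly smaller than the plain R-squared, and the gap shrinks as n grows.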
All the tables and equations are ready. Let’s assess the model we developed before!
Calculating the R-squared and ANOVA table to assess the model and inference from it
First, generate a table with all the parameters:
Feel free to download the Excel file from this link so you can see the implementation and formulas.
Look at the end of the table. We calculated the ‘Total SS’ in two ways: using its formula, and as the sum of ‘Reg SS’ and ‘Res SS’. The two results are almost the same (490.395 and 490.372), so we can use either of them. From this table:
Total SS = 490.372
Reg SS = 261.134
Res SS = 229.238
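If you prefer R over a spreadsheet, a table like this can be built with a data frame. The sketch below uses simulated stand-in data and a hypothetical helper called ‘build_table’ (both are my own assumptions for illustration). It also shows one possible reason why the two ‘Total SS’ values can differ slightly: when the predicted values are rounded before squaring, the identity Total SS = Reg SS + Res SS no longer holds exactly. I cannot confirm from the spreadsheet that this is what happened here, so please treat it as a plausible explanation only.

```r
# Simulated stand-in data (an assumption for illustration, NOT the arm/leg dataset)
set.seed(4)
x <- runif(30, min = 31, max = 44)
y <- 2 + 0.97 * x + rnorm(30, sd = 3)
fit <- lm(y ~ x)

# Hypothetical helper: one row per data point with the squared components
build_table <- function(y, y_hat) {
  data.frame(
    y_i      = y,
    y_ihat   = y_hat,
    total_sq = (y - mean(y))^2,       # contributes to Total SS
    reg_sq   = (y_hat - mean(y))^2,   # contributes to Reg SS
    res_sq   = (y - y_hat)^2          # contributes to Res SS
  )
}

exact   <- build_table(y, unname(fitted(fit)))
rounded <- build_table(y, round(unname(fitted(fit)), 1))  # rounded predictions

cols <- c("total_sq", "reg_sq", "res_sq")
sums_exact   <- colSums(exact[, cols])
sums_rounded <- colSums(rounded[, cols])

print(head(exact, 3))
print(sums_exact)

# Gap between Total SS and (Reg SS + Res SS) in each version
gap_exact   <- unname(sums_exact["total_sq"] - sums_exact["reg_sq"] - sums_exact["res_sq"])
gap_rounded <- unname(sums_rounded["total_sq"] - sums_rounded["reg_sq"] - sums_rounded["res_sq"])
print(c(gap_exact = gap_exact, gap_rounded = gap_rounded))
```

With full precision the gap is zero up to floating-point error, while the rounded version usually leaves a small nonzero gap.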
Calculate the R-squared and R-squared-adj:
R-squared = Reg SS / Total SS = 261.134/490.372 = 0.5325
R-squared-adj = 1 - 8.187/(s_leg)^2 = 0.5159
As expected, they are almost the same.
Going by the adjusted value, about 51.59% of the variability in the length of the legs can be explained by the length of the arms.
This R-squared value gives a good sense of how strongly arm length and leg length are related in this sample.
But to affirm that there is a significant linear relationship between these two variables a hypothesis test is necessary.
If you are totally new to hypothesis testing, you may wonder why we need to affirm that. We already developed the model and calculated the correlation.
But we studied only 30 samples and developed the model on these 30 samples. If we want to draw a conclusion about the whole population from it, we need hypothesis testing.
In this example, we will use the ANOVA table we described before for the hypothesis testing.
Hypothesis Test Example Using the ANOVA Table
Two equivalent tests can be used to assess the hypotheses: 1) the t-test and 2) the F-test.
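The equivalence of the two tests can be checked numerically. In simple linear regression, the square of the slope’s t-statistic equals the F-statistic, and both give the same p-value. Here is a minimal sketch on simulated stand-in data (an assumption for illustration, not the arm/leg dataset):

```r
# Simulated stand-in data (an assumption for illustration, NOT the arm/leg dataset)
set.seed(5)
x <- runif(30, min = 31, max = 44)
y <- 2 + 0.97 * x + rnorm(30, sd = 3)
fit <- lm(y ~ x)

# t-test for the slope, taken from the coefficient table
coef_table <- summary(fit)$coefficients
t_slope <- coef_table["x", "t value"]
p_t     <- coef_table["x", "Pr(>|t|)"]

# F-test, taken from the ANOVA table
anova_table <- anova(fit)
f_value <- anova_table["x", "F value"]
p_f     <- anova_table["x", "Pr(>F)"]

print(c(t_squared = t_slope^2, F = f_value))
print(c(p_from_t = p_t, p_from_F = p_f))

# t^2 equals F, and the p-values agree
print(isTRUE(all.equal(t_slope^2, f_value)))
print(isTRUE(all.equal(p_t, p_f, tolerance = 1e-6)))
```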
I chose to use the F-test. If you already know how to perform a t-test, feel free to go ahead with that. For me, the F-test and the t-test take the same amount of work, so either one is good. Here is how to perform an F-test.
The F-test follows a five-step process. This is almost a general rule: you will be able to use the same process in many other problems.
Set up the hypotheses: We set two hypotheses in the beginning, the null hypothesis and the alternative hypothesis. Then, based on the evidence, we reject or fail to reject the null hypothesis.
Null hypothesis: beta1 = 0
Remember from the linear regression equation that beta1 is the slope of the regression line. Setting the null hypothesis as beta1 = 0 means we assume that there is no linear association between the arm length and the leg length.
Alternative hypothesis: beta1 != 0
The alternative hypothesis, that beta1 is not equal to zero, means that there is a linear association between the arm length and the leg length.
Set the significance level: alpha = 0.05. That corresponds to a 95% confidence level.
If you need a refresher in the confidence interval concept, please check out this article.
Select the appropriate test statistic. Here we are selecting F-statistic.
Define the decision rule. That means deciding in advance when to reject the null hypothesis.
Since this is an F-test, we need to determine the appropriate critical value from the F-distribution. You can use a printed table to find the F value, but a table does not include all the F values. I prefer using R. It’s very simple and easy. R has the ‘qf’ function, which takes a probability (here 0.95, which is 1 - alpha) and the two degrees of freedom. We already discussed the two types of degrees of freedom: ‘Reg df’ and ‘Res df’.
qf(0.95, df1 = 1, df2 = 28)
So if the F is greater than or equal to 4.196, reject the null hypothesis. Otherwise, do not reject the null hypothesis. This is our Decision rule.
Calculate the test statistic.
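There are two ways that I will show here. First, I will do it manually to show the steps. Then I will simply use the ‘anova’ function from R. We already know the ‘Reg SS’, ‘Res SS’, and the degrees of freedom. So, here is the ANOVA table:
Source | SS | df | MS | F
Regression | 261.134 | 1 | 261.134 | 31.896
Residual | 229.238 | 28 | 8.187 |
Total | 490.372 | 29 | |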
Please feel free to download the original excel file where I did all these calculations.
Notice that I did not calculate the p-value in the table, because I wanted to show the calculation here. I will use R to calculate the p-value from the F-statistic.
1-pf(31.896, 1, 28)
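The manual steps can also be put together in R. This sketch starts from the ‘Reg SS’, ‘Res SS’, and n reported in this article, and rebuilds the mean squares, the F-statistic, the critical value, and the p-value. The variable names are my own choices for illustration.

```r
# Values reported in this article
reg_ss <- 261.134
res_ss <- 229.238
n <- 30
k <- 1                      # one predictor in SLR

reg_df <- k                 # Reg df
res_df <- n - k - 1         # Res df = 28
reg_ms <- reg_ss / reg_df   # Reg MS
res_ms <- res_ss / res_df   # Res MS
f_stat <- reg_ms / res_ms   # F-statistic

# Critical value at alpha = 0.05 and the p-value (upper tail of the F-distribution)
f_crit  <- qf(0.95, df1 = reg_df, df2 = res_df)
p_value <- pf(f_stat, reg_df, res_df, lower.tail = FALSE)

anova_by_hand <- data.frame(
  source = c("Regression", "Residual", "Total"),
  SS     = c(reg_ss, res_ss, reg_ss + res_ss),
  df     = c(reg_df, res_df, n - 1),
  MS     = c(reg_ms, res_ms, NA),
  F      = c(f_stat, NA, NA)
)
print(anova_by_hand)
print(c(F = f_stat, critical_value = f_crit, p_value = p_value))

# Decision rule: reject the null hypothesis when F >= critical value
print(f_stat >= f_crit)
```

Using lower.tail = FALSE in ‘pf’ is equivalent to 1 - pf(...), and it is a little more accurate when the p-value is very small.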
You can get the ANOVA table directly from the ‘anova’ function in R. The ‘anova’ function takes the linear regression model. Remember, we fitted the linear regression model in the beginning and saved it in the variable ‘m’. Please go back and check. We will pass that ‘m’ to the ‘anova’ function to get the ANOVA table:
anova(m)
Analysis of Variance Table

Response: leg
          Df Sum Sq Mean Sq F value    Pr(>F)
arm        1 261.16 261.157  31.899 4.739e-06 ***
Residuals 28 229.24   8.187
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Look at the output carefully. The ANOVA table lists Df (degrees of freedom), Sum Sq (the SS, or Sum of Squares, in the table we calculated before), Mean Sq (MS, or Mean Square), the F value, and the p-value. If you compare the values, they are nearly the same as the ones we calculated by hand.
Draw the conclusion. We defined the decision rule before: reject the null hypothesis if F ≥ 4.196. The F value is 31.899, so we can reject the null hypothesis. That means we have enough evidence of a significant linear relationship between arm length and leg length at the alpha = 0.05 level. Our p-value is also less than alpha, which is further evidence that we can reject the null hypothesis.
If you finished all of that, congratulations! That’s a lot of work. This is one of the simplest models, and yet one of the most popular. Lots of other models are based on linear regression, so it is important to learn it well and grasp the basic concepts. Hypothesis testing is also a common everyday task in statistics and data analytics. So, this article covered a lot of useful and widely used material. I hope it was helpful.