Detailed Guide to Multiple Linear Regression Model, Assessment, and Inference

Linear regression is one of those old-school statistical modeling approaches that are still popular. With the development of new languages and libraries, it has become much easier to work with. In my last article, I explained the development of a simple linear regression model and its analysis in detail. If you haven’t seen it, here is the link.

Multiple linear regression is an extension of simple linear regression. In simple linear regression, we worked on the relationship between one independent variable or explanatory variable and one dependent variable or response variable. Simple linear regression uses this very common general formula:

y = mx + c

where,

y = dependent variable or response variable

x = independent variable or explanatory variable

m = slope

c = intercept

If x and y share a linear relationship, you can predict ‘y’ if you have the ‘x’ data available.

In statistics, beta0 and beta1 are used instead of c and m. So, the formula becomes:

y = beta0 + beta1*x

This equation is good enough when you are establishing a relationship between profit and sales, arm length and leg lengths, systolic blood pressure and diastolic blood pressure, etc. That means when there is only one explanatory variable and one response variable.

But in the real world scenario, we often want to analyze the relationship between one response variable and several explanatory variables. When the response variable is exam score, there might be several explanatory variables such as study time, attendance in school, playtime, and sleep hours. We want to analyze the relationship between all the possible explanatory variables with the response variable(exam score).

In this case, the equation of the linear regression becomes:

y = beta0 + beta1*x1 + beta2*x2 + beta3*x3 + ... + betak*xk    (Equation 1)

If we think of the exam score in the example mentioned before, y is the exam score, and x1, x2, and x3 are the study time, attendance at school, and playtime. We need to determine the values of beta0, beta1, beta2, beta3, and so on.
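Before moving to the real example, here is a minimal sketch of how a model like Equation 1 is written in R. The data frame ‘exam’ and its columns are hypothetical, simulated only to illustrate the formula syntax:

# Simulate a small hypothetical exam dataset
set.seed(1)
exam = data.frame(study_time = runif(30, 0, 5),
                  attendance = runif(30, 50, 100),
                  playtime   = runif(30, 0, 3))
exam$score = 40 + 6*exam$study_time + 0.3*exam$attendance - 2*exam$playtime + rnorm(30, sd = 5)

# Fit the multiple linear regression of Equation 1
exam_model = lm(score ~ study_time + attendance + playtime, data = exam)
coef(exam_model)   # beta0 (intercept), beta1, beta2, beta3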

Calculating the values of the betas is very easy and straightforward in R. Let’s work on an example.

For this demonstration, we will use a dataset that contains age, weight, body mass index (BMI), and systolic blood pressure. We will consider the systolic blood pressure as the dependent variable and weight, BMI, and age as the independent or explanatory variables. I will start with age as the only explanatory variable, then add weight and BMI one by one to understand the effect of each of them on the model and on the response variable (systolic blood pressure).

Please feel free to download the dataset and follow along, if you want to practice.

Let’s import the dataset first.

data = read.csv("health_data.csv")
head(data)
(The first rows of the data show the Age, Weight, BMI, and Systolic_blood_pressure columns.)

As we will examine the relationship between age and systolic blood pressure first, it will be interesting to see a scatter plot of age and systolic blood pressure. Here is the scatter plot:

plot(data$Age, data$Systolic_blood_pressure,
     main= "Systolic Blood Pressure vs Age",
     xlab = "Age", ylab = "Systolic Blood Pressure")
(Scatter plot of systolic blood pressure against age.)

It shows a linear trend, though with a lot of noise around it. In R, we can fit the linear regression model directly using the ‘lm’ function. I will save this model in a variable ‘m’.

m = lm(data$Systolic_blood_pressure ~ data$Age)
m

Output:

Call:
lm(formula = data$Systolic_blood_pressure ~ data$Age)

Coefficients:
(Intercept)     data$Age  
     94.872        0.635

The output shows that the intercept (beta0) is 94.872 and the slope (beta1) is 0.635. We considered x1 as age, so the linear regression equation becomes:

y = 94.872 + 0.635*Age

As we considered only one explanatory variable, no x2, x3, or beta2, beta3.

Here, the intercept of 94.872 means that when the age is zero, the predicted systolic blood pressure is still 94.872. The minimum age in this dataset is 18 (feel free to check on your own), so an age of zero is far outside the range of the data. That is why the intercept does not have a meaningful interpretation on its own in this case.

The slope of 0.635 means that if the age increases by 1 unit, the systolic blood pressure will increase by 0.635 units on average.
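To see how this fitted line sits in the scatter plot from earlier, you can redraw the plot and add the regression line. A small sketch that reuses the model ‘m’ fitted above:

plot(data$Age, data$Systolic_blood_pressure,
     main = "Systolic Blood Pressure vs Age",
     xlab = "Age", ylab = "Systolic Blood Pressure")
abline(m, col = "red")   # adds the fitted line y = 94.872 + 0.635*Age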

Using this equation, you can estimate a person’s systolic blood pressure if you know their age. For example, if a person is 32 years old, the estimated systolic blood pressure will be:

y = 94.872 + 0.635*32 = 115.192
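You can also let R do this arithmetic from the fitted model itself. A quick sketch using the coefficients stored in ‘m’:

b = coef(m)               # b[1] is the intercept (beta0), b[2] is the slope for Age (beta1)
unname(b[1] + b[2] * 32)  # estimated systolic blood pressure for a 32-year-old, about 115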

How good this estimate is, we will determine later in this article. Now it is time to add one more variable.

Adding Weight to the Model

This is pretty simple. In the model ‘m’, we considered only one explanatory variable ‘Age’. This time we will have two explanatory variables: Age and Weight. It can be done using the same ‘lm’ function and I will save this model in a variable ‘m1’.

m1 = lm(data$Systolic_blood_pressure ~ data$Age + data$Weight)
m1

Output:

Call:
lm(formula = data$Systolic_blood_pressure ~ data$Age + data$Weight)

Coefficients:
(Intercept)     data$Age  data$Weight  
    84.2799       0.6300       0.1386

Here, the intercept (beta0) is 84.28. Notice that it is different from the intercept in ‘m’ (94.87). This time the slope (beta1) for the Age variable is 0.63, which is not very different from the beta1 in model ‘m’. This slope means that if Age increases by 1 unit, systolic blood pressure will increase by 0.63 units on average when the Weight variable is controlled or fixed.

On the other hand, the slope for the Weight variable (beta2) is 0.1386, which means that if weight increases by 1 unit, systolic blood pressure will increase by 0.1386 units on average when the Age variable is controlled or fixed.

The linear regression equation becomes:

y = 84.2799 + 0.63* Age + 0.1386 * Weight

If you know a person’s Age and Weight, you will be able to estimate that person’s systolic blood pressure using this formula.
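As a sketch, the same kind of estimate can be pulled from the fitted model ‘m1’. The age of 32 and the weight of 150 below are made-up example values in whatever units the dataset uses:

b1 = coef(m1)                             # intercept, Age slope, Weight slope
unname(b1[1] + b1[2] * 32 + b1[3] * 150)  # estimated systolic blood pressure for that hypothetical person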

Adding BMI to this Model

Lastly, we add BMI to see if it changes the dynamics of the model. Let’s use the ‘lm’ function again and save this model in a variable named ‘m2’.

m2 = lm(data$Systolic_blood_pressure ~ data$Age + data$Weight + data$BMI)
m2

Output:

Call:
lm(formula = data$Systolic_blood_pressure ~ data$Age + data$Weight + 
    data$BMI)

Coefficients:
(Intercept)     data$Age  data$Weight     data$BMI  
    89.5218       0.6480       0.3209      -0.7244

Notice the output carefully. The intercept changed again; it is 89.52 this time. The slope for Age is now 0.648, where it was 0.63 in the previous model. The slope for Weight is 0.3209, while it was 0.1386 in the previous model. So, after adding BMI to the model, the values of beta0, beta1, and beta2 changed noticeably. The slope for the BMI variable is -0.7244.

The linear regression equation becomes:

y = 89.5218 + 0.648*Age + 0.3209*Weight - 0.7244*BMI

Woo! Our multiple linear regression model is ready! Now if we know the age, weight, and BMI of a person, we will be able to calculate the systolic blood pressure of that person!

How accurate is a systolic blood pressure calculated from this equation?

Let’s find out. One very common and popular way to assess the fit of the data in multiple linear regression is the coefficient of determination (R-squared). The formula for R-squared is the same as in simple linear regression:

R-squared = sum((y_calc - y_mean)^2) / sum((y - y_mean)^2)

Here,

y_calc is the calculated value of the response variable; in this case, the systolic blood pressure values calculated using the linear regression model.

y_mean is the mean of the original systolic blood pressure values.

y is the original systolic blood pressure values from the dataset.

The R-squared value represents the proportion of the variation in the response variable that can be explained by the explanatory variables.

I will use R to calculate R-squared; it is very simple. We have three models, saved in the variables m, m1, and m2, and it will be good to see the fit of each of them, so I will calculate the R-squared value for all three. Here is the R-squared value for the first model ‘m’, where the only explanatory variable was ‘Age’.

R_squared1 = sum((fitted(m) - mean(data$Systolic_blood_pressure))**2) / sum((data$Systolic_blood_pressure - mean(data$Systolic_blood_pressure))**2)
R_squared1

Output:

0.3795497

That means 37.95% of the variation in systolic blood pressure can be explained by Age.

The R-squared value for the second model ‘m1’ where explanatory variables were ‘Age’ and ‘Weight’:

R_squared2 = sum((fitted(m1) - mean(data$Systolic_blood_pressure))**2) / sum((data$Systolic_blood_pressure - mean(data$Systolic_blood_pressure))**2)
R_squared2

Output:

0.3958562

39.58% of the variation in systolic blood pressure can be explained by ‘Age’ and ‘Weight’ together. Notice that R-squared improved a bit after adding Weight to the model.

Lastly, the R-squared value for the model m2 where the explanatory variables were ‘Age’, ‘Weight’, and ‘BMI’.

R_squared3 = sum((fitted(m2) - mean(data$Systolic_blood_pressure))**2) / sum((data$Systolic_blood_pressure - mean(data$Systolic_blood_pressure))**2)
R_squared3

Output:

0.4099555

40.99% of the variation in systolic blood pressure can be explained by ‘Age’, ‘Weight’, and ‘BMI’.
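As a cross-check, R also reports R-squared directly in the model summary. A small sketch; the values should match the manual calculations above:

summary(m)$r.squared    # Age only
summary(m1)$r.squared   # Age + Weight
summary(m2)$r.squared   # Age + Weight + BMI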

The next steps might be a bit hard to follow if you are not familiar with confidence intervals or hypothesis test concepts. I have separate articles on confidence intervals and on hypothesis tests; please check them if you need a refresher.

Inference

In this section, we will perform an F-test to see whether the model is significant, that is, whether at least one of the explanatory variables has a linear association with the response variable.

There is a five-step process to perform this hypothesis test:

Step 1:

Set the hypotheses and select the alpha level.

We set a null hypothesis and an alternative hypothesis. The null hypothesis is that the slopes of all the explanatory variables are zero; that means there is no association between any of the explanatory variables and the response variable. Here is the null hypothesis:

H0: beta1 = beta2 = beta3 = 0

If the data give us enough evidence against the null hypothesis, we will reject it. That will be evidence that at least one of the slopes is not equal to zero, which means at least one of the explanatory variables has a linear association with the response variable. This is the alternative hypothesis:

H1: at least one of beta1, beta2, beta3 is not equal to 0

I am setting the alpha level as 0.05. That means the confidence level is 95%.

Step 2:

Select the appropriate test statistic. Here we will use the F-test. The test statistic is:

F = (Reg SS / k) / (Res SS / (n - k - 1)), with degrees of freedom k and n - k - 1

Here, k is the number of explanatory variables; in this example that is 3 (Age, Weight, and BMI).

n is the number of rows, or the number of data points. In this dataset there are 100 rows, so n = 100. Feel free to check it using the ‘nrow(data)’ function.

We will discuss how to calculate the F-statistic in a little bit.

Step 3:

State the decision rule:

We need to determine the appropriate value from the F-distribution with df = 3 and n - k - 1 = 100 - 3 - 1 = 96, and alpha = 0.05.

There are two ways to find the appropriate value. You can use an F-distribution table, but I prefer using R. So, here is the value from the F-distribution calculated in R:

qf(0.95, 3, 96)

Output:

[1] 2.699393

The appropriate value from the F-distribution is 2.699.

The decision rule is:

Reject the null hypothesis if F ≥ 2.699

Otherwise, do not reject the null hypothesis.

Step 4:

Compute the F-Statistic.

Here is the table that is used to calculate the F-statistic:

Source        SS         df           MS                              F
Regression    Reg SS     k            Reg MS = Reg SS / k             F = Reg MS / Res MS
Residual      Res SS     n - k - 1    Res MS = Res SS / (n - k - 1)
Total         Total SS   n - 1

Here in the table above,

The Reg SS is the regression sum of squares, which can be calculated with this formula:

Reg SS = sum((y_calc - y_mean)^2)

The Res SS is the residual sum of squares, and here is the expression to calculate it:

Res SS = sum((y - y_calc)^2)

Total SS is the total sum of squares. It can also be calculated as the sum of ‘Reg SS’ and ‘Res SS’; the following expression gives the same result:

Total SS = sum((y - y_mean)^2)

Reg df or the regression degrees of freedom is the number of explanatory variables. In this example that is 3.

n is the number of rows of the data.

I will use R to calculate the Reg SS and the Total SS:

regSS = sum((fitted(m2) - mean(data$Systolic_blood_pressure))**2)
totalSS = sum((data$Systolic_blood_pressure - mean(data$Systolic_blood_pressure))**2)
regSS
totalSS

Output:

[1] 16050.88
[1] 39091.84

The Res SS is the difference between the two: 39091.84 - 16050.88 = 23040.96.
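The Res SS can also be computed directly from the model residuals. A small sketch; both lines below should give the same number, equal to Total SS minus Reg SS:

resSS = sum(residuals(m2)**2)   # residual sum of squares
resSS
deviance(m2)                    # the same quantity, built into R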

Finding the number of rows in the data:

nrow(data)

Output:

[1] 100

The rest of the elements in the table can be calculated now. I used an Excel sheet to lay out the table, though it is possible to find everything in R. Here are the results:

Source        SS          df    MS         F
Regression    16050.88     3    5350.29    22.29
Residual      23040.96    96     240.01
Total         39091.84    99

The F-statistic is about 22.29, which matches the F-statistic that R reports in the model summary later in this article.

Feel free to download this excel file here.

Step 5:

Conclusion: F is about 22.29, which is greater than 2.699. So, we have enough evidence to reject the null hypothesis. That means at least one of the explanatory variables has a linear association with the response variable. So, the model is significant.
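Equivalently, you can convert the observed F-statistic into a p-value with the F-distribution and compare it with alpha. A quick sketch:

pf(22.29, df1 = 3, df2 = 96, lower.tail = FALSE)
# about 4.9e-11, far below 0.05, so we reject the null hypothesis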

From the F-test, we learned that the model is significant, that is, at least one of the explanatory variables has a linear association with the response variable. It will be helpful to know exactly which variable or variables have a linear association with the response variable (systolic blood pressure).

We will perform a t-test for that.

t-test for individual explanatory variables

The F-test above shows that the model is significant. Now, we can test whether each of the explanatory variables individually has a linear association with the response variable. We already went through the five-step process above, so I will not repeat the whole thing here.

Step 1: The alpha level is the same as in the F-test above (0.05). For each explanatory variable, the null hypothesis is that its slope is zero when the other variables are controlled, and the alternative hypothesis is that it is not.

Step 2:

In step 2, the test statistic will be the t-statistic with n - k - 1 degrees of freedom. We will use R to find the t-statistics in step 4.

Step 3:

In step 3, we need to find the appropriate value, this time from the t-distribution. There is a t-distribution table for this, but I prefer to use R. Because this is a two-sided test with alpha = 0.05, we look up the 0.975 quantile with 96 degrees of freedom:

qt(0.975, 96)

Output:

[1] 1.984984

If the absolute value of the t-statistic for an explanatory variable is greater than or equal to 1.985, reject the null hypothesis for that variable.

Otherwise do not reject the null hypothesis.

Step 4:

Compute the test statistic:

This is what makes the t-test so easy. The summary of a linear regression model in R gives you the t-statistic and the p-value for each coefficient. I will take the summary of the model ‘m2’ because we included all three explanatory variables in that model.

summary(m2)

Output:

Call:
lm(formula = data$Systolic_blood_pressure ~ data$Age + data$Weight + 
    data$BMI)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.218 -10.572  -0.187   8.171  47.071 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 90.14758    8.34933  10.797  < 2e-16 ***
data$Age     0.64315    0.08109   7.931 3.97e-12 ***
data$Weight  0.32226    0.14753   2.184   0.0314 *  
data$BMI    -0.73980    0.47751  -1.549   0.1246    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.49 on 96 degrees of freedom
Multiple R-squared:  0.4106, Adjusted R-squared:  0.3922 
F-statistic: 22.29 on 3 and 96 DF,  p-value: 4.89e-11

Look at the output carefully. There is a t-statistic for each of the explanatory variables, along with the corresponding p-value.
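If you want these numbers programmatically rather than reading them off the printed summary, you can pull the coefficient table as a matrix. A small sketch:

coef(summary(m2))                # estimates, standard errors, t values, and p-values as a matrix
coef(summary(m2))[, "t value"]   # just the t-statistics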

Step 5:

Here is the conclusion from the test:

As per our decision rule, we should reject the null hypothesis if the absolute value of the t-statistic is greater than or equal to 1.985. You can see that for the Age and Weight variables the t-statistic is greater than 1.985, so we can reject the null hypothesis for both of them. Let’s talk about each of them one by one.

As per the t-test, the Age variable is significant and has a linear association with systolic blood pressure when Weight and BMI are controlled.

In the same way, the Weight variable is also significant and has a linear association with systolic blood pressure when Age and BMI are controlled.

On the other hand, the t-statistic for the BMI variable is -1.549, and its absolute value, 1.549, is smaller than 1.985. So, we do not have enough evidence to reject the null hypothesis for the BMI variable. That means we cannot conclude that the BMI variable has a linear association with systolic blood pressure when Age and Weight are controlled.

You can also use the p-value to draw a conclusion. If the p-value is less than the alpha level (in this example 0.05), we have enough evidence to reject the null hypothesis. If you look at the summary output above, for the Age and Weight variables the p-value is less than 0.05, so this way too we can conclude that Age and Weight have a linear association with systolic blood pressure. On the other hand, the p-value for BMI is greater than 0.05, so using the p-value you also cannot conclude that BMI has a linear association with systolic blood pressure.
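The p-values in the summary are just two-sided tail probabilities from the t-distribution with 96 degrees of freedom. A quick sketch using the t-statistic reported for Weight:

2 * pt(abs(2.184), df = 96, lower.tail = FALSE)
# about 0.031, matching Pr(>|t|) for data$Weight in the summary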

If you read my article on simple linear regression, you may be wondering why I did not use ANOVA for the inference part here. Using the ‘anova()’ function for this in multiple linear regression is not a good idea, because it computes sequential sums of squares: it gives you different results if you list the explanatory variables in a different order, which becomes very confusing. Try using the ‘anova()’ function in R with Age, Weight, BMI once, and then with Weight, Age, BMI. You might get different ANOVA tables.
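Here is a quick sketch of that experiment; it refits the model with the explanatory variables listed in two different orders:

anova(lm(data$Systolic_blood_pressure ~ data$Age + data$Weight + data$BMI))
anova(lm(data$Systolic_blood_pressure ~ data$Weight + data$Age + data$BMI))
# The sequential (Type I) sums of squares for Age and Weight may differ between the two tables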

I hope this was helpful. This article covers a lot of material, and if it is all new to you, it may take some time to really grasp these ideas. These are not the only tests in statistics; they are just some common and popular hypothesis tests. I suggest taking a dataset of your own, developing a linear regression model, and running hypothesis tests like the ones in this article. If you are reading this to learn, that is the only way to learn.

Feel free to follow me on Twitter and like my Facebook page.
