Linear regression is the most commonly used regression model because it is simple to use, easy to interpret, and can still extract useful information from data. In this article, we will discuss fitting a linear regression model to data, drawing inferences from it, and some useful visualizations.
Tools To Be Used:
- The Python programming language and a few of its popular libraries. Even if you do not know all of these libraries, you will still be able to follow this article and understand the concepts.
- Jupyter Notebook environment.
Feature Selection
In my experience, the best way to learn is through an example. I will use a dataset and explain, step by step, the process of fitting a model to the data and drawing inferences from it. I am using a survey dataset called the NHANES dataset, which is very good for practice. Feel free to download it from this link:
Let’s import the packages and the dataset:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np

df = pd.read_csv('nhanes_2015_2016.csv')
This dataset is quite large, and we do not need all of its features for this article. We will focus on a regression model where the outcome variable is waist size. If you want to work with a different outcome, make sure you choose a quantitative variable rather than a categorical one, because linear regression is designed to predict quantitative outcomes only. Categorical variables can, however, be used as independent variables. Check the names of all the columns to get an idea of the dataset:
df.columns

#Output:
Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR', 'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR', 'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2', 'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC', 'BMXWAIST', 'HIQ210'], dtype='object')
So, keep waist size and a few other variables that seem related to it, and make a new, smaller DataFrame. Intuitively, weight should have a strong correlation with waist size; gender, BMI, and height may also play important roles. So, make a new smaller DataFrame with these columns only and drop any rows with null values for convenience.
keep = ['BMXWT', 'BMXHT', 'BMXBMI', 'BMXWAIST', 'RIAGENDR']
db = df[keep].dropna()
db.head()
Linear Regression And Interpretation
We now have a dataset that contains five columns: weight, height, body mass index (BMI), waist size, and gender. As mentioned earlier, waist size is the outcome variable that we will try to predict using the other variables. Initially, we will use just one variable, or covariate, to predict waist size. Weight (BMXWT) is a good covariate to start with, because waist size is expected to increase with weight. Other factors, such as height and gender, matter as well, but we will consider them later. First, we fit a model in which waist size is expressed as a function of weight alone.
model = sm.OLS.from_formula("BMXWAIST ~ BMXWT", data=db)
result = model.fit()
result.summary()
The table above may look intimidating, but most of the information in it is not important for our purposes. We only need this part of the table:
In the first column, we have coefficients. Remember the linear regression formula:
Y = AX + B
In the table above, 42.7189 is B (the intercept) and 0.6991 is A (the slope). So, for each extra unit of weight, the predicted waist size increases by 0.6991 units, and the p-value in the P>|t| column tells us this estimate is statistically significant. Next, the standard error of 0.005 measures the uncertainty of the estimated slope around the true slope. The t-statistic says that the estimated slope, 0.6991, is 144.292 standard errors above zero. The last two columns give the bounds of the confidence interval, which is 95% by default. Here the 95% confidence interval runs from 0.690 to 0.709, a very narrow range. Later we will draw a confidence interval band.
db.BMXWAIST.std()
The standard deviation is 16.85, which seems far higher than the regression slope of 0.6991. But the regression slope is the average change in waist size for a one-unit change in weight. That means if one person weighs 10 units more than another, s/he is expected to have a waist size 0.6991 * 10, or about 6.99, units larger.
Correlation
Other than the values in this small sub-table, one more parameter in the result summary is important: the R-squared value in the top section of the summary. Here the R-squared value is 0.795, which means that 79.5% of the variation in waist size can be explained by weight. Now, compare this R-squared value with the squared Pearson correlation coefficient.
cc = db[["BMXWAIST", "BMXWT"]].corr()
print(cc)

#Output:
          BMXWAIST     BMXWT
BMXWAIST  1.000000  0.891828
BMXWT     0.891828  1.000000
To find the R-squared value:
cc.BMXWAIST.BMXWT**2
This returns the R-squared value of 0.795 again. The most important part, the predicted values of waist size as a function of weight, can be found this way:
result.fittedvalues
This is just part of the output; the full result is much longer.
Let’s add a second covariate and see how it affects the regression performance. I chose gender as the second covariate, using a relabeled gender column:
db["RIAGENDRx"] = db.RIAGENDR.replace({1: "Male", 2: "Female"})
Here is the model and model summary:
model = sm.OLS.from_formula("BMXWAIST ~ BMXWT + RIAGENDRx", data=db)
result = model.fit()
result.summary()
In the code above, BMXWT + RIAGENDRx does not mean that these two columns are joined or mathematically added. It just indicates that both are included in the model. In this new model, waist size is expressed as a function of both weight and gender. From the result above, the coefficient of weight (BMXWT) is 0.7272, a bit higher than before. This time the coefficient means that if two people of the same gender differ in weight by one unit, their waist sizes are expected to differ by 0.7272 units. On the other hand, the gender coefficient (RIAGENDRx) of -5.0832 means that if we compare a man and a woman of the same weight, the man is expected to have a waist size 5.0832 units smaller than the woman's.
All the coefficients are expressed as average effects. If we compare a man of weight 70 to a woman of weight 50, the man's waist is expected to be about -5.0832 + (70 - 50) * 0.7272, or roughly 9.46, units larger than the woman's.
The regression coefficient of weight (BMXWT) changed a little, but not much, after adding gender to the model. Adding a second covariate changes the coefficient of the first when the two covariates are correlated. Let's check the correlation between them:
db[['BMXWT', 'RIAGENDR']].corr()
As you can see, the correlation is -0.2375, so there is a correlation but not a strong one. You can find the predicted waist sizes using the fittedvalues attribute, as shown before; I will not demonstrate that again.
Let’s add a third covariate. I chose BMXBMI; you may try other variables as well.
model = sm.OLS.from_formula("BMXWAIST ~ BMXWT + RIAGENDRx + BMXBMI", data=db)
result = model.fit()
result.summary()
Notice that after adding BMXBMI, the coefficient of the gender variable changed significantly. We can say that BMI was masking part of the association between waist size and gender. The coefficient of weight also changed significantly, so BMI was masking part of the relationship between waist size and weight as well.
You can add more covariates in the model and see the effect of each covariate.
Visualization Of The Model
In this section, we will visualize the result of the regression model. We will plot the regression line that is the fitted values or the predicted values with the confidence interval. If you need a refresher on the confidence interval concepts, please check this article:
Confidence Interval, Calculation, and Characteristics: what a confidence interval is, how to calculate it, and its important characteristics (towardsdatascience.com)
For this plot, we will fix the gender as female and the BMI at 25. We also need to choose one independent variable as the focus variable; we will use weight (BMXWT). So, the plot will show the predicted waist size, across the range of weights, for females with a BMI of 25.
from statsmodels.sandbox.predict_functional import predict_functional

values = {"RIAGENDRx": "Female", "RIAGENDR": 1, "BMXBMI": 25}
pr, cb, fv = predict_functional(result, "BMXWT", values=values, ci_method="simultaneous")
# Here, pr is the predicted values, cb is the confidence band,
# and fv is the values of the focus variable

ax = sns.lineplot(x=fv, y=pr, lw=4)
ax.fill_between(fv, cb[:, 0], cb[:, 1], color='grey', alpha=0.4)
ax.set_xlabel("BMXWT")
_ = ax.set_ylabel("BMXWAIST")
The grey area in the plot is the confidence band, meaning the true average waist size is expected to fall within that area. The width of the grey area varies along the regression line, so the confidence interval changes with weight.
You can fix the weight and see the result for a certain weight as well. Let's fix the weight at 65 and plot BMI vs waist size for the female population. We need to change the 'values' dictionary for that: previously we fixed BMI at 25, but now we want to fix the weight instead. So, we delete the BMI entry from the values dictionary and add an entry for weight.
del values["BMXBMI"]  # delete this, as BMI is now the focus variable
values["BMXWT"] = 65  # fix the weight at 65

pr, cb, fv = predict_functional(result, "BMXBMI", values=values, ci_method="simultaneous")

ax = sns.lineplot(x=fv, y=pr, lw=4)
ax.fill_between(fv, cb[:, 0], cb[:, 1], color='grey', alpha=0.4)
ax.set_xlabel("BMI")
_ = ax.set_ylabel("BMXWAIST")
In the plots above, we plotted only averages: the model predicts the average waist size for a given weight, gender, or BMI. Using the same regression model, the variance structure can also be assessed, which shows how much the observed values deviate from the mean. To do that, we can plot the residuals against the fitted values. Remember, the fitted values are the predicted values, or estimated means, and the residuals are the differences between the observed values and those fitted values.
pp = sns.scatterplot(x=result.fittedvalues, y=result.resid)
pp.set_xlabel("Fitted values")
_ = pp.set_ylabel("Residuals")
It looks like the variance is a bit higher when the fitted values are lower.
It is also possible to visualize the effect of a single covariate. A component-plus-residual plot, also called a partial residual plot, shows how the dependent variable changes with one covariate while the other covariates are held fixed:
from statsmodels.graphics.regressionplots import plot_ccpr

ax = plt.axes()
plot_ccpr(result, "BMXWT", ax)
ax.lines[0].set_alpha(0.2)  # reduce overplotting with transparency
_ = ax.lines[1].set_color('orange')
Now, we can see the effect of BMI in the same way, with the weight variable held fixed.
ax = plt.axes()
plot_ccpr(result, "BMXBMI", ax)
ax.lines[0].set_alpha(0.2)
ax.lines[1].set_color("orange")
The effect of BMI is much steeper than that of weight.
In this article, you learned how to fit a linear regression model, the different statistical parameters associated with linear regression, and some useful visualization techniques. The visualization techniques included plotting the regression line with its confidence band, plotting the residuals, and plotting the effect of a single covariate.