One of the most basic, popular, and powerful statistical models is logistic regression. If you are familiar with linear regression, logistic regression is built upon linear regression. It uses the same linear formula just a bit different implementation. This article will discuss the details of logistic regression in R. But for a refresher or better understanding, I will discuss some formulas behind the model.
Simple Logistic Regression
As I mentioned before this uses the same linear formula as linear regression. Then what is the difference between linear regression and logistic regression? In linear regression, the outcome variable is a continuous variable. But in logistic regression, the outcome variable is a categorical variable. In a simple logistic regression model, the outcome is simply yes or no (or 1 or 0), a dichotomous variable. It predicts the probability of an event happening or not happening. If you know the probability of an event happening, you automatically know the probability of that event not happening.
In other words, logistic regression deals with the odds of an event. Here is the formula for logistic regression,
Do you see the similarity of this equation to the equation of the straight line:
Y = mx + c
Where Y represents L, m represents beta1 and c represents beta0. There is just an error term at the end of the logistic regression formula.
Solving the logistic regression equation above for p we can get,
The output of this equation will be a decimal number from 0 to 1.
The output of the logistic regression model is the probability of an event.
The goal of a logistic regression model is to find out the beta values. If we can find those values logistic regression model is completed for the prediction.
As you can see that the calculation of beta values in this equation is a bit more complicated than the linear regression equation. But the good news is we are not going into that complication. Because R has all the great functionalities that will handle the complicated calculations behind the scenes.
A Working Example
The Heart dataset from Kaggle will be used for this demonstration. If you follow my articles regularly, you know that I use that dataset a lot. Simply to save some time to find a suitable new dataset I reuse the same dataset wherever it works.
Here I am importing the dataset:
d = read.csv('Heart.csv')
If you examine the dataset, towards the end there is a column named ‘AHD’ that contains the values 1 and 0. Here 1 means a person with heart disease and 0 means who hasn’t. This ‘AHD’ is our outcome variable.
To start with, only one predictor variable will be used. I choose the ‘Chol’ variable for that. Using the ‘glm’ function in R, the model can be developed in one line of code. Here is the simple one line of code that will provide you with the beta values:
m = glm(d$AHD ~ d$Chol, family = binomial) m
Call: glm(formula = d$AHD ~ d$Chol, family = binomial)Coefficients: (Intercept) d$Chol -0.989822 0.003339Degrees of Freedom: 302 Total (i.e. Null); 301 Residual Null Deviance: 418 Residual Deviance: 415.8 AIC: 419.8
Here intercept is -0.9898 (that is beta0) and the slope for the ‘Chol’ variable (beta1) is 0.0033.
Using this formula, you can calculate the probability of heart disease if the cholesterol level is known. For example, if a person has a cholesterol level of 220, the probability of that person having heart disease is:
Calculating the above expression the probability of having a heart disease is 0.43 or 43% if the cholesterol level is 220.
Odds and Odds Ratios
If you remember, in the logistic regression equation in the beginning p/(1-p) was the odds of an event happening. The logistic regression model totally based on the odds of an event. In this example, the model will be based on the odds of heart disease. If the logistic regression equation is solved for p/(1-p), the expression to calculate the odds become:
In other words:
Using this equation, the odds for a person with a cholesterol level of 220 of having heart disease can be estimates as:
Why the ‘odds’ is important?
The ‘odds’ gives the idea about how important the explanatory variable is to predict the outcome variable.
The closer the ‘odds’ is to 1, the explanatory variable is less important.
Why? Notice, the ‘odds’ is p/1-p. Here p represents the probability of an event happening and 1-p means the probability of an event not happening. Odds will become 1 if the probability of an event happening and not happening is the same. Think of this example. If for any given cholesterol level the probability of having heart disease and not having a heart disease is the same, that means cholesterol level is not an important predictor of heart disease.
Let’s come to the odds ratio:
If we want to know how the probability changes if the cholesterol level changes by 20 units, how do we do that? Here is the equation for that:
Solving this equation:
Let’s work on an example. How much probability of heart disease changes if the cholesterol level is 245 instead of 220.
It comes out to be 1.1. That means the ‘odds’ of having a heart disease is 1.1 times higher if the cholesterol level is up by 25 (245–220) units. This change is constant for any 25 units.
Suppose you want to know that if the cholesterol level is associated with heart disease in a certain significance level such as 0.05.
There is a standard five-step procedure for this type of inference. If you are not familiar with a statistical inference or hypothesis testing, please go through this article to learn details about confidence interval, t-test, and z-test.
In this article, I am not going through that formal five-step process because the focus of this article is to demonstrate the use of R for those calculations. And that is a lot easier and shorter process. If we simply take the summary of the model ‘m’ above, that will provide us with the necessary information about the inference.
Call: glm(formula = d$AHD ~ d$Chol, family = binomial)Deviance Residuals: Min 1Q Median 3Q Max -1.573 -1.098 -1.020 1.236 1.419Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.989822 0.572848 -1.728 0.084 . d$Chol 0.003339 0.002271 1.470 0.142 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for binomial family taken to be 1)Null deviance: 417.98 on 302 degrees of freedom Residual deviance: 415.78 on 301 degrees of freedom AIC: 419.78Number of Fisher Scoring iterations: 4
Here is the interpretation of the summary results:
Look at the coefficients part of the summary table in the output above. The first column is an Estimate that gives you the beta values. Here, the intercept is -0.9898 which is beta0 and the slope for cholesterol level is 0.0033 which is beta1. The next column is the standard error that is not necessary as the summary table already provided us with the z-value.
- z-value for cholesterol level is 1.47. By default, the summary function is based on a 95% confidence interval. You can use a z-table to find the standard z-score from a standard normal distribution. Here I am providing the z-score for some commonly used confidence interval:
As you can see, the standard z-score for a 95% confidence interval is 1.96. If the calculated z-score is greater than or equal to 1.96, we can reject the idea that there is no association between cholesterol level and heart disease. But in this case, the calculated z-score is 1.47 which is smaller than 1.96. That means, there is no association between cholesterol level and heart disease. Cholesterol level is not a significant predictor of the probability of heart disease.
3. Another way of inference is to see the p-value. Look at the Pr(>|z|) column in the summary output above. The Pr or p-value for the cholesterol level is 0.142 which is bigger than the significance level of 0.05. That also indicates that the cholesterol level is not a significant predictor of the probability of heart disease.
Adding More Predictor and Checking for Significance
There are so many columns on this dataset. So why not add a few more variables and checking for significance. Here I am adding a few more variables: ‘Sex’, ‘RestBP’, ‘ChestPain’, and ‘MaxHR’. ‘ChestPain’ is a categorical variable. It needs to be converted to numeric.
d$ChestPain = as.integer(as.factor(d$ChestPain))
This changes the four categories of ‘ChestPain’ column ‘asymptomatic’, ‘nonanginal’, ‘nontypical’, and ‘typical’ as 1, 2, 3, and 4.
Here is the model and summary:
m3 = glm(d$AHD ~ d$Chol + d$Sex + d$RestBP + d$ChestPain + d$MaxHR, family = binomial) summary(m3)
Call: glm(formula = d$AHD ~ d$Chol + d$Sex + d$RestBP + d$ChestPain + d$MaxHR, family = binomial)Deviance Residuals: Min 1Q Median 3Q Max -2.5095 -0.7739 -0.2296 0.7185 2.8139Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.811483 1.619947 0.501 0.616419 d$Chol 0.007352 0.002903 2.532 0.011327 * d$Sex 2.022295 0.366336 5.520 3.38e-08 *** d$RestBP 0.029476 0.008698 3.389 0.000702 *** d$ChestPain -0.976384 0.174952 -5.581 2.39e-08 *** d$MaxHR -0.042467 0.007913 -5.366 8.03e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for binomial family taken to be 1)Null deviance: 417.98 on 302 degrees of freedom Residual deviance: 282.24 on 297 degrees of freedom AIC: 294.24Number of Fisher Scoring iterations: 5
Here I added a total of five variables. You may think why I still added the ‘Chol’ variable here? We already check in the previous section that ‘Chol’ is not a significant predictor for heart disease.
Sometimes a variable alone is not a significant predictor but when it is added with other variables it becomes significant. I just wanted to check if that happens in this case.
Here we have the slope or beta coefficients for all the added variables. Let’s calculate, what is the probability of a man having heart disease if he has a ‘Chol’ of 230, ‘RestBP of 150, ‘typical’ chest pain, and a MaxHR of 145.
The probability of that man having heart disease is 0.007. Too small!
Do you see that the p-value for the ‘Chol’ parameter is 0.011 now which is smaller than the significance level of 0.05? That means now ‘Chol’ is a significant predictor. All the other p-values are also smaller than the significance level. So, the variables used in this model are all significant predictors.
Odds Ratio and Confidence Interval
We discussed the odds ratio before. R has this ‘cbind’ function that gives odds ratio and confidence intervals for each of the predictor variables:
exp(cbind(OR = coef(m3), confint.default(m3)))
OR 2.5 % 97.5 % (Intercept) 2.2512446 0.09408516 53.8671819 d$Chol 1.0073793 1.00166344 1.0131278 d$Sex 7.5556425 3.68507064 15.4916252 d$RestBP 1.0299151 1.01250545 1.0476241 d$ChestPain 0.3766706 0.26732673 0.5307392 d$MaxHR 0.9584225 0.94367191 0.9734037
From the p-value and z-statistic, it was determined before that all the predictor variables have some association with heart disease. The odds ratio is also a measure for that. The odds ratios for ‘Chol’ and ‘RestBP’ are very close to 1. So, these two variable’s association with heart disease is not as strong.
What is the interpretation of confidence interval?
The confidence interval for the ‘Chol’ variable 1.002 to 1.01 means, we are 95% confident that the odds of having heart disease is 1.002 to 1.01 times higher for every 1 unit increase of the ‘Chol’ level.
The confidence interval for the ‘Sex’ variable is 3.69 to 15.49. In the dataset the value 0 represents female and the value 1 represents the male. So, this confidence interval means, we are 95% confident that the odds of having heart disease is 3.69 to 15.49 times higher for a male than a female.
This article focused on R implementation of Logistic Regression, inference, and interpretation of the results in simple language. Hopefully, it was helpful. This simple logistic regression model is very useful in many real-world situations. Please feel free to ask, if you have any questions.