ANOVA (Analysis of Variance) is a method for comparing the means of more than two groups. It can also be used to compare the means of exactly two groups, but that is unnecessary: a two-group comparison can be done with a simpler hypothesis test such as a t-test.
If you need a refresher on the t-test or z-test please check this article.
This article will focus on comparing the means of more than two groups using the Analysis of Variance (ANOVA) method. This method breaks down the overall variability of a given continuous outcome into pieces.
One-Way Analysis of Variance
One-way ANOVA is applicable when the groups are defined by the values of a single factor. The goal is to compare the means across those groups.
Whether or not the sample means differ, the amount of variability within each group is an important element to consider when comparing the means and drawing a conclusion.
The general rule is: if the variability between groups is smaller than the variability within the groups, it is less likely that there is a difference in means between the groups. Conversely, if the variability between groups is larger than the variability within the individual groups, there might be a difference in means between the groups.
So, in ANOVA we need both the variability between groups and the variability within the individual groups. Normally, the F-statistic is used as the test statistic in ANOVA, and it is the ratio of the two:

F = variability between groups / variability within groups

Or, it can be expressed as:

F = MSB / MSW

where MSB is the Mean Square Between and MSW is the Mean Square Within.
The numerator and denominator of the F-statistic formula should be analyzed separately.

The numerator is the Mean Square Between (MSB), the variability of the group means around the overall mean:

MSB = [ n1(x̄1 − x̄)² + n2(x̄2 − x̄)² + … + nk(x̄k − x̄)² ] / (k − 1)

The denominator is the Mean Square Within (MSW), the variability inside the individual groups:

MSW = [ (n1 − 1)s1² + (n2 − 1)s2² + … + (nk − 1)sk² ] / (n − k)

Here:

n = total number of observations
k = number of groups
nj = number of observations in group j
x̄j = mean of group j, and x̄ = overall mean
sj = standard deviation of group j

As a reminder, the square of the standard deviation is the variance.
Before moving further on the discussion, let’s work on a real dataset and calculate a between-group variance and within-group variance to understand it more clearly:
I will use R for the calculations here. You can use any other programming language, or even Excel. The dataset I am using for this demonstration is ‘mtcars’, which comes built into R. These are the column names of the dataset:
 "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am"  "gear" "carb"
Here ‘cyl’ is a categorical variable with three unique values: 4, 6, and 8. In this article, we will check whether the mean ‘hp’ (horsepower) differs across the ‘cyl’ (cylinder) groups.
Below is a boxplot that will provide a preliminary idea about the ‘hp’ data for each ‘cyl’:
boxplot(hp~cyl, data = mtcars, main = "hp by cyl", xlab = "cyl", ylab = "hp")
The picture above already tells us a few things: ‘cyl’ 8 seems to have higher values but also a wider range, while 4 and 6 are closer together, with 4 showing more variability and 6 less variability plus some outliers.
For the calculation of between-group variance and within-group variance, we need the following information:
a. The number of observations:
b. The mean ‘hp’:
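The original snippets for these first two quantities are not shown, but in base R they can be read off directly; a minimal sketch:

```r
# total number of observations (rows) in mtcars
n <- nrow(mtcars)                 # 32

# overall mean horsepower across all cars
overall_mean <- mean(mtcars$hp)   # about 146.69
```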
c. The mean ‘hp’ for the Individual ‘cyl’ group:
library(dplyr)

mtcars %>% group_by(cyl) %>% summarise(mean(hp))
d. The standard deviation of ‘hp’ for the individual ‘cyl’ group:
mtcars %>% group_by(cyl) %>% summarise(sd(hp))
e. The variance of ‘hp’ for each ‘cyl’ group:
mtcars %>% group_by(cyl) %>% summarise(var(hp))
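With the group sizes, means, and variances in hand, MSB and MSW can be computed by hand. Here is a sketch in base R (small differences from the rounded figures below come from rounding the group means):

```r
grand_mean <- mean(mtcars$hp)
groups <- split(mtcars$hp, mtcars$cyl)   # hp values per cyl group

k  <- length(groups)            # 3 groups
n  <- nrow(mtcars)              # 32 observations
nj <- sapply(groups, length)    # 11, 7, 14
mj <- sapply(groups, mean)      # group means
vj <- sapply(groups, var)       # group variances

# Mean Square Between: sum of nj * (group mean - grand mean)^2, over k - 1
msb <- sum(nj * (mj - grand_mean)^2) / (k - 1)

# Mean Square Within: sum of (nj - 1) * group variance, over n - k
msw <- sum((nj - 1) * vj) / (n - k)

msb   # about 52015
msw   # about 1437.8
```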
Evaluating the expressions for Mean Square Between (MSB) and Mean Square Within (MSW) on this data, the Mean Square Between and Mean Square Within are 52008.23 and 1437.801 respectively.
For the inference, an F-test derived from an ANOVA table is commonly used. This is also called a global test: it tells you whether there is a significant difference between at least two groups.
The inference comes down to three common steps:
1. Set up the null hypothesis, the alternative hypothesis, and the significance level. The null hypothesis is the default claim; in this case, it is that the mean ‘hp’ for all types of ‘cyl’ is equal. The alternative hypothesis is that there is a difference in means between at least two groups. We will use a 95% confidence level, which means the significance level is 0.05.

If the data provide strong enough evidence against the null hypothesis, we will reject it and accept the alternative hypothesis. Otherwise, we will fail to reject the null hypothesis.
2. Find the appropriate critical value from the F-distribution. You can look it up in an F-table, but I will use R, where the ‘qf’ function makes this very easy. It takes three inputs: the confidence level and the two degrees of freedom. The degrees of freedom are k − 1 = 3 − 1 = 2 and n − k = 32 − 3 = 29.
qf(0.95, 2, 29)  # about 3.33
We will reject the null hypothesis if F-statistic is greater than or equal to 3.33.
3. Calculate the F-statistic. This is the general ANOVA table from which the F-statistic is found:

Source   | df    | Sum of Squares | Mean Square         | F
Between  | k − 1 | SSB            | MSB = SSB / (k − 1) | MSB / MSW
Within   | n − k | SSW            | MSW = SSW / (n − k) |
Total    | n − 1 | SST            |                     |
Remember we already calculated Mean Square Between (MSB) and the Mean Square Within(MSW) in the previous section. So we can simply calculate the F-statistic like this:
F = MSB/MSW = 52008.23/1437.801 = 36.17
Look, the F-statistic is way higher than 3.33. So we have enough evidence to reject the null hypothesis: the mean ‘hp’ is not the same for all types of ‘cyl’. At least two of the group means are different.
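Equivalently, instead of comparing the F-statistic against the critical value, you can compute its p-value directly from the F-distribution with the ‘pf’ function, using the same degrees of freedom (2 and 29) from above:

```r
# probability of an F value at least this large under the null hypothesis
p_value <- pf(36.17, 2, 29, lower.tail = FALSE)
p_value   # about 1.3e-08, far below 0.05
```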
Evaluating the Differences Between the Groups
In the previous section, we worked on evaluating if there is a difference in means between any two groups. We found out that there is. Now, you might be interested in knowing which groups exactly are different.
To determine that we need to perform a test on each pair. For example, here there are three different groups of ‘cyl’ (4, 6, and 8). So, one test would be on the mean ‘hp’ of 4 and 6, one on 6 and 8, one on 4 and 8. So, there will be three tests. There is a formula to determine how many tests are necessary.
If there are k number of groups, the number of tests will be k(k-1)/2
The problem with multiple comparisons like this is that the chance of making at least one false rejection grows with the number of tests: with k(k − 1)/2 tests each run at level alpha, it can be as high as roughly alpha × k(k − 1)/2.
In order to control the number of mistakes across the tests, there should be an adjustment to each test. One very popular adjustment is the Bonferroni adjustment, where the significance level of each test is alpha/c, with c being the total number of tests. In our case c = 3, so each pairwise test is performed at 0.05 / 3 ≈ 0.0167.
A t-statistic will be used for each pairwise test. Here is the formula, using the Mean Square Within as the pooled variance:

t = (x̄i − x̄j) / sqrt( MSW × (1/ni + 1/nj) )

with n − k degrees of freedom.
Now I will use R for the rest of the testing, because, as mentioned earlier, we have to perform several tests, and R has excellent functionality that makes this job very easy.
Here is the model:
mtcars$cyl = as.factor(mtcars$cyl)
m = aov(hp~cyl, data=mtcars)
summary(m)
            Df Sum Sq Mean Sq F value   Pr(>F)
cyl          2 104031   52015   36.18 1.32e-08 ***
Residuals   29  41696    1438
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Notice I had to convert the ‘cyl’ column to a factor. If the group column is not a factor, the pairwise tests later may give you errors.
Look at the summary table above. The Mean Square Between, the Mean Square Within, and the F value are almost the same as we calculated manually before.
Now performing the t-tests for all the pairs can be done like this:
pairwise.t.test(mtcars$hp, mtcars$cyl, p.adj = "bonferroni")
Pairwise comparisons using t tests with pooled SD

data:  mtcars$hp and mtcars$cyl

  4       6
6 0.12    -
8 1.2e-08 8.7e-05

P value adjustment method: bonferroni
If it is the first time for you, it may be a bit hard to understand the table here. This table gives the p-value for the difference in all the pairs. The p-value for groups 4 and 6 is 0.12, the p-value for groups 4 and 8 is 1.2e-08, and the p-value for groups 6 and 8 is 8.7e-05.
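As a sanity check, the 0.12 entry for the 4 vs 6 pair can be reproduced by hand: a pooled t-statistic using the Mean Square Within as the pooled variance, a two-sided p-value with n − k = 29 degrees of freedom, then multiplied by 3 for the Bonferroni adjustment. A sketch:

```r
groups <- split(mtcars$hp, mtcars$cyl)
nj <- sapply(groups, length)
mj <- sapply(groups, mean)

n <- nrow(mtcars); k <- length(groups)
msw <- sum((nj - 1) * sapply(groups, var)) / (n - k)   # pooled variance

# t-statistic for the cyl 4 vs cyl 6 comparison
t_stat <- (mj["6"] - mj["4"]) / sqrt(msw * (1 / nj["4"] + 1 / nj["6"]))

# two-sided p-value with n - k df, Bonferroni-adjusted for 3 tests
p_adj <- 2 * pt(abs(t_stat), df = n - k, lower.tail = FALSE) * 3
round(unname(p_adj), 2)   # 0.12, matching the table
```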
Considering the alpha level of 0.05, we do not have enough evidence to conclude that the mean ‘hp’ of ‘cyl’ 4 and 6 are different, because the p-value of 0.12 is bigger than the alpha level.
But there is enough evidence to reject the idea that the mean ‘hp’ of ‘cyl’ 4 and 8 is the same. Because the p-value is lower than the alpha level. The same goes for the pair ‘cyl’ 6 and 8.
There is another function called ‘TukeyHSD’ that gives the confidence interval for the difference in means for each pair and also the p-value.
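The output below came from passing the fitted ‘aov’ model to ‘TukeyHSD’; rebuilding the model here so the snippet is self-contained:

```r
# refit the one-way ANOVA model and run Tukey's Honest Significant Differences
mtcars$cyl <- as.factor(mtcars$cyl)
m <- aov(hp ~ cyl, data = mtcars)
TukeyHSD(m, conf.level = 0.95)
```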
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = hp ~ cyl, data = mtcars)

$cyl
         diff       lwr       upr     p adj
6-4  39.64935 -5.627454  84.92616 0.0949068
8-4 126.57792 88.847251 164.30859 0.0000000
8-6  86.92857 43.579331 130.27781 0.0000839
You can see the difference first, then the lower and upper bounds of the confidence interval, and the adjusted p-value. The p-values look a little different because the adjustment method is different, but the inference is the same. Here, too, we fail to reject the idea that the mean ‘hp’ for ‘cyl’ 4 and 6 are the same, because the p-value of 0.09 is higher than the alpha level of 0.05.
Adjusting for Another Variable
All this time we used only one response variable and one explanatory variable. Now we will see how to adjust for a second explanatory variable (a covariate). This process is called one-way ANCOVA.
We will first use the “Anova” function from the ‘car’ package to see if the ‘disp’ variable is significant. It will also test if at least one pair of ‘cyl’ groups has a different mean ‘hp’.
library(car)

Anova(lm(hp~cyl+disp, data=mtcars), type=3)
Anova Table (Type III tests)

Response: hp
            Sum Sq Df F value    Pr(>F)
(Intercept)  24897  1 16.8314 0.0003189 ***
cyl          13143  2  4.4428 0.0210962 *
disp           280  1  0.1890 0.6670811
Residuals    41417 28
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Notice that the ‘Anova’ function takes a linear regression model as a parameter. This article does not cover the relationship between linear regression and ANOVA; that is another long discussion. If you are interested in exploring a bit more, please check out this article, where the ANOVA table is used to draw inferences about a linear regression model.
Now, let’s see how the pairwise differences we found before change when the ‘disp’ variable is controlled for. The ‘lsmeans’ function from the ‘lsmeans’ package will be used for that:
library(lsmeans)

lsmeans(lm(hp~cyl+disp, data=mtcars), pairwise~cyl, adjust="tukey")
$lsmeans
 cyl lsmean   SE df lower.CL upper.CL
 4     90.2 20.9 28     47.4      133
 6    125.1 15.9 28     92.5      158
 8    201.9 19.8 28    161.3      242

Confidence level used: 0.95

$contrasts
 contrast estimate   SE df t.ratio p.value
 4 - 6       -34.9 21.5 28  -1.625  0.2522
 4 - 8      -111.7 37.6 28  -2.967  0.0163
 6 - 8       -76.7 29.5 28  -2.603  0.0377

P value adjustment: tukey method for comparing a family of 3 estimates
Notice how the p-value for each pair changed! All the p-values are bigger than before. The p-value for the pair 4–6 is still greater than the alpha level of 0.05, so the mean ‘hp’ is still not significantly different for that pair. The p-values for 4–8 and 6–8 are much bigger than in the ANOVA test before, where we didn’t control for any other variable, but they are still less than the alpha level of 0.05, so we still reject the idea that the mean ‘hp’ for 4–8 and 6–8 is the same.
But if we use an alpha level of 0.01, then we can no longer reject the null hypothesis that the mean ‘hp’ for the pairs 4–8 and 6–8 are the same. So, adjusting for the ‘disp’ variable changed the results noticeably. That will not necessarily happen every time, which is why we have to test and see.
Just by looking at the boxplot of the three groups, some people might conclude that the means are different, or they might draw that conclusion simply from calculating the group means. But an analysis of variance, as in this example, may show a different result.
The idea is to see whether the difference observed in the sample generalizes to the overall population, i.e., whether the difference in means is statistically significant.
This is a popular and widely used topic in statistics. I hope it was helpful!
Feel free to follow me on Twitter and like my Facebook page.