Correlation tests are very common in statistics and machine learning. In statistics, a correlation test helps us understand the impact of different variables on a population. For example, say we developed a new math course for seventh-grade students. In our dataset, we have two groups of students: one group’s parents have a college-level education, while the other group’s parents do not have college degrees. We may want to know whether the parents having a college education has any impact on the students’ success in the new math course.
In machine learning, correlation tests can be used for feature selection. In classification problems where the output variable is categorical and the input variables are also categorical, a chi-squared test can be used to check whether the input variables are even relevant to the output variable.
This article will focus on the chi-squared test to determine if there is a correlation between two variables.
We will walk through the process step by step with an example.
First I will show the manual calculation, and later we will do the same in Python.
Here is the data we will use to demonstrate how to perform a chi-square test:
The two variables shown in the table above are taken from a big dataset. Please feel free to download the dataset from this link.
There are two columns: the students’ grades and whether the parents answered the survey questions. We may want to know if parents answering the survey questions has anything to do with the grades. Let’s see if these two variables are correlated.
Here is the contingency table:
In each column of this contingency table, you can see some differences in the counts for each grade between the ‘yes’ and ‘no’ groups. The purpose of the chi-squared test, in this case, is to determine whether these differences are due to chance or are statistically significant.
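For reference, the observed counts in this contingency table can be rebuilt directly in pandas. A minimal sketch, using the counts reported throughout this article; the grade labels are assumed to follow the dataset’s alphabetically sorted GradeID values, so treat them as an assumption if your copy of the dataset differs:

```python
import pandas as pd

# Assumed grade labels, in the alphabetical order pandas would produce
grades = ["G-02", "G-04", "G-05", "G-06", "G-07",
          "G-08", "G-09", "G-10", "G-11", "G-12"]

# Observed counts: one row per ParentAnsweringSurvey group
counts = {
    "No":  [74, 25, 3, 16, 40, 39, 1, 1, 5, 6],
    "Yes": [73, 23, 0, 16, 61, 77, 4, 3, 8, 5],
}
contingency = pd.DataFrame(counts, index=grades).T
print(contingency)
print("Grand total:", contingency.values.sum())  # 480
```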
Null hypothesis: Two variables (ParentAnsweringSurvey and GradeID) are not correlated. They are independent.
To begin, we need to calculate the expected values. What is the expected value?
The expected value is the value we would expect in a cell if the null hypothesis were true. For each cell, it is the row total multiplied by the column total, divided by the grand total.
Here I am showing the calculation of two expected values for example:
The expected value for G-02 when ‘Parent Answering Survey’ is ‘no’, where the observed value is 74:
e02(no) = (210*147)/480 = 64.31
Here, 480 is the total count as shown in the table above.
Here is another example. The expected value of G-08 when ‘Parent Answering Survey’ is ‘yes’, where the observed value is 77:
e08(yes) = (270*116)/480 = 65.25
This way we can calculate the expected values for the whole table. Here is the table that shows the expected values for every cell:
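Rather than filling in each cell by hand, the whole expected-value table can be computed at once with NumPy: the outer product of the row totals and column totals, divided by the grand total. A minimal sketch using the observed counts from this article:

```python
import numpy as np

# Observed counts: row 0 = 'no', row 1 = 'yes' (columns G-02 ... G-12)
observed = np.array([[74, 25, 3, 16, 40, 39, 1, 1, 5, 6],
                     [73, 23, 0, 16, 61, 77, 4, 3, 8, 5]])

row_totals = observed.sum(axis=1, keepdims=True)   # [[210], [270]]
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()                       # 480

# Expected count per cell: (row total * column total) / grand total
expected = row_totals * col_totals / grand_total
print(expected.round(4))  # expected[0, 0] is 64.3125, as in the example above
```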
Now using the original values from the contingency table and the expected values we can calculate the chi-squared statistics. We need to calculate the chi-square values for each cell and sum them all up. Here I am showing the chi2 test-statistic calculation for the first cell:
chi2_02_no = (74 - 64.31)² / 64.31 = 1.459
In the same way, you can calculate the rest of the cells as well. Here is the complete table:
From the table above, the total is 16.24. So, our chi2 test-statistics is 16.24.
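The per-cell calculation and the sum can be expressed in a few lines of NumPy. A sketch that reproduces the statistic from the table above:

```python
import numpy as np

observed = np.array([[74, 25, 3, 16, 40, 39, 1, 1, 5, 6],
                     [73, 23, 0, 16, 61, 77, 4, 3, 8, 5]])

# Expected counts from the row/column marginals
expected = observed.sum(axis=1, keepdims=True) * observed.sum(axis=0) / observed.sum()

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(round(chi2_stat, 2))  # 16.24
```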
I used an Excel sheet to calculate the expected values and the chi-squared test statistic.
There are a few more steps before we can reach a conclusion.
I chose the significance level of 0.05.
The degrees of freedom is:
(num_rows - 1) * (num_cols - 1)
In the contingency table, the number of rows is 2 and the number of columns is 10.
So, the degrees of freedom is:
(2 - 1) * (10 - 1) = 9
Using the chi2 distribution table below, for the significance level of 0.05 and the degrees of freedom of 9, we get the chi2 critical value of 16.92.
So, the chi2 test statistic (16.24) we calculated is smaller than the chi2 critical value we have got from the distribution. So, we do not have enough evidence to reject the null hypothesis.
Based on the chi2 test, we fail to reject the hypothesis that the two variables (ParentAnsweringSurvey and GradeID) are independent; we cannot conclude they are correlated.
But at the same time, the difference between the chi2 test statistic and the critical value from the distribution is not that big. If we chose a significance level of 0.10, the result would be different: the critical value would drop to 14.68, and we would be able to reject the null hypothesis. (A smaller significance level like 0.01 or 0.025 raises the critical value, making rejection even harder.) So, it is a close call.
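The decision depends entirely on the significance level you choose. A quick sketch comparing the statistic against the critical value at several common levels, using scipy:

```python
from scipy.stats import chi2

chi2_stat = 16.24   # the statistic computed above
dof = 9             # degrees of freedom

# Critical value grows as alpha shrinks, so rejection gets harder
for alpha in (0.10, 0.05, 0.025, 0.01):
    critical = chi2.ppf(1 - alpha, df=dof)
    decision = "reject H0" if chi2_stat > critical else "fail to reject H0"
    print(f"alpha={alpha}: critical value={critical:.2f} -> {decision}")
```

Only at alpha = 0.10 does the statistic exceed the critical value.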
Here I am doing the same chi-square test using Python. It is a lot easier to make a contingency table and all the calculations using a programming language.
First I am importing the dataset here. It is the same Kaggle dataset from the link I provided before.
import pandas as pd
pd.set_option('display.max_columns', 100)
edu = pd.read_csv("xAPI-Edu-Data.csv")
These are the columns in this dataset:
Index(['gender', 'NationalITy', 'PlaceofBirth', 'StageID', 'GradeID', 'SectionID', 'Topic', 'Semester', 'Relation', 'raisedhands', 'VisITedResources', 'AnnouncementsView', 'Discussion', 'ParentAnsweringSurvey', 'ParentschoolSatisfaction', 'StudentAbsenceDays', 'Class'], dtype='object')
As in the previous example, I will use the same two variables, so you can compare the process and results. So, the null hypothesis will also be the same as before:
Null hypothesis: Two variables (ParentAnsweringSurvey and GradeID) are independent. There is no correlation.
As usual, we need to start by making a contingency table:
contingency = pd.crosstab(edu['ParentAnsweringSurvey'], edu["GradeID"])
This will provide you with exactly the same contingency table:
So, these are the observed values. For the convenience of calculation, I wanted to convert them to an array. If I grab the values only from the DataFrame, it will give me an array:
observed_values = contingency.values
observed_values
array([[74, 25, 3, 16, 40, 39, 1, 1, 5, 6], [73, 23, 0, 16, 61, 77, 4, 3, 8, 5]], dtype=int64)
These are the observed values in array form. We need to calculate the expected values now. That means what values are expected in this contingency table if the null hypothesis is true.
This time I will not calculate the expected values manually. I will use Python’s ‘scipy’ library:
import scipy.stats as stats
val = stats.chi2_contingency(contingency)
val
(16.24174256088471, 0.06200179555553843, 9, array([[64.3125, 21. , 1.3125, 14. , 44.1875, 50.75 , 2.1875, 1.75 , 5.6875, 4.8125], [82.6875, 27. , 1.6875, 18. , 56.8125, 65.25 , 2.8125, 2.25 , 7.3125, 6.1875]]))
The output is a tuple, where the first value is the chi-square statistic 16.24. This is exactly what we got in our manual calculation as well. And see, here we did not have to calculate the chi-square statistic by ourselves. We got it already!
The second value, 0.062, is the p-value. If we consider the significance level as 0.05, the p-value is bigger than the significance level alpha. So, we cannot reject the null hypothesis that the two variables (ParentAnsweringSurvey and GradeID) are independent.
The third value of the output, 9, is the degrees of freedom. We calculated that manually in the previous section as well. We will use it to find the chi2 critical value.
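Instead of indexing into the tuple, the four return values can be unpacked into named variables, which makes the code self-documenting. A sketch with the contingency table hardcoded so it runs without the CSV file:

```python
import numpy as np
from scipy import stats

# Same observed counts as the crosstab above, hardcoded for a self-contained run
contingency = np.array([[74, 25, 3, 16, 40, 39, 1, 1, 5, 6],
                        [73, 23, 0, 16, 61, 77, 4, 3, 8, 5]])

# chi2_contingency returns: statistic, p-value, dof, expected-value table
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency)
print(f"statistic={chi2_stat:.2f}, p-value={p_value:.3f}, dof={dof}")
```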
We can also use the same method of comparing the chi-square statistic with the chi2 critical value from the chi-square distribution as before. We can find the chi-square critical value from the distribution using the stat library as well.
from scipy.stats import chi2

alpha = 0.05
dof = 9  # degrees of freedom
critical_value = chi2.ppf(q=1 - alpha, df=dof)
critical_value
It is 16.92, which is bigger than the chi-square statistic. So, we cannot reject the null hypothesis.
The p-value we received before, while finding the expected values, does not depend on any particular significance level; it is simply compared against whichever alpha you choose. We can also compute the p-value directly from the chi-square statistic and the degrees of freedom as follows:
n_rows = 2
n_cols = 10
dof = (n_rows - 1) * (n_cols - 1)
p_value = 1 - chi2.cdf(x=val[0], df=dof)  # val[0] is the chi-square statistic
p_value
We got the same p-value because we used the same chi-square statistic and degrees of freedom. The p-value does not change with the significance level; only the reject-or-not decision does. Please try it yourself with different alphas.
In this article, I tried to explain the process of the chi-squared test for independence in detail, using both manual calculation and Python. I hope it was helpful!
#DataScience #Statistics #MachineLearning