The confidence interval, t-test, and z-test are very popular and widely used methods in inferential statistics. They are so important because, for any research or data analysis, we can only use a sample to come to a conclusion about a large population. In that case, these inferential statistical methods help us consider the errors and infer a better estimate for a larger population using a smaller sample.
You may think there is a lot to cover in one article. Yes, it is a lot to digest in one day. But as mentioned in the title, this article will focus on using R to construct the confidence interval and perform the t-test or z-test.
R has some very rich libraries and great functionalities that give you the confidence interval, z or t test-statistic, p-value all at the same time in a single line of code. So, I decided to cover all of them in this article.
This article will start with the basic concepts of the confidence interval and hypothesis testing and then we will learn each concept with examples.
We will cover:
1. What a confidence interval is, with a basic manual calculation
2. z-test of one sample mean in R
3. t-test of one sample mean in R
4. Comparison of two sample means in R
5. Two-sided test of the sample mean and confidence interval in R
6. Test for one sample proportion and confidence interval in R
7. Test for two sample proportions and confidence interval in R
I will start with some basic theoretical ideas. If you do not understand all of them, it’s ok. Later on, when we work through the examples and use R, it will be very easy. But it’s important to understand the theoretical ideas.
Let’s start with the confidence interval.
What is a Confidence Interval?
Let’s understand it using an example. Suppose a shopping mall wants to estimate the average number of customers it gets from 9 am to 12 pm on weekdays. How will they approach this problem?
They can take samples of about 100 weekdays and then calculate the mean. Suppose the calculated mean is 42 people. Assume the population standard deviation was 15.
From the Central Limit Theorem (CLT), the sample mean should be close to the true population mean. Here the sample mean is the mean calculated from the 100 sampled weekdays above. It may not equal the true population mean. If we take a sample of 1000 or 10000 weekdays, the sample mean may be different.
How do we infer the true population mean from this sample mean?
A range is inferred using the sample size, the sample mean, and the population standard deviation, and the true population mean is expected to fall within this range. This range is called a confidence interval.
In this example n ≥ 30 (where n is the sample size), so the sample mean is assumed to be normally distributed around the population mean (which we do not know) with a standard deviation of:
$$\frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{100}} = 1.5$$
Here, sigma is the population standard deviation.
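A quick simulation can make this standard deviation of the sample mean more concrete. The sketch below assumes a hypothetical normally distributed population with mean 42 and standard deviation 15 (the article does not say what shape the population has), repeatedly draws samples of 100 weekdays, and compares the spread of the sample means with sigma divided by the square root of n.

```r
# Hypothetical population: normal with mean 42 and sd 15 (an assumption for illustration)
set.seed(1)
n <- 100       # weekdays per sample, as in the mall example
sigma <- 15    # population standard deviation from the example
reps <- 5000   # number of repeated samples to simulate

# Mean of each simulated sample of n weekdays
sample_means <- replicate(reps, mean(rnorm(n, mean = 42, sd = sigma)))

theoretical_se <- sigma / sqrt(n)   # standard deviation of the sample mean in theory
simulated_se <- sd(sample_means)    # spread of the simulated sample means

print(c(theoretical = theoretical_se, simulated = simulated_se))
```

The two numbers should land close to each other, which is why a single sample mean can be used to reason about the population mean.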
The standard deviation of the sample mean is 1.5 here. By the 68–95–99.7 rule, about 95% of sample means fall within 2 standard deviations (2 × 1.5 = 3 customers) of the population mean. Equivalently, about 95% of the time the population mean lies within 2 standard deviations of the sample mean.
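The rule of thumb about 1, 2, and 3 standard deviations can be checked directly with the normal distribution function in base R. This is a small sketch; the helper name within_k_sd is made up for illustration.

```r
# Probability that a normally distributed value falls within k standard
# deviations of its mean (within_k_sd is a hypothetical helper name)
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, within_k_sd)
names(coverage) <- c("1 sd", "2 sd", "3 sd")
print(round(coverage, 4))
```

The middle value is slightly above 95%, which is why the exact 95% multiplier used in confidence intervals is a little smaller than 2.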
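Here is the general formula for the confidence interval:
$$\text{confidence interval} = \text{estimate} \pm \text{margin of error}$$
The estimate is the sample mean (which can be obtained from the sample of 100 weekdays in the example above).
The margin of error tells you how far the true population mean might be from the sample mean and is calculated using this formula:
$$\text{margin of error} = z \times \frac{\sigma}{\sqrt{n}}$$
where z is the critical value. The value of z-critical is fixed for every confidence level. Here are the values for the most commonly used confidence levels:
Confidence level    z critical value
90%                 1.645
95%                 1.96
99%                 2.576
So the overall formula for the confidence interval becomes:
$$\bar{x} \pm z \times \frac{\sigma}{\sqrt{n}}$$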
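The critical values do not have to be memorized, because R can produce them from the standard normal quantile function. A minimal sketch, assuming a two-sided interval so that alpha is split equally between the two tails:

```r
conf_levels <- c(0.90, 0.95, 0.99)

# Two-sided interval: half of alpha goes into each tail
z_critical <- qnorm(1 - (1 - conf_levels) / 2)
names(z_critical) <- paste0(conf_levels * 100, "%")

print(round(z_critical, 3))
```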
Going back to our shopping mall example above, the upper limit of the confidence interval for the 95% confidence level is:
$$42 + 1.96 \times \frac{15}{\sqrt{100}} = 42 + 2.94 = 44.94$$
and the lower limit is:
$$42 - 1.96 \times \frac{15}{\sqrt{100}} = 42 - 2.94 = 39.06$$
The confidence interval is 39.06 to 44.94. That means we are 95% confident that the true mean number of customers in the mall on weekdays between 9 am and 12 pm falls between about 39 and 45 people.
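The manual calculation can also be reproduced in a few lines of R. The sketch below recomputes the interval from the inputs stated for the mall example (sample mean 42, population standard deviation 15, 100 weekdays); the function name z_confidence_interval is a made-up helper, not part of any package.

```r
# Confidence interval for a mean when the population sd is known
# (z_confidence_interval is a hypothetical helper name)
z_confidence_interval <- function(xbar, sigma, n, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)   # critical value for the chosen level
  margin <- z * sigma / sqrt(n)          # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

mall_ci <- z_confidence_interval(xbar = 42, sigma = 15, n = 100)
print(round(mall_ci, 2))
```

Changing conf_level to 0.90 or 0.99 shows how the interval narrows or widens with the confidence level.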
Here I wanted to give a basic idea of the confidence interval. Later in this article, we will see how we get the confidence interval using the R functionalities in different hands-on examples.
Hypothesis Testing
Hypothesis testing is a process of testing or finding evidence for any claim concerning a population.
As before it is a good idea to understand it by example. Suppose I am claiming that I can swim 10 laps in a row in a 40 ft long swimming pool. Then you asked me to show it. But I could only do 4 laps and got tired. Would anyone believe my claim?
In hypothesis testing, we gather evidence for or against a particular claim. If the data we observe would be exceedingly rare were the claim true, we reject that claim. Probabilities are used to find out how rare the observed result is. Let’s work on an example to see the process clearly.
A school teacher has introduced a new style of reading practice. Now he wants to know if his new style helps the students obtain a better score. It is not possible to check if all the students in the world do well with the new reading style. So, he decided to take a sample of 60 students. The mean score went up by 6.5 points. The population standard deviation is 11. Assume that we want to determine if the new reading technique helped the students improve their scores with a 95% confidence level.
Here we will follow a five-step process to perform the hypothesis test.
We will first use a manual process and then a super fast R function. Let’s see the manual process first.
State the hypotheses. The null hypothesis and the alternative hypothesis need to be declared at the beginning. In this example, the null hypothesis is that there is no change in scores after using the new reading technique. It can be expressed as:
$$H_0: \mu = 0$$
If we find enough evidence against the null hypothesis, we will reject it in favor of the alternative hypothesis. But what can be the alternative hypothesis? Here, the teacher wants to test whether the new technique he introduced helped improve the students' scores. So, the alternative hypothesis is that the mean change is greater than 0:
$$H_a: \mu > 0$$
Select the appropriate test statistic. In this example, the sample size is 60 and we know the population standard deviation. When the sample size is 30 or more and the population standard deviation is known, a z-statistic is appropriate.
Remember, the z-statistic is different from the z critical value we used in the confidence interval.
A test statistic relating to the mean provides a measure of how far x-bar (the sample mean) is from mu (the population mean) under the null hypothesis. The formula for the z-statistic is as follows:
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
Z follows a normal distribution with a mean of 0 and a standard deviation of 1.
Calculate the z-statistic:
$$z = \frac{6.5 - 0}{11 / \sqrt{60}} \approx 4.58$$
The z-statistic comes out to 4.58. We can calculate the p-value from the z-statistic.
On a plot of the standard normal curve, the test statistic is a point on the horizontal axis and the p-value is the area under the curve beyond that point (to the right of it for this "greater than" alternative). This p-value is the probability of observing the test statistic that we observed, or one that is more extreme, assuming that the null hypothesis is true.
As the z-statistic follows a standard normal distribution, you can calculate the p-value from the z-statistic using R very easily. If you need a refresher on probability distribution please check this article (especially the normal distribution part).
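As a sketch of that calculation in base R, using the numbers from the reading example (mean improvement 6.5, population standard deviation 11, 60 students) and the upper tail of the standard normal distribution because the alternative is "greater than":

```r
xbar <- 6.5    # sample mean improvement
mu0 <- 0       # mean under the null hypothesis
sigma <- 11    # population standard deviation
n <- 60        # sample size

# z-statistic: distance of the sample mean from mu0 in standard-error units
z_stat <- (xbar - mu0) / (sigma / sqrt(n))

# One-sided p-value: area under the standard normal curve to the right of z
p_value <- pnorm(z_stat, lower.tail = FALSE)

print(c(z = z_stat, p = p_value))
```

The unrounded z-statistic is used here, so the p-value can differ slightly in the last digits from one computed from a z value rounded to two decimals.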
State the decision rule
The confidence level was 95%, so the significance level (alpha) is 5% or 0.05. We will reject the null hypothesis if the p-value is smaller than alpha. Otherwise, we will fail to reject the null hypothesis.
Draw the conclusion. As we can see, the p-value is 2.325e-06 which is very small and a lot smaller than the alpha value. So, we can say that we have enough evidence to reject the null hypothesis. That means the new reading technique helped students improve their scores.
If you install the package ‘asbio’ in your RStudio, finding the z-statistic and p-value is very simple. The ‘asbio’ library has a function called one.sample.z that takes the mean under the null hypothesis, the sample mean (x-bar), the population standard deviation, the sample size (n), and the alternative hypothesis.
library(asbio)
one.sample.z(null.mu = 0, xbar = 6.5, sigma = 11, n = 60, alternative = 'greater')
The output comes out as a table like this. Look, the z-statistic and p-value are almost the same as we calculated before.
A scientist wanted to test if great white sharks are on average 20 feet in length. He measured 10 great white sharks. The sample mean is calculated to be 22.27 and the sample standard deviation is 3.19. Did he find evidence that great white sharks are longer than 20 feet at the α=0.05 level of significance?
We will follow the same 5 step procedure as example 1.
Setting up the hypothesis and alpha level:
The null hypothesis is that the average length of great white sharks is 20 feet:
$$H_0: \mu = 20$$
If we find enough evidence to reject this null hypothesis, we will accept the alternative hypothesis. The alternative hypothesis in this case is:
$$H_a: \mu > 20$$
That means the mean length of great white sharks is greater than 20 feet.
As mentioned in the example problem, the alpha level is 0.05.
Select the appropriate test statistic.
In this example, the sample size is only 10 and we have the sample standard deviation. We do not know the population standard deviation.
If the sample size is less than 30 and the population standard deviation is unknown, the appropriate test-statistic is a t-statistic.
This is the formula for the t-statistic, where s is the sample standard deviation:
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
Calculate the test-statistic
We have the formula above, but I prefer using R, which gives the t-statistic and p-value in just one line of code.
Here are the lengths of the white sharks saved in a variable x:
x = c(21.8, 22.7, 17.3, 26.1, 26.4, 21.1, 19.8, 24.1, 18.3, 25.1)
Using the t.test function in R we can now calculate the t-statistic and p-value:
t.test(x, mu = 20, alternative = "greater")
	One Sample t-test

data:  x
t = 2.2523, df = 9, p-value = 0.02541
alternative hypothesis: true mean is greater than 20
95 percent confidence interval:
 20.42247      Inf
sample estimates:
mean of x 
    22.27 
Here the t-statistic is 2.2523 and the p-value is 0.02541. The degrees of freedom are n-1. Here n is 10 (the sample size), so there are 9 degrees of freedom.
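To see where those numbers come from, the same statistic can be computed by hand from the shark lengths. This is a sketch in base R that follows the one-sample t formula and then takes the upper-tail probability from the t distribution with n-1 degrees of freedom.

```r
# Shark lengths from the example
x <- c(21.8, 22.7, 17.3, 26.1, 26.4, 21.1, 19.8, 24.1, 18.3, 25.1)
mu0 <- 20             # mean under the null hypothesis
n <- length(x)

# t-statistic uses the sample standard deviation instead of sigma
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
df <- n - 1

# One-sided p-value for the "greater" alternative
p_value <- pt(t_stat, df = df, lower.tail = FALSE)

print(c(t = t_stat, df = df, p = p_value))
```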
State the decision rule
If the p-value is less than or equal to alpha (significance level) reject the null hypothesis. Otherwise do not reject the null hypothesis.
Draw the conclusion
In this example, the p-value is 0.025, which is less than the significance level alpha (0.05). So we have enough evidence to reject the null hypothesis. That means the mean length of great white sharks is greater than 20 feet.
The two examples above are about one sample mean. The next example will be on comparing two means.
We will use the famous heart disease dataset from Kaggle for this demonstration. Please feel free to download the dataset from this link to follow along.
Using the information in the heart disease dataset, find out if the cholesterol level of the male population is less than the cholesterol level of the female population at the significance level of 0.05.
Set up the hypothesis and alpha level
Let’s start with the assumption that the mean cholesterol level in the male and female populations is the same. The null hypothesis is:
$$H_0: \mu_1 = \mu_2$$
Here, mu1 is the mean cholesterol of the male population and mu2 is the mean cholesterol of the female population.
As per the problem statement above, the alternative hypothesis is that the mean cholesterol of the male population is less than the mean cholesterol of the female population:
$$H_a: \mu_1 < \mu_2$$
The significance level alpha is 0.05.
Select the appropriate test statistic
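It is very common to use a t-statistic to compare two means in statistics. The unequal-variance (Welch) version, which is what t.test uses by default, is:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
with degrees of freedom:
$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$
The formula for the degrees of freedom may look scary. But do not worry, we will get the t-statistic and degrees of freedom from the t.test function in R.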
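For readers who want to see the arithmetic behind the two-sample comparison, here is a sketch of the unequal-variance (Welch) t-statistic and its degrees of freedom, checked against t.test. The two vectors are made-up numbers used only for illustration, not values from the Heart dataset, and welch_t is a hypothetical helper name.

```r
# Welch t-statistic and Welch-Satterthwaite degrees of freedom
# (welch_t is a hypothetical helper name)
welch_t <- function(x, y) {
  vx <- var(x) / length(x)   # squared standard error of mean(x)
  vy <- var(y) / length(y)   # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  c(t = t_stat, df = df)
}

# Made-up illustrative data (NOT from the Heart dataset)
group_a <- c(212, 230, 198, 245, 221, 239, 205)
group_b <- c(250, 262, 241, 270, 233, 281, 259, 247)

manual <- welch_t(group_a, group_b)
builtin <- t.test(group_a, group_b)   # Welch test is the default in t.test

print(manual)
print(c(t = unname(builtin$statistic), df = unname(builtin$parameter)))
```

Both print statements should show the same pair of values, since t.test applies these formulas unless var.equal = TRUE is set.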
Calculate the t-statistic and the p-value.
First import the Heart dataset in the RStudio.
h = read.csv('Heart.csv')
We saved the dataset in the variable ‘h’. In the dataset ‘Sex’ value 1 means the male population and 0 means the female population. Now, use the t.test function:
t.test(h$Chol[h$Sex=='1'], h$Chol[h$Sex=='0'], alternative = 'less' , conf.level = 0.95)
	Welch Two Sample t-test

data:  h$Chol[h$Sex == "1"] and h$Chol[h$Sex == "0"]
t = -3.0643, df = 136.37, p-value = 0.001315
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -10.17916
sample estimates:
mean of x mean of y 
 239.6019  261.7526 
As per the output above, the t-statistic is -3.0643 and the p-value is 0.001315.
You will get an almost identical result with a z-test. The ‘BSDA’ library has a function called z.test. Here is how to use it:
library(BSDA)
z.test(h$Chol[h$Sex=='1'], h$Chol[h$Sex=='0'], alternative = "less", mu = 0, sigma.x = sd(h$Chol[h$Sex=='1']), sigma.y = sd(h$Chol[h$Sex=='0']), conf.level = 0.95)
	Two-sample z-Test

data:  h$Chol[h$Sex == "1"] and h$Chol[h$Sex == "0"]
z = -3.0643, p-value = 0.001091
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
        NA -10.26048
sample estimates:
mean of x mean of y 
 239.6019  261.7526 
Look we got the same test-statistic and very close p-value as before.
All the parameters passed in the z.test function are self-explanatory. Setting mu as 0 may be a bit confusing. It is zero because our null hypothesis is the mean cholesterol level of the male and female population is equal. That means the difference between the two means is zero. The mu parameter takes the difference in mean in the case of two mean comparisons.
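The reason the z-test and the Welch t-test report the same statistic is that, when the sample standard deviations are plugged in for sigma, both use the same formula; only the reference distribution for the p-value differs. The sketch below illustrates this in base R with made-up data (not the Heart dataset); two_sample_z is a hypothetical helper and is not the BSDA function.

```r
# Two-sample z-statistic with sample sds plugged in for sigma
# (two_sample_z is a hypothetical helper, not BSDA::z.test)
two_sample_z <- function(x, y, mu = 0,
                         alternative = c("less", "greater", "two.sided")) {
  alternative <- match.arg(alternative)
  se <- sqrt(var(x) / length(x) + var(y) / length(y))
  z <- (mean(x) - mean(y) - mu) / se   # mu is the hypothesized difference
  p <- switch(alternative,
              less = pnorm(z),
              greater = pnorm(z, lower.tail = FALSE),
              two.sided = 2 * pnorm(abs(z), lower.tail = FALSE))
  c(z = z, p_value = p)
}

# Made-up illustrative data (NOT from the Heart dataset)
group_a <- c(212, 230, 198, 245, 221, 239, 205)
group_b <- c(250, 262, 241, 270, 233, 281, 259, 247)

z_result <- two_sample_z(group_a, group_b, mu = 0, alternative = "less")
t_result <- t.test(group_a, group_b, alternative = "less")

print(z_result)
print(c(t = unname(t_result$statistic), p_value = t_result$p.value))
```

The z-based p-value comes out smaller than the t-based one because the normal distribution has thinner tails than the t distribution; with large samples the two get very close.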
If the p-value is less than the alpha we will reject the null hypothesis and otherwise, we will not reject the null hypothesis.
Draw the conclusion
As the p-value came out to be smaller than the alpha level, we have enough evidence to reject the null hypothesis. So, the cholesterol level in the male population is less than the cholesterol level in the female population.
In my next example, I will not go through the 5-step process because it is getting a bit repetitive. This example will show how to perform a two-sided z-test of mean and calculate a confidence interval using R.
Using the data from the Heart dataset, check if the population mean of the cholesterol level is 245 and also construct a confidence interval around the mean Cholesterol level of the population. Use a significance level of 0.05.
As we are checking if the mean cholesterol level is 245, the null hypothesis is:
$$H_0: \mu = 245$$
Here in the problem, there is no mention of less than or greater than. We will only check if the mean cholesterol level is 245 or not. So, the alternative hypothesis should be:
$$H_a: \mu \neq 245$$
We will use the z-test here as demonstrated in example 3. That will give us the z-statistic, p-value, and confidence interval everything in one simple line of code.
Because we are not comparing two means here, we will pass only one data vector and set the second one to NULL. The ‘alternative’ parameter needs to be set to ‘two.sided’ because we are not checking for greater than or less than here. The mean can be greater or less, so the test is two-sided.
z.test(h$Chol, NULL, alternative = "two.sided", mu = 245, sigma.x = sd(h$Chol), sigma.y = NULL, conf.level = 0.95)
	One-sample z-Test

data:  h$Chol
z = 0.56919, p-value = 0.5692
alternative hypothesis: true mean is not equal to 245
95 percent confidence interval:
 240.8631 252.5230
sample estimates:
mean of x 
 246.6931 
Look, the p-value is 0.5692, which is not less than or equal to the significance level of 0.05. So, we do not have enough evidence to reject the null hypothesis. The data are consistent with a population mean cholesterol level of 245.
From the output above, the confidence interval is 240.86 to 252.52. So we are 95% confident that the population mean cholesterol level falls between 240.86 and 252.52.
It is common to use the z.test or t.test function to find a confidence interval in R. But remember, if you are using these functions to find a two-sided confidence interval, the ‘alternative’ parameter has to be set to ‘two.sided’.
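The effect of the ‘alternative’ parameter on the interval is easy to see with the shark lengths from the earlier example. In this sketch the same data are passed to t.test twice, and the interval is pulled out of the result with the conf.int element.

```r
# Shark lengths from the earlier t-test example
x <- c(21.8, 22.7, 17.3, 26.1, 26.4, 21.1, 19.8, 24.1, 18.3, 25.1)

one_sided <- t.test(x, mu = 20, alternative = "greater")
two_sided <- t.test(x, mu = 20, alternative = "two.sided")

# A one-sided test returns a one-sided bound: the upper limit is Inf
print(one_sided$conf.int)

# A two-sided test returns the usual interval with two finite limits
print(two_sided$conf.int)
```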
Tests for Proportions
We only dealt with problems about means in all the previous examples. We will work on population proportions in the next two examples.
The concept of the test for a proportion is not too different from the tests for means. It tests how far a sample proportion is from the proportion claimed for the larger population. Suppose we want to estimate the proportion of children who had some swimming lessons when they were less than 10 years old. We cannot ask all the children in the world if they had swimming lessons when they were less than 10. So we will take a sample of 100, 1000, 5000, or whatever number is affordable to us and infer information about the large population from that sample.
The test for proportion is only valid if the sample size is large enough. A common rule of thumb is n*p0 and n*(1-p0) both have to be greater than 10 where p0 is the population proportion of interest.
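This rule of thumb is simple to turn into a small check. The sketch below uses a made-up helper name, large_enough, and tries it on a sample of 303 (the size of the Heart dataset used in the examples that follow) and on a small hypothetical sample of 15.

```r
# Rule-of-thumb check: n*p0 and n*(1-p0) must both exceed the threshold
# (large_enough is a hypothetical helper name)
large_enough <- function(n, p0, threshold = 10) {
  n * p0 > threshold && n * (1 - p0) > threshold
}

ok_large <- large_enough(n = 303, p0 = 0.5)   # 151.5 and 151.5
ok_small <- large_enough(n = 15, p0 = 0.5)    # 7.5 and 7.5

print(c(n_303 = ok_large, n_15 = ok_small))
```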
The formula for the z-statistic is:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}$$
In this formula, p-hat is the sample proportion and p0 is the population proportion under the null hypothesis. n is the sample size.
The equation for the confidence interval is:
$$\hat{p} \pm z \times SE$$
The standard error is calculated as:
$$SE = \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}$$
I wanted to show all the formulas briefly but as I mentioned in the beginning, this article will focus on working on R. So, I will work on two examples in R.
The local healthcare provider claimed that 50% of the population of age 29 to 77 is suffering from some type of heart disease. According to this Heart dataset, approximately 46% of the population of age 29 to 77 have heart disease. We decided to calculate the 90% confidence interval for the proportion of the population of the specified age group suffering from heart disease. Also, test if the population proportion suffering from heart disease in this specified age group is 50%.
Set the hypothesis and alpha level first
It’s already mentioned in the problem statement that the confidence level is 90%. So the alpha level is 0.1.
The null hypothesis is:
$$H_0: p = 0.5$$
The alternative hypothesis is:
$$H_a: p \neq 0.5$$
We will use the prop.test function that will provide us with test-statistic, p-value, and confidence interval everything.
For your information, in the dataset, the age range of the sample population is 29 to 77.
To use the prop.test function we need to know the total number of data we have in the Heart dataset and how many people have heart disease.
The dataset has a total of 303 rows (a call such as nrow(h) reports this). Next, count the rows where heart disease is present:
nrow(h[h$AHD == 'Yes',])
Out of 303 people, 139 people have heart disease in the dataset.
prop.test(139, 303, p=0.50, alternative = "two.sided", conf.level = 0.9)
	1-sample proportions test with continuity correction

data:  139 out of 303, null probability 0.5
X-squared = 1.901, df = 1, p-value = 0.168
alternative hypothesis: true p is not equal to 0.5
90 percent confidence interval:
 0.4106097 0.5076377
sample estimates:
        p 
0.4587459 
Look at the output carefully. Here we got X-squared as the test statistic, which is actually the chi-squared statistic. The square root of the chi-squared statistic is the magnitude of the z-statistic (here with the continuity correction applied). The p-value is 0.168, which is bigger than the alpha value of 0.1. We do not have enough evidence to reject the null hypothesis. That means the data are consistent with the claim that 50% of the population aged 29 to 77 suffers from heart disease, at a significance level of 0.1.
The output above shows that the confidence interval is 0.41 to 0.51.
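To connect this output back to the formulas, here is a sketch that computes the z-statistic and a 90% interval by hand from the counts in the example (139 out of 303). Two caveats: the hand calculation uses no continuity correction, so it is compared with prop.test called with correct = FALSE, and the hand-built interval is the simple estimate ± z × SE form, whereas prop.test reports a score-based interval, so the two intervals are close but not identical.

```r
successes <- 139   # people with heart disease in the dataset
n <- 303           # total number of people
p0 <- 0.5          # proportion under the null hypothesis

p_hat <- successes / n

# z-statistic uses p0 in the standard error (null hypothesis assumed true)
z_stat <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)   # two-sided

# Simple (Wald) 90% interval uses p_hat in the standard error
se_hat <- sqrt(p_hat * (1 - p_hat) / n)
wald_ci <- p_hat + c(-1, 1) * qnorm(0.95) * se_hat

# Same test in prop.test without the continuity correction
uncorrected <- prop.test(successes, n, p = p0, correct = FALSE)

print(c(z = z_stat, z_squared = z_stat^2, p = p_value))
print(unname(uncorrected$statistic))   # chi-squared, equals z squared
print(round(wald_ci, 4))
```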
What if we want to compare two proportions?
In the same dataset, let’s check if the population proportion of males and females with heart disease is the same with the age range of 29 to 77. Assume the significance level is 0.1.
The null hypothesis is that the population proportions of males and females with heart disease are the same:
$$H_0: p_1 = p_2$$
The alternative hypothesis is that the population proportions of males and females with heart disease within the specified age range are not the same:
$$H_a: p_1 \neq p_2$$
We can use the exact same function as before here. The only change is, we need to pass the number of males with heart disease and the number of females with heart disease both as the first parameter. And in the second parameter, we need to use the total number of males and the total number of females.
We can get all the counts using the table function in R, cross-tabulating Sex against AHD (a call such as table(h$Sex, h$AHD) produces this layout):
     No Yes
  0  72  25
  1  92 114
Now, we can use the prop.test function:
prop.test(c(114, 25), c(206, 97), alternative = "two.sided", conf.level = 0.9, correct =FALSE)
	2-sample test for equality of proportions without continuity correction

data:  c(114, 25) out of c(92 + 114, 72 + 25)
X-squared = 23.218, df = 1, p-value = 1.446e-06
alternative hypothesis: two.sided
95 percent confidence interval:
 0.1852803 0.4060519
sample estimates:
   prop 1    prop 2 
0.5533981 0.2577320 
As you can see from the output, the p-value is less than the alpha value of 0.1. So we have enough evidence to reject the null hypothesis. That means the population proportion of males and females with heart disease is not the same.
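The reported confidence interval is 0.19 to 0.41. Note that the output labels it a 95 percent interval, so it corresponds to conf.level = 0.95 rather than the 0.9 passed in the call above. We are therefore 95% confident that the difference between the population proportions of males and females with heart disease lies between 0.19 and 0.41.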
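As a cross-check, the two-proportion test and its interval can be recomputed by hand from the counts in the table (114 of 206 males and 25 of 97 females). This is a sketch without continuity correction: the test statistic uses the pooled proportion, while the interval for the difference uses the two separate sample proportions. It computes both a 90% and a 95% interval, which makes it easy to compare them with the interval printed in the output, which is labelled 95 percent.

```r
x <- c(114, 25)   # males and females with heart disease
n <- c(206, 97)   # total males and females

p_hat <- x / n
diff <- p_hat[1] - p_hat[2]

# Test statistic: pooled proportion under the null hypothesis p1 = p2
p_pool <- sum(x) / sum(n)
z_stat <- diff / sqrt(p_pool * (1 - p_pool) * sum(1 / n))
p_value <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)

# Interval for the difference: unpooled standard error
se_unpooled <- sqrt(sum(p_hat * (1 - p_hat) / n))
diff_ci <- function(conf_level) {
  diff + c(-1, 1) * qnorm(1 - (1 - conf_level) / 2) * se_unpooled
}
ci_90 <- diff_ci(0.90)
ci_95 <- diff_ci(0.95)

print(c(z_squared = z_stat^2, p = p_value))
print(rbind("90%" = ci_90, "95%" = ci_95))
```

The 95% row reproduces the bounds shown in the output, and the 90% row is the narrower interval that conf.level = 0.9 corresponds to.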
In this article, I worked through examples of different types of problems that can use confidence intervals, the t-test, and the z-test, using several R functions to perform these statistical inferences. After each example, the interpretation of the results was discussed. These are not the only tests. There are other hypothesis testing methods available. But I believe these tests should be helpful in many problems in your day-to-day work.