Descriptive statistics summarize, present, and analyze data to make it more understandable. If a dataset is large, it is hard to make any sense of the raw data. With descriptive statistics techniques, the data becomes clearer, patterns may emerge, and some conclusions may become evident.
But descriptive statistics do not allow us to reach any conclusion beyond the data that was analyzed. They do not confirm any hypothesis that we have made. You need to study inferential statistics for that. I have added a few links to study inferential statistics at the end of this page.
There are a few general types of statistical measures to describe the data:
- Measures of Central Tendency: Mean, Median, Mode
- Measures of Variation: Range, Standard Deviation, and Interquartile Range
- Five Number Summary: First Quartile, Second Quartile, Third Quartile, Minimum and Maximum
- The shape of the Data: Symmetric, Left-skewed, and Right-skewed.
In this article, I will explain all four of the statistical measures and their properties.
Measures of Central Tendency
There are three common measures that indicate the center of a dataset. They are called the measures of central tendency.
The mean is the most basic one. Probably most of you know it already.
We calculate the mean by summing up all the values and then dividing the sum by the number of values. Here is an example dataset:
12, 18, 20, 16
The mean is (12 + 18 + 20 + 16)/4 = 16.5
The mean of the dataset is sensitive to extreme values. For example, in the above dataset if there is one more value like this:
12, 18, 20, 16, 150
The mean becomes:
(12 + 18 + 20 + 16 + 150)/5 = 43.2
The mean changed drastically because of that one value. It is now much larger than every value in the dataset except 150.
When there are extreme values in a dataset, the mean does not represent the total dataset very well.
The mean of a trimmed dataset is usually more representative of the bulk of the data. If we trim the extreme value 150 from the dataset above, the mean becomes 16.5 again and represents most of the data in the dataset.
Trimming the extreme values is a common technique in statistics and also in data science.
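Here is a minimal Python sketch of the example above, using the standard library `statistics` module. The trimming rule (drop the single largest value) is my own simplification for this tiny dataset, not a general recommendation.

```python
from statistics import mean

data = [12, 18, 20, 16]
with_outlier = data + [150]

mean_plain = mean(data)            # (12 + 18 + 20 + 16) / 4
mean_outlier = mean(with_outlier)  # (12 + 18 + 20 + 16 + 150) / 5

# Assumption for illustration: trim by dropping only the largest value.
trimmed = sorted(with_outlier)[:-1]
mean_trimmed = mean(trimmed)

print(mean_plain, mean_outlier, mean_trimmed)
```

The first and last results match, which shows that the single value 150 is responsible for the whole shift in the mean.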
In a numeric dataset, the median is the value that separates the top 50% of the data from the bottom 50%. Here is an example dataset:
13, 19, 12, 21, 9, 15, 24, 11, 14
Before finding the median, we need to sort the data. After sorting the data, it becomes:
9, 11, 12, 13, 14, 15, 19, 21, 24
The median is the middle point. In this dataset, it is 14.
What if we have one more value, so that the number of values is even, like this:
9, 11, 12, 13, 14, 15, 19, 21, 24, 28
In this case, the median is the average of the two middle values.
The median is (14 + 15)/2 = 14.5
Let’s add an extreme value to this dataset:
9, 11, 12, 13, 14, 15, 19, 21, 24, 28, 278
The median of this updated dataset is 15, only slightly higher than before.
So, unlike the mean, the median is not sensitive to extreme values.
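The three median examples above can be checked with a short Python sketch. It relies on `statistics.median`, which sorts the data internally, so the unsorted list can be passed as it is. The variable names are mine.

```python
from statistics import median

odd_count = [13, 19, 12, 21, 9, 15, 24, 11, 14]   # 9 values
even_count = odd_count + [28]                      # 10 values
with_outlier = even_count + [278]                  # 11 values, one extreme

print(median(odd_count))     # the single middle value
print(median(even_count))    # average of the two middle values
print(median(with_outlier))  # barely moves despite the value 278
```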
The mode is the value that appears most frequently in a dataset.
This is an example dataset:
23, 45, 34, 32, 45, 12, 23, 37, 45
Here, 45 appeared 3 times, 23 appeared 2 times, and the rest of the values appeared once. So, the mode of the dataset is 45.
If every value appears only once in the dataset, there is no mode.
If several values are tied for the highest count, they are all modes. Suppose the dataset above is modified to be:
23, 45, 34, 32, 45, 12, 23, 37, 45, 23
Here 45 and 23 both appeared 3 times. So, both 23 and 45 are the modes of the dataset.
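A small Python sketch of these rules follows. The helper `modes` is a name I made up for illustration. It returns an empty list when no value repeats, to follow the "no mode" convention used in this article (the standard library function `statistics.multimode` would return every value in that case instead).

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest count, or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value appears once: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([23, 45, 34, 32, 45, 12, 23, 37, 45]))      # one mode
print(modes([23, 45, 34, 32, 45, 12, 23, 37, 45, 23]))  # two modes
print(modes([1, 2, 3]))                                  # no mode
```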
Measures of Variation
The measures of center discussed above are not always the best way to describe the dataset and draw conclusions. For example, here are two datasets:
data1: 74, 75, 78, 78, 80
data2: 69, 74, 78, 78, 86
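The mean, median, and mode of these datasets are exactly the same: 77, 78, and 78. Please check for yourself. But the values of data2 are clearly more spread out, and measures of variation capture that difference.
The three most commonly used measures of variation are the range, the standard deviation, and the interquartile range.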
The range of the dataset is the difference between the largest and smallest value of a dataset. The range of data1 mentioned above is 6 and the range of data2 is 17. So, that gives some more perspective about the spread of the datasets.
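To check this yourself, here is a minimal Python sketch. The function name `summarize` is my own choice. It returns the mean, median, mode, and range of a dataset so the two datasets can be compared side by side.

```python
from statistics import mean, median, mode

def summarize(data):
    """Return (mean, median, mode, range) for a list of numbers."""
    return mean(data), median(data), mode(data), max(data) - min(data)

data1 = [74, 75, 78, 78, 80]
data2 = [69, 74, 78, 78, 86]

print("data1:", summarize(data1))
print("data2:", summarize(data2))
```

The first three numbers are identical for both datasets. Only the last one, the range, tells them apart.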
The standard deviation measures how far the values in a dataset typically lie from the mean.
Calculating this one is a bit more complicated than the previous ones. These are the steps to calculate the standard deviation:
Step 1: Calculate the mean
Step 2: Take the difference of each value from the mean
Step 3: Take the squares of those differences and add them
Step 4: Divide that sum by the total number of values. This number is called the variance
Step 5: Take the square root of the variance
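Here is the formula for the standard deviation, where N is the number of values, x_i is each value, and μ is the mean:
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$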
Let’s work out an example:
Use data1 from above:
data1: 74, 75, 78, 78, 80
Step 1: The mean is 77
Step 2 and Step 3: Take the difference of each value from the mean, square the differences, and add them:
(74 - 77)² + (75 - 77)² + (78 - 77)² + (78 - 77)² + (80 - 77)² = 9 + 4 + 1 + 1 + 9 = 24
Step 4: Divide this number by the number of values in the dataset to get the variance. We have five values. So, the variance is 24/5 = 4.8.
Step 5: Taking the square root of the variance, the standard deviation comes out to be about 2.19.
If you follow the same process for data2 above (data2: 69, 74, 78, 78, 86), you will get about 5.59. Please feel free to try it yourself.
You see the standard deviation gives an idea about how the values are spread. In other words, how the values vary from each other.
The more the values vary in the dataset, the larger the standard deviation.
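The five steps can be sketched in Python as below. The function name `population_std` is mine. Because the steps divide by the total number of values, this is the population standard deviation, which the standard library exposes as `statistics.pstdev`. The related `statistics.stdev` divides by one less than the number of values and would give slightly larger results.

```python
from math import sqrt
from statistics import pstdev

def population_std(values):
    m = sum(values) / len(values)                    # Step 1: mean
    squared_sum = sum((x - m) ** 2 for x in values)  # Steps 2 and 3
    variance = squared_sum / len(values)             # Step 4
    return sqrt(variance)                            # Step 5

data1 = [74, 75, 78, 78, 80]
data2 = [69, 74, 78, 78, 86]

print(round(population_std(data1), 2), round(pstdev(data1), 2))
print(round(population_std(data2), 2), round(pstdev(data2), 2))
```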
These are some rules of thumb related to the standard deviation. For a bell-shaped distribution:
- About 99.7% of the data lies within 3 standard deviations on either side of the mean.
- About 95% of the data lies within 2 standard deviations of the mean.
- Approximately 68% of the data lies within 1 standard deviation of the mean.
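These percentages can be reproduced with a few lines of Python, assuming the bell-shaped distribution is an exact normal distribution. For a normal distribution, the share of values within k standard deviations of the mean is erf(k / √2), where erf is the error function from the `math` module. The function name `share_within` is my own.

```python
from math import erf, sqrt

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1), "%")
```

Real datasets are only approximately bell-shaped, so the observed shares will usually differ a little from these values.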
Look at this picture from Wikipedia:
Interquartile Range will be explained in the ‘Five Number Summary’ Section below because it relates to those five numbers.
Five Number Summary
These five numbers are:
The first quartile divides the bottom 25% of the data from the top 75%. Look at the example below. In the picture below, Q1 is the first quartile.
The second quartile is actually the median, which divides the top 50% from the bottom 50% of the data. In the picture below, Q2 denotes the second quartile.
The third quartile divides the bottom 75% of the data from the top 25%. Q3 represents the third quartile in the picture below.
Let’s see the example. Here is a sorted dataset d:
d = 33, 36, 38, 40, 41, 44, 49, 53, 56, 61, 66, 71
There are 12 values here. The first, second, and third quartile will be positioned as below:
Let’s do the calculation.
The first quartile is: (38 + 40)/2 = 39
The second quartile is: (44 + 49)/2 = 46.5
The third quartile is: (56 + 61)/2 = 58.5
You are probably thinking that we have only three measures here: the first, second, and third quartiles. How is that a five-number summary?
The two other measures are the maximum and the minimum.
They are self-explanatory. I am sure you know what they are.
But we need to understand their implications. Before that, we need to understand one more important term.
That is the interquartile range (IQR).
IQR is the difference between the third quartile (Q3) and the first quartile (Q1). So, IQR represents the variation in the middle 50% of the data. The IQR of the above-mentioned dataset is 58.5 - 39 = 19.5.
IQR can be very useful for identifying the extreme values, or outliers, that we talked about when calculating the mean and the median.
The reasonable lower limit of the dataset is
Q1 - 1.5 * IQR
And the reasonable upper limit of the dataset is
Q3 + 1.5 * IQR
The lower limit for the dataset d above is:
39 - 1.5 * 19.5 = 9.75
The upper limit for the dataset d is:
58.5 + 1.5 * 19.5 = 87.75
If you notice, all the values in the dataset are in between this calculated range of lower limit and the upper limit. So, there are no outliers or extreme values in this dataset.
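Here is a Python sketch of the whole calculation for dataset d. It assumes the quartile method used in this article: Q1 and Q3 are the medians of the lower and upper halves of the sorted data. With this method Q1 comes out as (38 + 40)/2 = 39. The function name `five_number_summary` is mine, and it expects at least two values.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]        # bottom half (middle value excluded if n is odd)
    upper = data[(n + 1) // 2 :]  # top half
    return data[0], median(lower), median(data), median(upper), data[-1]

d = [33, 36, 38, 40, 41, 44, 49, 53, 56, 61, 66, 71]
minimum, q1, q2, q3, maximum = five_number_summary(d)

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in d if x < lower_limit or x > upper_limit]

print(minimum, q1, q2, q3, maximum)
print(iqr, lower_limit, upper_limit, outliers)
```

Library functions such as `statistics.quantiles` or NumPy's `percentile` interpolate between data points in other ways by default, so they may report slightly different quartiles for the same data.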
Shape of Data
The shape of the data represents the distribution of the data throughout the range. There are three major types of distribution in statistics.
When the values below the mean are distributed the same way as the values above the mean, the distribution is called the symmetrical distribution.
The bell-shaped curve that I mentioned before is the best-known example of a symmetric distribution.
For symmetrically shaped data, the mean and median are the same.
When most of the data are in the upper portion of the distribution, the shape of the dataset is left-skewed.
These are the properties of the left-skewed dataset:
a. the difference between the minimum value and median is greater than the difference between the maximum value and the median.
b. the difference between the minimum value and the first quartile is greater than the difference between the maximum value and the third quartile.
c. the difference between the first quartile and the median is greater than the difference between the third quartile and the median.
In a left-skewed shape, the tail lies on the left side.
When most of the data are in the lower portion of the distribution, the curve is a right-skewed curve.
These are the properties of the right-skewed dataset:
a. the difference between the minimum value and the median is less than the difference between the maximum value and the median.
b. the difference between the minimum value and the first quartile is less than the difference between the third quartile and the maximum value.
c. the difference between the first quartile and the median is less than the difference between the third quartile and the median.
In a right-skewed shape, the tail lies on the right side.
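The three properties (a, b, and c) can be turned into a rough check in Python. This is only a sketch under my own assumptions: the function name `describe_shape`, the median-of-halves quartiles, the "mixed" label for datasets where the three properties disagree, and the small sample datasets are all made up for illustration.

```python
from statistics import median

def describe_shape(values):
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])
    q2 = median(data)
    q3 = median(data[(n + 1) // 2 :])
    low, high = data[0], data[-1]
    # (distance on the left side, distance on the right side)
    pairs = [
        (q2 - low, high - q2),  # property a
        (q1 - low, high - q3),  # property b
        (q2 - q1, q3 - q2),     # property c
    ]
    if all(left > right for left, right in pairs):
        return "left-skewed"
    if all(left < right for left, right in pairs):
        return "right-skewed"
    if all(left == right for left, right in pairs):
        return "symmetric"  # exact equality, so only realistic for tidy data
    return "mixed"

print(describe_shape([1, 2, 2, 3, 3, 4, 5, 9, 20]))            # long tail on the right
print(describe_shape([-20, -9, -5, -4, -3, -3, -2, -2, -1]))   # long tail on the left
print(describe_shape([1, 2, 3, 4, 5]))
```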
These are the most basic and major types of measures in descriptive statistics. Nobody calculates them manually anymore. Data scientists and analysts use a programming language like Python or R to derive them. These measures are used every day in research, statistics, and data science. So, it is important to understand them clearly to understand the data.