Understanding a dataset means more than computing the mean, median, and standard deviation. Often it is just as important to learn the variability, spread, or distribution of the data. Histograms and boxplots both reveal a lot of extra information that helps in understanding a dataset.
A histogram takes a single variable from the dataset and shows how often its values fall into each range (bin). I will use a simple dataset to show how a histogram helps in understanding the data. Let’s import the dataset:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("Cartwheeldata.csv")
This dataset contains cartwheel data. Suppose the people in an office held a cartwheel distance competition at a picnic, and the dataset above shows the results. Let’s understand the data.
1. Make a histogram of ‘Age’.
# histplot replaces the deprecated distplot; it draws plain counts (no KDE) by default
sns.histplot(df["Age"]).set_title("Histogram of age")
From the picture above, it is clear that most people are below 30. Only one person is 39 and one is 54. The distribution is right-skewed.
2. Make a histogram of ‘CWDistance’.
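A quick numeric check can back up what the eye sees in a right-skewed histogram: the mean gets pulled above the median by the few large values. The sketch below uses made-up ages, not the Cartwheel data, and the variable names are only for illustration.

```python
import pandas as pd

# Made-up ages for illustration only; these are NOT the Cartwheel data.
ages = pd.Series([22, 23, 23, 24, 25, 26, 26, 27, 28, 29, 39, 54])

mean_age = ages.mean()
median_age = ages.median()
skewness = ages.skew()  # sample skewness; positive means a longer right tail

print(f"mean={mean_age:.1f}, median={median_age:.1f}, skew={skewness:.2f}")

# In a right-skewed sample the mean sits above the median.
is_right_skewed = mean_age > median_age and skewness > 0
```

With the real data, the same three calls on df["Age"] should tell the same story as the histogram.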
sns.histplot(df["CWDistance"]).set_title("Histogram of CWDistance")
Such a nice staircase! It is hard to say which range has the highest frequency.
3. Sometimes plotting two distributions together gives a better understanding. Plot ‘Height’ and ‘CWDistance’ in the same figure.
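When the tallest bar is hard to pick out by eye, the bin counts can be computed directly. Here is a minimal sketch with NumPy's histogram function; the distances are made up for illustration, and the choice of 5 bins is an assumption rather than what seaborn picked for the plot above.

```python
import numpy as np

# Made-up distances for illustration only; these are NOT the Cartwheel data.
distances = np.array([63, 66, 70, 72, 79, 81, 82, 85, 87, 90, 92, 93, 101, 107, 115])

# counts[i] is the number of values between edges[i] and edges[i + 1].
counts, edges = np.histogram(distances, bins=5)

# Index of the bin holding the most values (the first one if there is a tie).
fullest = int(np.argmax(counts))
low, high = edges[fullest], edges[fullest + 1]

print(f"Fullest bin: {low:.1f} to {high:.1f} with {counts[fullest]} values")
```

Keep in mind that the answer depends on the number of bins, so it is worth trying a few values before drawing a conclusion.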
sns.histplot(df["Height"])  # first variable, drawn on the same axes
sns.histplot(df["CWDistance"]).set_title("Histogram of Height and CWDistance")
We cannot say from this picture that there is a relationship between Height and CWDistance, because overlaid histograms show each variable's distribution but not how the two vary together.
Now let’s see what kind of information we can extract from boxplots.
A boxplot shows the distribution of the data in more detail. It shows the outliers more clearly, along with the maximum, minimum, first quartile (Q1), third quartile (Q3), interquartile range (IQR), and median. The IQR, the distance from Q1 to Q3, spans the middle 50% of the data. Here is the picture:
It also tells you about the skewness of the data, how tightly the data is grouped, and its spread.
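To make these quantities concrete, here is a small sketch that computes them by hand. The values are made up for illustration, and the 1.5 × IQR rule for flagging outliers is the usual default whisker setting in matplotlib and seaborn, so I am assuming that setting here.

```python
import numpy as np

# Made-up values for illustration only; these are NOT the Cartwheel data.
values = np.array([2, 4, 4, 5, 6, 6, 7, 8, 8, 10, 25])

# Quartiles with NumPy's default linear interpolation.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # width of the box, covering the middle 50% of the data

# Points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Outliers: {outliers.tolist()}")
```

Different tools interpolate quartiles slightly differently, so numbers read off a plot may not match a hand calculation exactly.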
Let’s see some examples using our Cartwheel data.
1. Make a boxplot of ‘Score’.
From this plot, we can say,
a. The distribution looks roughly symmetric (a boxplot alone cannot confirm that it is normal)
b. The median is 6
c. The minimum score is 2
d. The maximum score is 8
e. The first quartile (first 25%) is at 4
f. The third quartile (75%) is at 8
g. The middle 50% of the data ranges from 4 to 8
h. The interquartile range is 4.
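The call that draws the Score boxplot is not shown above. Following the other seaborn calls in this article, it would look something like sns.boxplot(x=df["Score"]); treat that as an assumption. The sketch below runs the same call on a small made-up stand-in for df so that it is self-contained, and then checks the quartiles numerically.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up scores standing in for df["Score"]; these are NOT the Cartwheel data.
df_demo = pd.DataFrame({"Score": [2, 3, 4, 4, 5, 6, 6, 7, 8, 8, 10]})

# Horizontal boxplot of a single variable.
ax = sns.boxplot(x=df_demo["Score"])
ax.set_title("Boxplot of Score")

# Numeric check of the values you would otherwise read off the plot.
q1, median, q3 = df_demo["Score"].quantile([0.25, 0.5, 0.75])
print(f"Q1={q1}, median={median}, Q3={q3}, IQR={q3 - q1}")
```

With the real data, replace df_demo with the df loaded from the CSV file.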
2. It can be helpful to plot two variables in the same boxplot to understand how one affects the other. Plot ‘CWDistance’ and ‘Glasses’ in the same plot to see if glasses have any effect on CWDistance.
sns.boxplot(x = df["CWDistance"], y = df["Glasses"])
People without glasses have a higher median than people with glasses. Their overall range is smaller, but their IQR sits at higher values: from the picture above, it runs from about 72 to 94. For people with glasses the overall range of CWDistance is larger, but the IQR runs from about 66 to 90, which is lower than for people without glasses.
3. Separate histograms of CWDistance for people with and without glasses may give more insight.
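Reading medians and quartiles off a picture is approximate, so it can help to compute them per group. This is a minimal sketch with pandas groupby; the rows are made up for illustration, and the "Y"/"N" coding of Glasses is my assumption about how that column is stored.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT the Cartwheel data.
# The "Y"/"N" coding of Glasses is an assumption.
df_demo = pd.DataFrame({
    "Glasses": ["Y", "N", "Y", "N", "Y", "N", "Y", "N"],
    "CWDistance": [66, 72, 70, 85, 90, 94, 101, 81],
})

# One row per group, one column per quartile.
summary = (
    df_demo.groupby("Glasses")["CWDistance"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
summary.columns = ["Q1", "median", "Q3"]
summary["IQR"] = summary["Q3"] - summary["Q1"]

print(summary)
```

The resulting table puts the two groups side by side, which makes the comparison of medians and IQRs easier than squinting at the boxes.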
g = sns.FacetGrid(df, row = "Glasses")
g = g.map(plt.hist, "CWDistance")
In this picture, the highest frequency for people with glasses is at the low end of CWDistance. More study is required to make an inference about the effect of glasses on CWDistance. Constructing a confidence interval may help.
I hope this article gave you some additional information about boxplot and histogram.
Recommended further reading:
- The confidence interval, calculation, and characteristics
- Calculation of confidence interval for population proportion and the difference in population proportion.
- Calculation of confidence interval for mean and the difference in mean.