Mastering Histograms in Matplotlib

The histogram is one of the most popular plots. It helps you understand the overall distribution of a continuous variable, so almost any data analysis, exploratory data analysis, or machine learning project starts with a few histograms. In this article, I will explain how to make histograms in Matplotlib. As usual, I will start with the simplest histogram and slowly move toward some more interesting ones.

I will use the NHANES dataset that is publicly available. Please see the release notes here.

Please feel free to download the dataset from here.

Let’s dive in.

First, import the necessary packages and make the DataFrame:

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("nhanes_2015_2016.csv")
df.head()

The dataset is pretty big! These are the columns in the dataset:

df.columns

Output:

Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR', 'RIDRETH1',
'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR', 'SDMVPSU', 'SDMVSTRA', 
'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2', 'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG',
'BMXARML', 'BMXARMC', 'BMXWAIST', 'HIQ210'], dtype='object')

The column names look pretty obscure, but I will explain the meaning of the columns I use in this article.

One Simple Histogram

Here is the simplest histogram possible, the distribution of systolic blood pressure:

df['BPXSY1'].hist(bins = 15, figsize=(8, 6))
plt.xlabel("Systolic Blood Pressure")
plt.title("Distribution of Systolic Blood Pressure")
plt.show()

This very simple histogram shows a slightly right-skewed distribution.
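To back up the visual impression of right skew with a number, here is a minimal sketch. It uses synthetic, gamma-distributed values as a stand-in for `BPXSY1` (an assumption, so that it runs without the CSV file), and the variable name `sbp` is hypothetical. It also shows what `bins = 15` means in terms of counts and bin edges.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['BPXSY1']; these are NOT real NHANES values.
rng = np.random.default_rng(0)
sbp = pd.Series(90 + rng.gamma(shape=4.0, scale=9.0, size=5000), name="BPXSY1")

# bins=15 splits the observed range into 15 equal-width intervals,
# which gives 15 counts and 16 bin edges.
counts, edges = np.histogram(sbp.dropna(), bins=15)

# A positive sample skewness points to a longer right tail.
skewness = sbp.skew()
print(len(counts), len(edges), round(skewness, 2))
```

With the real data, the same check would be `df['BPXSY1'].skew()`.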

But when dealing with a large dataset, it is often very useful to put several distributions in one plot. The next plot shows the distributions of body weight, height, and BMI:

df[['BMXWT', 'BMXHT', 'BMXBMI']].plot.hist(bins = 15, figsize=(10,6), alpha = 0.6)
plt.show()

Luckily, the three distributions in this plot have fairly different ranges, so there is not much overlap. When distributions overlap heavily, it is a good idea to keep the plots separate. For example, if we plot two systolic and two diastolic blood pressures in the same plot, they overlap so much that they are hard to read. In those cases, it is better to put each distribution in its own subplot:

df[['BPXSY1', 'BPXDI1', 'BPXSY2', 'BPXDI2']].hist(
    bins=15,
    figsize=(10, 8),
    grid = False,
    rwidth = 0.9,
)
plt.show()
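`DataFrame.hist` returns the grid of Matplotlib axes it draws on, which is handy for labeling each subplot. The sketch below illustrates this under some assumptions: the data is synthetic (normal draws, not NHANES values), the name `bp` is hypothetical, and the readable labels are my own guesses at what each column code stands for.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line to view the figure
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the four blood pressure columns.
rng = np.random.default_rng(1)
n = 1000
bp = pd.DataFrame({
    "BPXSY1": rng.normal(125, 18, n),
    "BPXDI1": rng.normal(70, 12, n),
    "BPXSY2": rng.normal(124, 18, n),
    "BPXDI2": rng.normal(69, 12, n),
})

# For four columns, hist() lays the subplots out on a 2 x 2 grid of axes.
axes = bp.hist(bins=15, figsize=(10, 8), grid=False, rwidth=0.9)

# Assumed human-readable labels, keyed by the column name that
# hist() uses as the title of each subplot.
labels = {
    "BPXSY1": "Systolic, 1st reading",
    "BPXDI1": "Diastolic, 1st reading",
    "BPXSY2": "Systolic, 2nd reading",
    "BPXDI2": "Diastolic, 2nd reading",
}
for ax in axes.ravel():
    ax.set_xlabel(labels.get(ax.get_title(), ""))

print(axes.shape)
```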

It is much more interesting and useful to see all four blood pressures next to each other, because you can compare them. The comparison is even easier when they are on the same scale. Here we give them the same x-axis and y-axis limits:

df[['BPXSY1', 'BPXDI1', 'BPXSY2', 'BPXDI2']].hist(
    bins=15,
    figsize=(12, 6),
    grid = False,
    rwidth = 0.9,
    sharex = True,
    sharey = True
)
plt.show()

These are the same four variables, but the plots look different because they are now on the same scale. Almost all of them shrank: some in the x-direction and some in the y-direction.
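One way to confirm that `sharex` and `sharey` really put every subplot on the same scale is to compare the axis limits after plotting. This is a minimal sketch on synthetic stand-in data (the name `bp` and the normal draws are assumptions, not NHANES values):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line to view the figure
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the four blood pressure columns.
rng = np.random.default_rng(1)
n = 1000
bp = pd.DataFrame({
    "BPXSY1": rng.normal(125, 18, n),
    "BPXDI1": rng.normal(70, 12, n),
    "BPXSY2": rng.normal(124, 18, n),
    "BPXDI2": rng.normal(69, 12, n),
})

axes = bp.hist(bins=15, figsize=(12, 6), grid=False, rwidth=0.9,
               sharex=True, sharey=True)

# Collect the distinct limits across all subplots; with shared axes
# there should be exactly one pair of x limits and one pair of y limits.
xlims = {tuple(float(v) for v in ax.get_xlim()) for ax in axes.ravel()}
ylims = {tuple(float(v) for v in ax.get_ylim()) for ax in axes.ravel()}
print(len(xlims), len(ylims))
```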

We had different distributions for different variables in the two plots above. But what if I want to plot only one continuous variable based on different categories of a categorical variable?

We have a categorical variable “DMDEDUC2” that represents the education level of the population. I want to plot the distribution of systolic blood pressure for each education level.

But in this dataset, education levels are expressed as numeric values:

df['DMDEDUC2'].unique()

Output:

array([ 5.,  3.,  4.,  2., nan,  1.,  9.])

I will first replace them with some meaningful string values like this:

df["DMDEDUC2x"] = df.DMDEDUC2.replace({1: "<9", 2: "9-11", 3: "HS/GED", 4: "Some college/AA", 5: "College", 7: "Refused", 9: "Don't know"})
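To see what this mapping does without loading the dataset, here is a small sketch that applies the same dictionary to a toy Series holding the codes listed above. The names `codes`, `education_labels`, and `labeled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy Series with the same codes that df['DMDEDUC2'].unique() returned.
codes = pd.Series([5.0, 3.0, 4.0, 2.0, np.nan, 1.0, 9.0], name="DMDEDUC2")

education_labels = {
    1: "<9", 2: "9-11", 3: "HS/GED", 4: "Some college/AA",
    5: "College", 7: "Refused", 9: "Don't know",
}
labeled = codes.replace(education_labels)

# NaN is not a key in the mapping, so missing values stay missing.
print(labeled.tolist())
print(labeled.isna().sum())
```

Code 7 does not occur in the data, so the "Refused" label is simply never used.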

Let’s plot the histogram of systolic blood pressure for each education level:

df.hist(column = "BPXSY1",
        by = "DMDEDUC2x",
        bins = 15,
        figsize=(10, 12),
        grid = False,
        rwidth = 0.9,
        sharex = True,
        sharey = True
       )
plt.show()
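A per-group numeric summary is a useful companion to grouped histograms, because it shows how many observations stand behind each panel. The sketch below assumes synthetic data: the frame `demo`, the random group assignment, and the normal draws are all stand-ins, not NHANES values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the two columns used in the grouped histogram.
rng = np.random.default_rng(2)
levels = ["<9", "9-11", "HS/GED", "Some college/AA", "College", "Don't know"]
n = 600
demo = pd.DataFrame({
    "DMDEDUC2x": rng.choice(levels, size=n),
    "BPXSY1": rng.normal(125, 18, size=n),
})

# One row per education level: group size, mean, and median.
summary = demo.groupby("DMDEDUC2x")["BPXSY1"].agg(["count", "mean", "median"])
print(summary.round(1))
```

On the real DataFrame, the equivalent call would be `df.groupby("DMDEDUC2x")["BPXSY1"].agg(["count", "mean", "median"])`.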

Sometimes it may be interesting to give each histogram its own row. Let’s see how it looks:

ax = df.hist(column = "BPXSY1",
        by = "DMDEDUC2x",
        bins = 15,
        figsize=(8, 18),
        grid = False,
        rwidth = 0.9,
        sharex = True,
        sharey = True,
        layout=(6, 1)
       )

for x in ax:
    # Remove the left, right, and top lines of the box of each plot
    x.spines['right'].set_visible(False)
    x.spines['left'].set_visible(False)
    x.spines['top'].set_visible(False)

    # Show the x-tick labels on each plot
    x.tick_params(axis="both", labelbottom=True)
    x.tick_params(axis='x', rotation=0)

    # Set the y-axis limits (shared by all plots because sharey=True)
    x.set_ylim(0, 300)
    vals = x.get_yticks()

    # Draw light grey horizontal grid lines at each y-tick
    for v in vals:
        x.axhline(y=v, linestyle="solid", color='gray', alpha=0.1)

# Add some space between the plots
plt.subplots_adjust(hspace=0.8)
plt.show()

So, these are all the histograms for today.

Conclusion

These are all the plots for histograms in Matplotlib. Hopefully, this was a helpful tutorial and you will be able to use these types of plots in your projects.

Here is a YouTube video that explains the same types of plots step by step.
