PairPlot and PairGrid in Python

When I first learned to make pair plots, I used them in every project for some time. I still use them a lot. They can be made using just one simple function. Even the most basic one is very useful in a data analytics project. We know that a scatter plot is widely used to present the relationship between two continuous variables. A pair plot puts the scatter plots for several pairs of variables together in one grid. This article is a tutorial on how to make pair plots of different styles.

This article will cover:

  1. Pair plot using Pandas and Matplotlib
  2. More Stylish and informative Pair plots using Seaborn library
  3. Use of the ‘PairGrid’ function to make more dynamic pair plots

Dataset

I used this famous dataset called the ‘nhanes’ dataset. I find it useful because it has a lot of continuous and categorical features. Please feel free to download the dataset from this link to follow along.

 
The column names may look a bit strange if this dataset is new to you. But don’t worry about it. I will explain what they mean when I use them.

Pair plot

Let’s import the dataset first

import pandas as pd
df = pd.read_csv("nhanes_2015_2016.csv")

The dataset is pretty big. So, I cannot show a screenshot here. These are the columns in the dataset:

df.columns

Output:

Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR', 'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR', 'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2', 'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC', 'BMXWAIST', 'HIQ210'], dtype='object')

We will start with the most basic ‘pairplot’ using the Pandas library. In Pandas it is called ‘scatter_matrix’.

Before that, I should mention that I will use only a part of the dataset. If I use too many features, the scatter_matrix (or pair plot, whichever you call it) will not be very helpful, because each plot in it will be too small. I chose five continuous variables for this demonstration: ‘BMXWT’, ‘BMXHT’, ‘BMXBMI’, ‘BMXLEG’, ‘BMXARML’. They represent the weight, height, BMI, leg length, and arm length of the population.

We need to import scatter_matrix from pandas.plotting and then simply call the function.

import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

scatter_matrix(df[['BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML']], figsize = (10, 10))
plt.show()

Output:

You get the bivariate scatter plots for all the combinations of the variables given to the scatter_matrix function.
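If you want to check what scatter_matrix gives back, here is a minimal sketch. It assumes a small synthetic DataFrame (random numbers, not the real NHANES values) that only borrows the five column names, so it runs without the CSV file. The call returns a NumPy array holding one Axes per pair of variables.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for the five body-measurement columns (not real NHANES data)
rng = np.random.default_rng(0)
height = rng.normal(166, 10, 200)   # cm
weight = rng.normal(80, 15, 200)    # kg
demo = pd.DataFrame({
    "BMXWT": weight,
    "BMXHT": height,
    "BMXBMI": weight / (height / 100) ** 2,
    "BMXLEG": 0.23 * height + rng.normal(0, 2, 200),
    "BMXARML": 0.22 * height + rng.normal(0, 2, 200),
})

# diagonal="hist" is the default; alpha makes overlapping points easier to see
axes = scatter_matrix(demo, figsize=(10, 10), alpha=0.5, diagonal="hist")
print(axes.shape)  # (5, 5): one Axes per (row variable, column variable) pair
```

Because the result is a plain array of Axes, an individual panel such as axes[0, 1] can be styled afterwards with the usual Matplotlib methods.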

The same plot can be obtained using the seaborn library. Seaborn comes with its own default style, so even the most basic pairplot is a bit more stylish than the scatter_matrix. Instead of the histogram, I chose the density plot for the diagonal plots.

So, I will specify diag_kind = 'kde'. If you want to keep the histogram, just leave diag_kind out. When no hue variable is used, the default diagonal plot is the histogram.

import seaborn as sns
d = df[['BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML']]
sns.pairplot(d, diag_kind = 'kde')

The next plot adds one more variable to this default plot. This time all the plots in the pair plot are segregated by the gender variable, ‘RIAGENDR’. At the same time, I am passing plot_kws to make the points semi-transparent and give them a black edge.

d = df[['BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'RIAGENDR']]
sns.pairplot(d, hue = 'RIAGENDR', diag_kind = 'kde', plot_kws={'alpha':0.5, 'edgecolor': 'k'})

Output:

Pairplot and scatter_matrix are both based on scatter plots. PairGrid brings a bit more flexibility.

PairGrid

Using PairGrid, an empty grid can be generated. And later you can fill this up as you like. Let’s see this in action:

g = sns.PairGrid(df, vars = ['BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML'], hue = 'RIAGENDR')

This line of code will provide an empty grid as follows:




Now, we will fill up those empty boxes. I will use the histograms for the diagonal plots and the rest will stay the scatter plot as before. It does not have to be a scatter plot. It can be any other bivariate plot. We will see an example in a bit:

g.map_diag(plt.hist, alpha = 0.6)
g.map_offdiag(plt.scatter, alpha = 0.5)
g.add_legend()

Using the ‘hue’ parameter, we segregated the plots by gender to see the distributions and scatter plots separately for males and females. The next plots use a continuous variable as the hue. I chose ‘BPXSY1’, which is the systolic blood pressure, for this. I am also going to add a condition to the dataset: I will use only the rows where the age (‘RIDAGEYR’) is over 60.

g = sns.PairGrid(df[df["RIDAGEYR"]>60],
    vars = ['BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML'],
    hue = "BPXSY1")
g.map_diag(sns.histplot, hue = None)
g.map_offdiag(sns.scatterplot)
g.add_legend()

You can see that the darker the color, the higher the systolic blood pressure.
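A continuous hue gives a color gradient. If a few clearly separated groups are easier to read for your purpose, one option is to bin the variable first with pd.cut. This is only a sketch: the data is synthetic, 'BP_group' is a column name I made up, and the cut points 120 and 140 are illustrative choices, not a recommendation.

```python
import numpy as np
import pandas as pd

# Synthetic ages and systolic blood pressure values, for illustration only
rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "RIDAGEYR": rng.integers(18, 81, 300),
    "BPXSY1": rng.normal(125, 18, 300).round(),
})

# Same condition as in the article: keep only ages over 60
older = demo[demo["RIDAGEYR"] > 60].copy()

# Illustrative cut points; right=False makes the bins [low, high)
older["BP_group"] = pd.cut(
    older["BPXSY1"],
    bins=[-np.inf, 120, 140, np.inf],
    right=False,
    labels=["below 120", "120 to 139", "140 and above"],
)
print(older["BP_group"].value_counts().sort_index())
```

The new column can then be passed as hue = "BP_group" to PairGrid in place of the continuous variable.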

Look! The lower triangle and the upper triangle have almost the same plots. If you just switch the axes of the plots in the lower triangle, you get the plots of the upper triangle. So, if you want to see different types of plots in the lower and the upper triangle of the pair plot, PairGrid provides that flexibility. We just have to fill the lower and the upper triangles separately, using map_lower and map_upper.

g = sns.PairGrid(df[df["RIDAGEYR"]>60],
    vars = ['BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML'],
    hue = "RIAGENDR")
g.map_lower(plt.scatter, alpha = 0.6)
g.map_diag(plt.hist, alpha = 0.7)

This will plot only the diagonals and the lower triangle.

Let’s fill up the upper triangle as well. I will put the density plots with shades.

g.map_upper(sns.kdeplot, fill = True)

I think it can be more useful to have two different types of plots instead of almost the same plots in both the triangles.
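The function mapped onto a triangle does not have to come from Matplotlib or seaborn. As a sketch of that idea, the hypothetical helper annotate_corr below writes the Pearson correlation into each upper-triangle panel. It relies on seaborn making the target Axes current before calling a non-seaborn function, which is why plt.gca() picks the right panel. The data is synthetic.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def annotate_corr(x, y, **kwargs):
    """Write the Pearson correlation of x and y onto the current Axes."""
    # Series.corr uses pairwise-complete observations, so NaNs are skipped
    r = pd.Series(x).corr(pd.Series(y))
    ax = plt.gca()
    ax.annotate(f"r = {r:.2f}", xy=(0.5, 0.5), xycoords="axes fraction",
                ha="center", va="center")

# Synthetic data, for illustration only
rng = np.random.default_rng(4)
height = rng.normal(166, 10, 200)
demo = pd.DataFrame({
    "BMXHT": height,
    "BMXLEG": 0.23 * height + rng.normal(0, 2, 200),
    "BMXARML": 0.22 * height + rng.normal(0, 2, 200),
})

g = sns.PairGrid(demo)
g.map_lower(plt.scatter, alpha=0.6)   # scatter plots below the diagonal
g.map_diag(plt.hist, alpha=0.7)       # histograms on the diagonal
g.map_upper(annotate_corr)            # correlation text above the diagonal
```

The **kwargs in the helper matters: seaborn passes extra keyword arguments such as color and label, and the function has to accept them even if it ignores them.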

Again, if you do not want a second type of plot, you can leave out either the upper or the lower triangle entirely. Here I am using corner = True to drop the upper triangle. I am also using diag_sharey = False, so the diagonal plots do not have to share a y-axis.

g = sns.PairGrid(df[df["RIDAGEYR"]>60], diag_sharey = False, corner = True,
    vars = ['BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML'],
    hue = "RIAGENDR")
g.map_lower(plt.scatter, alpha = 0.6)
g.map_diag(plt.hist, alpha = 0.7)

Output:

Here it is! The upper triangle is totally gone!
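To see what corner = True does to the grid itself, here is a small sketch on synthetic data (random numbers with borrowed column names). As far as I can tell from seaborn's behavior, the upper-triangle panels are removed and their slots in g.axes hold None, so any loop over g.axes should skip those entries.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data, for illustration only
rng = np.random.default_rng(5)
demo = pd.DataFrame(rng.normal(size=(100, 3)),
                    columns=["BMXWT", "BMXHT", "BMXBMI"])

# corner=True drops the upper triangle; diag_sharey=False lets each
# diagonal plot use its own y-axis scale
g = sns.PairGrid(demo, corner=True, diag_sharey=False)
g.map_lower(plt.scatter, alpha=0.6)
g.map_diag(plt.hist, alpha=0.7)

# Upper-triangle slots are None, lower-triangle slots are real Axes
print(g.axes[0, 1] is None, g.axes[1, 0] is None)  # True False
```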

Conclusion

Here I tried to introduce a few different ways to make and use pair plots. I also introduced PairGrid, which adds more flexibility to pair plots. Hopefully, this was helpful, and you will check the documentation for more styles and options.

