Collection of Advanced Visualization in Python

Python has very rich visualization libraries. I have written about visualization in Pandas and Matplotlib before, but those articles mostly covered the basics with a touch of some advanced techniques. This is another visualization tutorial.

I decided to write a few articles on advanced visualization techniques, and this is the first of them. I won’t cover basic plots here: every visualization in this article goes at least a step beyond the basics, even if some are not especially advanced. If you need a refresher on the basic plots, please have a look at this article first.

As a reminder, if you are reading to learn, please download the dataset and follow along. That’s the only way to learn. Also, don’t forget to find a different dataset and apply these techniques to it.

I will try to explain as well as I can. If you have a hard time implementing the code yourself, don’t hesitate to ask questions in the comment section. I will try to answer them to the best of my ability.

Here is the link to the dataset I am going to use for all the visualizations today.

I will start with some slightly problematic multivariate plots and then move towards some more sophisticated, clearer solutions.

Let’s import the necessary packages and the dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import warnings; warnings.filterwarnings(action="once")
df = pd.read_csv("nhanes_2015_2016.csv")

This dataset is quite big. So I am not able to show it here. But we can see the columns in the dataset here:

df.columns

Output:

Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR','RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR','SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2','BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC','BMXWAIST', 'HIQ210'],dtype='object')

You are probably thinking that the column names are very obscure!

Yes, they are. But don’t worry, I will keep explaining as we go.

There are a few categorical columns in the dataset that we will use a lot, such as gender (RIAGENDR), marital status (DMDMARTL), and education level (DMDEDUC2). I want to convert them to meaningful labels rather than leaving them as numeric codes.

df["RIAGENDRx"] = df.RIAGENDR.replace({1: "Male", 2: "Female"})
df["DMDEDUC2x"] = df.DMDEDUC2.replace({1: "<9", 2: "9-11", 3: "HS/GED", 4: "Some college/AA", 5: "College", 7: "Refused", 9: "Don't know"})
df["DMDMARTLx"] = df.DMDMARTL.replace({1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated", 5: "Never married", 6: "Living w/partner", 77: "Refused"})
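A quick check can confirm that the recoding behaves as expected. The sketch below uses a tiny made-up frame rather than the NHANES file; the name `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the RIAGENDR column (values are made up for illustration).
toy = pd.DataFrame({"RIAGENDR": [1, 2, 2, 1, 2]})

# Same mapping as in the article: 1 -> Male, 2 -> Female.
toy["RIAGENDRx"] = toy["RIAGENDR"].replace({1: "Male", 2: "Female"})

# Count how many rows ended up with each label.
counts = toy["RIAGENDRx"].value_counts()
print(counts)
```

Codes that are not in the dictionary are left unchanged by `replace`, so any unexpected code shows up as a leftover number in the `value_counts()` output.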

Probably the most basic plot that we learned was a line plot or a scatter plot. Here I will start with a scatter plot. But there will be a little twist to it.

For this demonstration, I will plot diastolic (BPXDI1) vs systolic (BPXSY1) blood pressure. The little twist is that I will plot them in different colors for different marital statuses. It will be interesting to see if marital status has any effect on blood pressure.

First, find out how many unique marital statuses there are in the dataset.

category = df["DMDMARTLx"].unique()
category

Output:

array(['Married', 'Divorced', 'Living w/partner', 'Separated',
'Never married', nan, 'Widowed', 'Refused'], dtype=object)

Now, select colors for each category:

colors = [plt.cm.tab10(i/float(len(category)-1)) for i in range(len(category))]
colors

Output:

[(0.12156862745098039, 0.4666666666666667, 0.7058823529411765, 1.0),
(1.0, 0.4980392156862745, 0.054901960784313725, 1.0),
(0.17254901960784313, 0.6274509803921569, 0.17254901960784313, 1.0),
(0.5803921568627451, 0.403921568627451, 0.7411764705882353, 1.0),
(0.5490196078431373, 0.33725490196078434, 0.29411764705882354, 1.0),
(0.4980392156862745, 0.4980392156862745, 0.4980392156862745, 1.0),
(0.7372549019607844, 0.7411764705882353, 0.13333333333333333, 1.0),
(0.09019607843137255, 0.7450980392156863, 0.8117647058823529, 1.0)]
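The fractional index `i/(len(category)-1)` spreads the eight categories over the whole 0 to 1 range of `tab10`, which is why the output above skips two of the ten palette colors (it jumps from green straight to purple). If you would rather take the first eight colors in order, passing an integer index is one option. This is a minimal sketch; the `labels` list is a hypothetical stand-in for the real category array, with "Missing" in place of `nan`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex

# Hypothetical category labels, standing in for df["DMDMARTLx"].unique().
labels = ["Married", "Divorced", "Living w/partner", "Separated",
          "Never married", "Missing", "Widowed", "Refused"]

# An integer argument selects the i-th entry of a listed colormap directly.
color_map = {label: to_hex(plt.cm.tab10(i)) for i, label in enumerate(labels)}
print(color_map)
```

Keeping the colors in a dictionary keyed by label also makes it harder to mix up which color belongs to which category.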

You can also explicitly make a list of the names of your favorite colors instead. Now we are ready to do the visualization. We will loop through the categories and plot them one by one to build up the full plot.

plt.figure(figsize=(16, 10), dpi=80, facecolor="w", edgecolor="k")
# Plot one marital-status category at a time so each gets its own color and legend entry
for i, cat in enumerate(category):
    plt.scatter("BPXDI1", "BPXSY1",
                data=df.loc[df.DMDMARTLx == cat, :],
                s=20, color=colors[i], label=str(cat))

plt.gca().set(xlabel="BPXDI1",
              ylabel="BPXSY1")

plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title("Marital status vs Systolic blood pressure", fontsize=18)
plt.legend(fontsize=12)
plt.show()

 

Image for post

You can add one more variable from this dataset to control the size of the dots. For this, I will use the body mass index (BMXBMI). I will make a separate column named ‘dot_size’ that holds the body mass index multiplied by 10.

df["dot_size"] = df.BMXBMI*10
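It helps to know that Matplotlib reads the `s` argument of `scatter` as the marker area in points squared, so multiplying BMI by 10 scales the area of each dot, not its diameter. The sketch below uses a few made-up BMI values to show the sizes that end up on the plot.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Made-up BMI values, used only to illustrate the scaling.
bmi = np.array([18.5, 25.0, 32.0, 40.0])
sizes = bmi * 10  # same scaling as the dot_size column

fig, ax = plt.subplots()
points = ax.scatter([1, 2, 3, 4], [1, 1, 1, 1], s=sizes, alpha=0.6)

# The returned collection stores the marker areas that were passed in.
print(points.get_sizes())
```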

Do the visualization now:

fig = plt.figure(figsize=(16, 10), dpi=80, facecolor="w", edgecolor="k")
for i, cat in enumerate(category):
    # s="dot_size" sizes each marker by the BMI-based column created above
    plt.scatter("BPXDI1", "BPXSY1", data=df.loc[df.DMDMARTLx == cat, :],
                s="dot_size", color=colors[i], label=str(cat),
                edgecolors="black", linewidths=0.5)
plt.gca().set(xlabel="Diastolic Blood Pressure", ylabel="Systolic Blood Pressure")
plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.legend(fontsize=12)
plt.show()

Image for post

Looks too crowded, right? It is hard to understand anything from it. You will find some solutions to this problem in the later plots.

One way to fix this type of problem is to take a random sample from the dataset.

Our dataset is too large to plot in full. If we take a sample of 500 rows from it, this type of visualization will be a lot more understandable.
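If you want a truly random sample rather than relying on the row order, pandas can draw one directly. This is a minimal sketch on synthetic data; the frame `toy` and its random values are assumptions, and only the column names mirror the article.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the survey table; the values are random
# and carry no real meaning.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "BPXDI1": rng.normal(70, 10, size=5000),
    "BPXSY1": rng.normal(125, 15, size=5000),
})

# Draw 500 rows at random; random_state makes the draw repeatable.
sample = toy.sample(n=500, random_state=42)
print(sample.shape)
```

With the real data, the equivalent call would be `df.sample(n=500, random_state=42)`.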

In the next plot, I will take the first 500 rows of the dataset, assuming that the rows are not ordered in any meaningful way. I will add one more twist: another variable, age, because age can have an effect on blood pressure. Here I will encircle the data where age is more than 40. Here is the code:

from scipy.spatial import ConvexHull

df2 = df.iloc[:500, :]  # first 500 rows
fig = plt.figure(figsize=(16, 10), dpi=80, facecolor="w", edgecolor="k")
for i, cat in enumerate(category):
    plt.scatter("BPXDI1", "BPXSY1", data=df2.loc[df2.DMDMARTLx == cat, :],
                s="dot_size", color=colors[i], label=str(cat),
                edgecolors="black", alpha=0.6, linewidths=0.5)

def encircle(x, y, ax=None, **kw):
    # Draw the convex hull of the points (x, y) as a polygon patch
    if not ax:
        ax = plt.gca()
    p = np.c_[x, y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices, :], **kw)
    ax.add_patch(poly)

# Select data where age is more than 40
df_encircle = df2.loc[(df2["RIDAGEYR"] > 40), :].dropna()
# Drawing a polygon surrounding vertices
encircle(df_encircle.BPXDI1, df_encircle.BPXSY1, ec="k", fc="gold", alpha=0.1)
encircle(df_encircle.BPXDI1, df_encircle.BPXSY1, ec="firebrick", fc="none", linewidth=1.5)
plt.gca().set(xlabel="BPXDI1", ylabel="BPXSY1")
plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.title("Bubble Plot with Encircling", fontsize=22)
plt.legend(fontsize=12)
plt.show()

Image for post

What can we draw from this plot?

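The polygon is drawn around the people in our 500-row sample who are over 40 years old. Because it is a convex hull, it marks the region those points occupy, so younger people who fall inside that region are enclosed as well.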
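To see what the polygon is actually built from, here is a small sketch with five made-up points: the four corners of a square and one point in the middle. Only the corner points end up as hull vertices; the middle point is enclosed but is not part of the outline.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Four corners of a square plus one point in the middle (made-up data).
points = np.array([[0, 0], [4, 0], [4, 4], [0, 4], [2, 2]])

hull = ConvexHull(points)

# hull.vertices holds the indices of the points on the boundary only.
boundary = sorted(hull.vertices.tolist())
print(boundary)
```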

The size of each bubble shows the body mass index: the smaller the bubble, the lower the body mass index, and the larger the bubble, the higher the body mass index. I cannot find any relationship between blood pressure and body mass index from this plot.

The colors are for different marital statuses. Do you see any color dominating a certain area? Not really. I do not see any relationship between marital status and blood pressure either.

Stripplot

This is an interesting type of plot. When multiple data points overlap each other and it is hard to see all the points, jittering some points a little bit gives you the chance to see each point clearly. Stripplot does exactly that.
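The idea behind jittering can be sketched in a few lines of NumPy. This is not seaborn's own implementation, only an illustration of the principle with made-up values: every point gets a small random offset so that identical positions no longer sit on top of each other.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ten observations that all share the same x position (made-up example).
x = np.full(10, 3.0)

# Shift each point by a random offset in [-jitter, +jitter].
jitter = 0.45
x_jittered = x + rng.uniform(-jitter, jitter, size=x.size)
print(x_jittered)
```

The offsets only change where a point is drawn along the categorical axis; the measured value on the other axis is left untouched.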

For this demonstration, I will plot systolic blood pressure vs body mass index.

fig, ax = plt.subplots(figsize=(16, 8), dpi=80)
sns.stripplot(x=df2.BPXSY1, y=df2.BMXBMI, jitter=0.45, size=8, ax=ax, linewidth=0.5)
plt.title("Systolic Blood pressure vs Body mass index")
plt.tick_params(axis='x', which='major', labelsize=12, rotation=90)
plt.show()

Image for post

Stripplots can be segregated by a categorical variable as well. But we do not need to use a loop the way we did in the scatter plot above. Stripplot has the ‘hue’ parameter that will do the job. Here I will plot Diastolic vs Systolic blood pressure segregated by Ethnic origin.

fig, ax = plt.subplots(figsize=(16,10), dpi= 80)    
sns.stripplot(x=df2.BPXDI1, y=df2.BPXSY1, size=10, hue=df2.RIDRETH1, ax=ax)
plt.title("Stripplot for Systolic vs Diastolic Blood Pressure", fontsize=20)
plt.tick_params(rotation=90)
plt.show()

Image for post

Stripplot with Box plots

Strip plots can be plotted together with boxplots. When the dataset is big and there are a lot of dots, the combination gives you a lot more information. Check for yourself here:

fig, ax = plt.subplots(figsize=(30, 12))
ax = sns.boxplot(x="BPXDI1", y = "BPXSY1", data=df)
ax.tick_params(rotation=90, labelsize=18)
ax = sns.stripplot(x = "BPXDI1", y = "BPXSY1", data=df)

Image for post

You can see the median, maximum, minimum, range, IQR, and outliers of systolic blood pressure for each individual diastolic value. Isn’t it great!
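The numbers a boxplot draws can also be computed directly. The sketch below uses made-up readings for two diastolic values and works out the median, the quartiles, the IQR, and the upper fence. The 1.5 × IQR rule used here is the usual default for boxplot whiskers.

```python
import pandas as pd

# Made-up systolic readings for two diastolic values.
toy = pd.DataFrame({
    "BPXDI1": [70, 70, 70, 70, 70, 80, 80, 80, 80, 80],
    "BPXSY1": [110, 115, 120, 125, 170, 120, 128, 130, 132, 140],
})

grouped = toy.groupby("BPXDI1")["BPXSY1"]
stats = pd.DataFrame({
    "median": grouped.median(),
    "q1": grouped.quantile(0.25),
    "q3": grouped.quantile(0.75),
})
stats["iqr"] = stats["q3"] - stats["q1"]
# Points above q3 + 1.5 * IQR are usually drawn as outliers.
stats["upper_fence"] = stats["q3"] + 1.5 * stats["iqr"]
print(stats)
```

In this toy data the reading of 170 lies above the upper fence of its group, so a boxplot would draw it as an outlier.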

If you need a refresher on how to extract maximum information from boxplot, please check this article.

Stripplot with Violin Plot

We will present Marital status(DMDMARTLx) vs Age(RIDAGEYR). Let’s see how it looks first. Then we will talk about it some more.

fig, ax = plt.subplots(figsize=(30, 12))
ax = sns.violinplot(x= "DMDMARTLx", y="RIDAGEYR", data=df, inner=None, color="0.4")
ax = sns.stripplot(x= "DMDMARTLx", y="RIDAGEYR", data=df)
ax.tick_params(rotation=90, labelsize=28)

Image for post

It shows the age distribution for each marital status. Look at the violin for ‘Married’: it is almost stable throughout, with some little bumps. ‘Living w/partner’ is very high in the 30s age range and drops drastically after 40. In the same way, you can read the rest of the violins.

It will be even more informative if we can see violin plots segregated by gender.

Let’s do that. Instead of age, let’s go back to diastolic blood pressure. This time we will see diastolic blood pressure vs marital status segregated by gender, with the distribution of diastolic blood pressure shown on the side.

fig = plt.figure(figsize=(16, 8), dpi=80)
grid=plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)
ax_main = fig.add_subplot(grid[:, :-1])
ax_right = fig.add_subplot(grid[:, -1], xticklabels=[], yticklabels=[])
sns.violinplot(x= "DMDMARTLx", y = "BPXDI1", hue = "RIAGENDRx", data = df, color= "0.2", ax=ax_main)
sns.stripplot(x= "DMDMARTLx", y = "BPXDI1", data = df, ax=ax_main)
ax_right.hist(df.BPXDI1, histtype='stepfilled', orientation='horizontal', color='grey')
ax_main.title.set_fontsize(14)
ax_main.tick_params(rotation=10, labelsize=14)
plt.show()
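The layout trick in the code above is the `GridSpec` slicing: the main panel takes every column except the last, and the side panel takes the last column. Here is a stripped-down sketch of just that layout on made-up data. The `sharey` argument is my own addition so that both panels use the same vertical scale; it is not part of the original code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
values = rng.normal(70, 12, size=300)  # made-up readings

fig = plt.figure(figsize=(8, 4))
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)

# Main panel: all rows, every column except the last one.
ax_main = fig.add_subplot(grid[:, :-1])
# Side panel: all rows of the last column, sharing the y-axis (assumption).
ax_right = fig.add_subplot(grid[:, -1], sharey=ax_main)

ax_main.scatter(rng.uniform(0, 1, size=values.size), values, s=10)
ax_right.hist(values, orientation="horizontal", color="grey")
ax_right.tick_params(labelleft=False)  # hide duplicate tick labels
```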

Image for post

Cool, right? Look how much information you can draw from this! This type of plot can be very useful for a presentation or a research report as well.

Adding a Linear Regression Line in the Bubbles

This time I will plot height(BMXHT) vs weight(BMXWT) segregated by gender(RIAGENDR). I will explain some more after making the plot.

g = sns.lmplot(x='BMXHT', y='BMXWT', hue = 'RIAGENDRx', data = df2,
aspect = 1.5, robust=True, palette='tab10',
scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))
plt.title("Height vs weight with line of best fit grouped by Gender", fontsize=20)
plt.show()

Image for post

You can see the segregation between male and female in the plot. The ‘hue’ parameter does the segregation. It is clear from the picture that height and weight are higher in the male population overall. There are linear regression lines for both male and female data.
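To get a feel for what a regression line represents, you can fit one by hand. Note that `lmplot` with `robust=True` uses a robust regression (via statsmodels), while the sketch below uses ordinary least squares from NumPy as a simpler stand-in. The data is made up with a known slope, so the numbers say nothing about NHANES.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up heights (cm) and weights (kg) with a known linear trend.
height = rng.uniform(150, 190, size=200)
weight = 0.9 * height - 80 + rng.normal(0, 5, size=200)

# Fit weight = slope * height + intercept by ordinary least squares.
slope, intercept = np.polyfit(height, weight, deg=1)
print(round(slope, 2), round(intercept, 1))
```

The fitted slope should land close to the 0.9 that was used to generate the data.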

Individual Bubble Plots With Regression Line

We put male and female data both in the same plot and it works because there is clear segregation and it’s only two types. But sometimes segregation is not clear and there are too many categories.

In this section, I will make the lmplot in separate plots. Height and weight may be different for different ethnic origins(RIDRETH1). Instead of gender, we will plot height and weight segregated by ethnic origins in separate plots.

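# lmplot creates its own figure, so a separate plt.figure() call is not needed;
# palette is dropped because it has no effect without hue
g = sns.lmplot(x="BMXHT", y="BMXWT", data=df2, robust=True,
               col="RIDRETH1",
               scatter_kws=dict(s=60, linewidths=0.7, edgecolors="black"))
# Apply the tick font size to every panel, not just the last one
for ax in g.axes.flat:
    ax.tick_params(labelsize=12)
plt.show()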
 

Image for post

 

Pairplot

Pair plots are very popular in exploratory data analysis. They show the relationships between all pairs of variables. Here is an example. I will make a pair plot of height, weight, BMI, and waist size segregated by ethnic origin. I am taking only the first 1000 rows because that might make the plot a bit clearer.

df3 = df.iloc[:1000, :]  # first 1000 rows
# pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(df3[['BMXWT', 'BMXHT', 'BMXBMI', 'BMXWAIST', "RIDRETH1"]], kind="scatter", hue="RIDRETH1", plot_kws=dict(s=30))
plt.show()
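A correlation matrix is a handy numeric companion to a pair plot, because it puts a number on each of the pairwise relationships. The sketch below is run on made-up measurements, where BMI is derived from the random weight and height, so the values are for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Made-up body measurements; BMI is derived from weight and height.
height = rng.normal(168, 9, size=300)   # cm
weight = rng.normal(75, 14, size=300)   # kg
toy = pd.DataFrame({
    "BMXWT": weight,
    "BMXHT": height,
    "BMXBMI": weight / (height / 100) ** 2,
})

# Pearson correlation between every pair of columns.
corr = toy.corr()
print(corr.round(2))
```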

Image for post

Diverging Bars

Diverging bars give a quick intuition about the data. You can see at a glance how the data deviates from a reference value. Here I will show two versions. The first one will involve one categorical variable and the second one will have two continuous variables.

Here is the first one. I will plot the household size (DMDHHSIZ), which is a categorical variable, on the y-axis and normalized systolic blood pressure on the x-axis. We will normalize systolic blood pressure with the standard z-score formula (subtract the mean, then divide by the standard deviation) and split the data at zero.
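The normalization and the color split can be tried on a handful of numbers first. This is a minimal sketch with made-up readings. Note that pandas `std()` computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Made-up systolic readings.
toy = pd.DataFrame({"BPXSY1": [100.0, 110.0, 120.0, 130.0, 140.0]})

x = toy["BPXSY1"]
# z-score: subtract the mean, divide by the sample standard deviation.
toy["BPXSY1_n"] = (x - x.mean()) / x.std()
# Values below the mean become negative and get one color, the rest the other.
toy["colors"] = ["red" if v < 0 else "blue" for v in toy["BPXSY1_n"]]
print(toy)
```

After the transformation the column has mean 0 and standard deviation 1, so zero is the natural point at which to split the bars.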

There will be two colors. Red will denote the negative side and blue will denote the positive side.

This plot will show you at a glance how systolic blood pressure varies with household size.

x = df.loc[:, "BPXSY1"]
df["BPXSY1_n"] = (x - x.mean())/x.std()
df['colors'] = ['red' if i < 0 else 'blue' for i in df["BPXSY1_n"]]
df.sort_values("BPXSY1_n", inplace=True)
df.reset_index(inplace=True)
plt.figure(figsize=(16, 10), dpi=80)
plt.hlines(y = df.DMDHHSIZ, xmin=0, xmax = df.BPXSY1_n, color=df.colors, linewidth=3)
plt.gca().set(ylabel="DMDHHSIZ", xlabel = "BPXSY1_n")
plt.yticks(df.DMDHHSIZ, fontsize=14)
plt.grid(linestyle='--', alpha=0.5)
plt.show()

Image for post

Here household size has several groups. DMDHHSIZ is a coded value, and the dataset itself does not say what each code means. But you can see from the plot above that systolic blood pressure changes with household size. The change shows very clearly. Now you can analyze it further.

I will make another plot where I will plot systolic blood pressure vs age. We already normalized the systolic blood pressure in the previous plot. Let’s just dive into the plot.

# BPXSY1_n was already computed for the previous plot, so only the colors change
df['colors'] = ['coral' if i < 0 else 'lightgreen' for i in df["BPXSY1_n"]]
y_ticks = np.arange(16, 82, 8)
plt.figure(figsize=(16, 10), dpi=80)
plt.hlines(y = df.RIDAGEYR, xmin=0, xmax = df.BPXSY1_n, color=df.colors, linewidth=3)
plt.gca().set(ylabel="RIDAGEYR", xlabel = "BPXSY1_n")
plt.yticks(y_ticks, fontsize=14)
plt.grid(linestyle='--', alpha=0.5)
plt.show()

Image for post

The variation of systolic blood pressure with age is evident. Overall, systolic blood pressure goes up with age, doesn’t it?
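A visual impression like this can be backed up with a quick numeric check: group the ages into bands and compare the average reading in each band. The sketch below shows only the mechanics. The data is synthetic and the upward trend is built in on purpose, so it says nothing about NHANES. The band edges are also my own assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Made-up ages and readings with an upward trend built in on purpose.
age = rng.integers(18, 81, size=1000)
systolic = 100 + 0.5 * age + rng.normal(0, 10, size=1000)
toy = pd.DataFrame({"RIDAGEYR": age, "BPXSY1": systolic})

# Group ages into bands and average the reading within each band.
bands = pd.cut(toy["RIDAGEYR"], bins=[17, 30, 45, 60, 80])
band_means = toy.groupby(bands, observed=True)["BPXSY1"].mean()
print(band_means.round(1))
```

On the real data, the same two lines with `df` in place of `toy` would give the average systolic pressure per age band.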

Conclusion

That’s all for today. There are so many cool visualization techniques available in different Python libraries. If you deal with data regularly, it is a good idea to know as many of them as possible. But remember, you do not need to memorize them. Just know about them and practice them a couple of times so that, whenever necessary, you can look them up on Google, in the documentation, or in articles like this one. I hope you will use these visualizations to do some cool work.

Feel free to follow me on Twitter and like my Facebook page.
