An Ultimate Cheat Sheet for Stylish Data Visualization in Python's Seaborn Library

Seaborn is a Python data visualization library built on top of Matplotlib. What is so special about seaborn? Why do we need it when we already have Matplotlib? Matplotlib can serve your purpose. It has all the visualizations you need for a data storytelling project. But seaborn is special because it comes with a lot of built-in styles. Compared to an ordinary Matplotlib plot, an ordinary seaborn plot looks a lot nicer!

Also, the seaborn library has advanced visualization functions that are more expressive and convey more information more effectively.

A little bit of background. If you are new to data visualization in python or need a refresher on Matplotlib, please have a look at this article.

You can perform data visualization in pandas as well. When you call the plot() function in pandas, it uses Matplotlib in the backend. You will find a detailed guide to visualization in pandas in this article.

The refresher part is done. Let's dive into seaborn now.

I will start with the basic plots and slowly move to some more advanced ones.

I mostly used the built-in datasets, so they are easily available to anyone who has the seaborn library installed.

I will use the same variables over and over again to save time finding new datasets. My goal is to present a selection of visualization functions to you.

First import the necessary packages and the famous iris dataset:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

iris = sns.load_dataset('iris')
iris

I will start with a very basic scatter plot in Matplotlib and then in seaborn, to show the difference even in the most basic plot. Here is the basic scatter plot of sepal length vs sepal width in Matplotlib:

plt.scatter(iris.sepal_length, iris.sepal_width)

Here is the same basic plot in seaborn:

sns.set()
plt.scatter(iris.sepal_length, iris.sepal_width)

You can see that it added a style without even writing much extra code!

I will try to keep this as concise as possible. Most of the code is almost self-explanatory. If you are reading this to learn, please take the code, run it in your own notebook, change different options, and play with it. That's the only way to learn.

You already saw in the previous plot how the set() function applies the default seaborn style to a plot. Here is an example of the set_style() function.

sns.set_style('whitegrid')
plt.scatter(iris.sepal_length, iris.sepal_width)
plt.show()

The set_style() function has a few other style options: darkgrid, dark, white, and ticks. Please feel free to try them out.
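
To see what each style actually changes, one option is to inspect the settings seaborn would apply. The sketch below relies on sns.axes_style(), which returns the Matplotlib parameters for a named style without changing anything globally. Looking only at the grid setting is my own choice for illustration.

```python
import seaborn as sns

# The five built-in style names accepted by set_style()
styles = ["darkgrid", "whitegrid", "dark", "white", "ticks"]

# axes_style(name) returns a dict of Matplotlib rc parameters for that style
grid_by_style = {name: sns.axes_style(name)["axes.grid"] for name in styles}

for name, has_grid in grid_by_style.items():
    print(f"{name:10s} grid lines: {has_grid}")
```

The same dictionary holds the background and tick settings, so printing a whole sns.axes_style('ticks') is a quick way to compare two styles.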

The next plot will also be sepal length vs sepal width, but the petal length will be added to it as well. The size of the dots will change according to the petal length.

sns.set_style('darkgrid')
sns.set_context('talk', font_scale=1.1)
plt.figure(figsize=(8, 6))
# size= maps petal length to dot size; sizes= sets the smallest and largest dot
sns.scatterplot(x="sepal_length", y="sepal_width",
                size="petal_length", sizes=(20, 500),
                data=iris)
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.title("Sepal Length vs Sepal Width")
plt.show()

The bigger the dot, the greater the petal length.
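
As a rough check of how a numeric column turns into dot sizes, here is a minimal sketch on a tiny made-up table (the four rows are invented, not taken from iris). It assumes the size and sizes arguments of scatterplot and reads back the marker areas Matplotlib actually drew.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Tiny made-up table standing in for the iris columns used in the article
flowers = pd.DataFrame({
    "sepal_length": [5.1, 6.0, 6.7, 7.2],
    "sepal_width": [3.5, 2.9, 3.1, 3.0],
    "petal_length": [1.4, 4.5, 5.6, 6.1],
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(x="sepal_length", y="sepal_width",
                size="petal_length", sizes=(20, 500),
                data=flowers, ax=ax)

# Marker areas (in points squared) of the drawn dots, one per row
marker_sizes = ax.collections[0].get_sizes()
print(marker_sizes)
```

The smallest petal length should land on the lower bound passed in sizes and the largest on the upper bound, with the rest scaled in between.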

Another new function is introduced in this plot: set_context(). It controls the size of the lines, labels, and similar elements. This plot uses the 'talk' option. The 'paper', 'notebook', and 'poster' options are also available in the set_context() function. Please check them out.
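
To compare the contexts without redrawing a plot each time, a small sketch like the following may help. It uses sns.plotting_context(), which returns the scaled parameters for a context without applying them. The two keys shown are just a sample of what the dictionary contains.

```python
import seaborn as sns

# plotting_context(name) returns the scaled rc parameters without applying them
contexts = ["paper", "notebook", "talk", "poster"]
font_sizes = {name: sns.plotting_context(name)["font.size"] for name in contexts}
line_widths = {name: sns.plotting_context(name)["lines.linewidth"] for name in contexts}

for name in contexts:
    print(f"{name:9s} font.size={font_sizes[name]:.1f}  "
          f"lines.linewidth={line_widths[name]:.2f}")

# font_scale multiplies the font-related entries on top of the chosen context
scaled_talk_font = sns.plotting_context("talk", font_scale=1.1)["font.size"]
print(scaled_talk_font)
```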

One more variable can be added here comfortably. I will add species of the flower in this plot. The color of the dots will be different for different species.

sns.set_context('talk', font_scale=1.1)
plt.figure(figsize=(8, 6))
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.scatterplot(x="sepal_length", y="sepal_width",
                size="petal_length", data=iris,
                sizes=(20, 500), hue="species",
                alpha=0.6, palette="deep")
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.title("Sepal Length vs Sepal Width")
plt.legend(bbox_to_anchor=(1.01, 1), borderaxespad=0)
plt.show()

The relplot function is interesting and informative at the same time. Relplots can be line plots or scatter plots. Here is a line plot where each line is shown with its confidence band.

If you do not want the confidence band, add ci=None to the relplot call (errorbar=None in seaborn 0.12 and later).

I am not doing that, because I want the confidence band.

# kind='line' draws one line per species, each with its confidence band
sns.relplot(x="sepal_length", y="sepal_width",
            data=iris, kind='line', hue='species')
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.title("Sepal Length vs Sepal Width")
plt.show()

The distplot gives you the histogram, the distribution of a continuous variable. Note that distplot has been deprecated since seaborn 0.11 and histplot is its replacement, so the code here uses histplot. Here is a basic one with a density curve.

plt.figure(figsize=(8, 6))
sns.histplot(x=iris.sepal_length, kde=True)  # kde=True overlays the density curve
plt.show()

If you do not want that density curve, leave out kde (it is off by default in histplot). The next plot is a histogram with the variable on the vertical axis and without the density curve.

plt.figure(figsize=(8, 6))
sns.histplot(y=iris.sepal_length, color='red')  # y= puts the variable on the vertical axis
plt.show()

Histograms can be even more informative. You can make the histograms of a continuous variable segregated by a categorical variable. To demonstrate that I will use a different dataset.

tips = sns.load_dataset("tips")
tips.head()

This plot will show the distribution of the total bill for each day, segregated by gender.

g = sns.displot(
    tips, x="total_bill", col="day", row="sex",
    binwidth=3, height=3, facet_kws=dict(margin_titles=True))
g.fig.set_size_inches(18, 10)
g.set_axis_labels("Total Bill", "Frequency")

So, we have the distribution of the total bill segregated by the day of week and gender.

A similar type of relplot can be made as well. The following relplot is showing the scatter plots of total bill vs tip segregated by the day of the week and the time of the day.

sns.set_context('paper', font_scale=1.8)
sns.relplot(x='total_bill', y='tip', data=tips, hue="time", col='day', col_wrap=2)

The bar plot is another widely used and popular plot. From the tips dataset, I will put the 'size' variable on the x-axis and plot the total bill on the y-axis.

The total bill will be segregated by lunch and dinner time.

plt.figure(figsize=(8, 6))
# ci, errcolor and errwidth are the older spellings; seaborn 0.12+ prefers
# errorbar='sd', and 0.13+ prefers err_kws for the error bar styling
sns.barplot(x='size', y='total_bill', hue='time',
            palette='GnBu',
            data=tips, ci='sd',
            capsize=0.05,
            saturation=5,
            errcolor='lightblue',
            errwidth=2)
plt.xlabel("Size")
plt.ylabel("Total Bill")
plt.title("Total Bill per Party Size")
plt.show()

Notice, I used the palette as ‘GnBu’ here. There are several different palettes available in the seaborn library. Find different palette options on this page.

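If you are into statistics, you will love having the 'ci' option here; ci='sd' draws the error bars from the standard deviation around the mean. Otherwise, just turn it off with ci=None (errorbar=None in seaborn 0.12 and later).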

Count plots also look like bar plots, but they show the count of observations in each category.

plt.figure(figsize=(8, 6))
sns.countplot(x='day', data=tips)
plt.xlabel("Day")
plt.title("Number of Records Per Day of Week")
plt.show()

This plot shows how many records are available for each day of the week. The 'hue' parameter can also be used here to segregate the counts by another categorical variable. I am taking the 'time' variable.
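
A count plot is essentially a picture of value_counts(). The sketch below checks that on a small invented column of days: the bar heights should match the counts pandas reports.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up visit log with one row per visit
visits = pd.DataFrame({
    "day": ["Thur", "Fri", "Sat", "Sat", "Sun", "Sat", "Sun", "Thur", "Sat"],
})

# Number of rows per day, computed with pandas
counts = visits["day"].value_counts()
print(counts)

fig, ax = plt.subplots(figsize=(6, 4))
sns.countplot(x="day", data=visits, ax=ax)

# Heights of the drawn bars, sorted so the comparison ignores bar order
bar_heights = sorted(bar.get_height() for bar in ax.patches)
print(bar_heights)
```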

plt.figure(figsize=(8, 6))
sns.countplot(x='day', hue='time',
              palette='GnBu',
              data=tips)
plt.xlabel("Day")
plt.title("Number of Records Per Day of Week by Time")
plt.show()

The swarm plot makes sure data points do not overlap. More explanation after the plot.

plt.figure(figsize=(8, 6))
sns.set_style('whitegrid')
sns.swarmplot(x='size', y='total_bill', data=tips)
plt.xlabel("Size")
plt.ylabel("Total Bill")
plt.title("Total bill per size of the table")
plt.show()

When the size is 1, there are only a few dots, and they sit on the same vertical line without overlapping. But when the size is 2, many data points share nearly the same position, so swarmplot shifts the dots sideways a little so they do not overlap each other.

This looks nice and also gives a better idea of how many data points there are at each position when the dataset is not too large. If the dataset is too large, swarmplot does not scale well.

In the next plot, I will add a ‘hue’ parameter that will show different colors for different genders.

# Set the style and font scale together; a later sns.set() call would reset the style
sns.set(style='whitegrid', font_scale=1.5)
plt.figure(figsize=(10, 6))
sns.swarmplot(x='size', y='total_bill', data=tips, hue="sex")
plt.xlabel("Size")
plt.ylabel("Total Bill")
plt.legend(title="Sex", fontsize=14)
plt.show()

The two genders can be separated into their own swarms as well:

sns.set(style='whitegrid', font_scale=1.5)
plt.figure(figsize=(10, 6))
# dodge=True draws a separate swarm per hue level (older seaborn called this split=True)
sns.swarmplot(x='size', y='total_bill', data=tips, hue="sex", dodge=True)
plt.xlabel("Size")
plt.ylabel("Total Bill")
plt.legend(title="Sex", fontsize=14)
plt.show()

In this plot, there are separate swarms for males and females.

There is another plot called the catplot (named factorplot in older seaborn versions). With kind="swarm" it is the same as a swarmplot, but drawn on a facet grid. You can add multiple variables and present more information.

g = sns.catplot(x='size', y="tip",
                data=tips, hue="time",
                col="day", kind="swarm",
                col_wrap=2, height=4)  # height replaces the old size argument
g.fig.set_size_inches(10, 10)
g.set_axis_labels("Size", "Tip")
plt.show()

This plot shows the tip amount per size for each day of the week and different colors represent different times of the meal. So much information packed in one plot!

Pointplot can be very informative and more useful than bar plots. Here is a pointplot that shows the tip amount per day of the week. I will explain some more after the plot.

plt.figure(figsize=(8, 6))
sns.pointplot(x="day", y="tip", data=tips)
plt.xlabel("Day")
plt.ylabel("Tip")
plt.title("Tip Per Day of Week")
plt.show()

The points here show the mean, and the vertical lines represent the confidence interval. Sometimes less is more. It is a simple and yet very informative plot.

A ‘hue’ parameter can be added here to show the tip per day of the week by another categorical variable. I used gender here.

plt.figure(figsize=(8, 6))
sns.pointplot(x="day", y="tip", hue="sex", data=tips, palette="Accent")
plt.xlabel("Day")
plt.ylabel("Tip")
plt.title("Tip Per Day of Week by Gender")
plt.show()

The difference in tip amount by gender shows up so clearly!

A regplot is actually a scatter plot that adds a linear regression line and a confidence band.

plt.figure(figsize=(8, 6))
sns.set_style('whitegrid')
sns.regplot(x='total_bill', y='tip', data=tips)
plt.xlabel("Total Bill")
plt.ylabel("Tip")
plt.show()
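
For readers who want to connect the drawn line to a formula, the sketch below compares it with an ordinary least-squares fit from np.polyfit on a handful of made-up bills and tips. Using polyfit for the comparison is my own choice; with regplot's default settings the two should agree.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up bills and tips
meals = pd.DataFrame({
    "total_bill": [10.0, 15.0, 20.0, 25.0, 30.0, 40.0],
    "tip": [1.5, 2.5, 3.0, 3.5, 5.0, 6.0],
})

# Ordinary least-squares fit of tip on total_bill (degree-1 polynomial)
slope, intercept = np.polyfit(meals["total_bill"], meals["tip"], deg=1)
print(f"tip = {slope:.3f} * total_bill + {intercept:.3f}")

fig, ax = plt.subplots(figsize=(6, 4))
sns.regplot(x="total_bill", y="tip", data=meals, ax=ax)

# Coordinates of the regression line that regplot drew
line_x, line_y = ax.lines[0].get_data()
```

The shaded band around the line comes from resampling the data, so its exact edges can vary slightly from run to run.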

The joint plot shows two different types of plots in one figure with just one line of code. By default, it puts a scatter plot in the center and the distributions of the x and y variables at the edges. The 'hue' parameter is optional here. Use it if you need it.

sns.set_style('dark')
g = sns.jointplot(x='total_bill', y='tip', hue='time', data=tips)
g.fig.set_size_inches(8, 8)
g.set_axis_labels("Total Bill", "Tip")
plt.show()

This plot is a scatter plot of total bill vs tip amount segregated by the ‘time’. Different colors show the different times of the meal. The side plots show the distributions of the total bill and tip amount for both lunch and dinner time.

If you do not like the default options, there are several other options available. Here I explicitly ask for kind='reg', which gives a scatter plot with a linear regression line and a confidence band.

sns.set_style('darkgrid')
g = sns.jointplot(x='total_bill', y='tip', data=tips, kind='reg')
g.fig.set_size_inches(8, 8)
g.set_axis_labels("Total Bill", "Tip")
plt.show()

Instead of a scatter plot, the next plot will be a kde plot:

sns.set_style('darkgrid')
g = sns.jointplot(x='total_bill', y='tip', data=tips, kind='kde')
g.fig.set_size_inches(8, 8)
g.set_axis_labels("Total Bill", "Tip")
plt.show()

Feel free to use a 'hue' parameter in this kde plot:

sns.set_style('darkgrid')
g = sns.jointplot(x='total_bill', y='tip', hue='time', data=tips, kind='kde')
g.fig.set_size_inches(8, 8)
g.set_axis_labels("Total Bill", "Tip")
plt.show()

To me, shaded kde plots are always more attractive than contour lines. There is a shaded kde plot below.

plt.figure(figsize=(8, 6))
sns.set_style('whitegrid')
# fill=True shades the density (older seaborn versions used shade=True)
g = sns.kdeplot(x='total_bill', y='tip', fill=True, data=tips)
plt.xlabel("Total Bill")
plt.ylabel("Tip")
plt.show()

The shaded plot shows the density of the data. I find it a bit more expressive.

Getting back to the jointplot, here is an example of a hexplot in a jointplot. Another beautiful plot.

sns.set_style('dark')
g = sns.jointplot(x='total_bill', y='tip', data=tips, kind='hex')
g.fig.set_size_inches(8, 8)
g.set_axis_labels("Total Bill", "Tip")
plt.show()

A hexplot is especially useful when the dataset is very large.

The jitter plot is a bit like the swarm plot shown earlier. This one also adjusts the coordinates of the dots a little to avoid too much clutter. But it is a bit different. In the swarm plot, not a single dot was on top of another one. In a jitter plot, the dots spread out only by a specified amount, so some can still overlap. The plot below specifies a jitter amount of 0.2. Because it is drawn with regplot, it also adds a linear regression line and a confidence band by default, which is nice!

plt.figure(figsize=(8, 6))
sns.set_style('whitegrid')
sns.regplot(x='size', y='total_bill', data=tips, x_jitter=0.2)
plt.xlabel("Size")
plt.ylabel("Total Bill")
plt.show()

Notice that the x-axis here holds a discrete variable (the party size), which is where jitter helps the most.
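
To see how far the dots really move, here is a small sketch on invented party sizes and bills. It reads back the x positions of the drawn dots and compares them with the values stored in the data. According to the seaborn documentation, the noise only affects how the scatter looks and not the regression fit.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up party sizes and bills
parties = pd.DataFrame({
    "size": [2, 2, 2, 2, 3, 3, 4, 4],
    "total_bill": [12.0, 15.5, 18.0, 21.0, 24.0, 30.0, 35.0, 41.0],
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.regplot(x="size", y="total_bill", data=parties, x_jitter=0.2, ax=ax)

# x positions of the drawn dots versus the x values stored in the data
drawn_x = np.asarray(ax.collections[0].get_offsets())[:, 0]
shift = np.abs(drawn_x - parties["size"].to_numpy())
print(shift.max())
```

Every shift should stay within the 0.2 passed as x_jitter, and the exact values change on each run because the noise is random.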

The lmplot is a combination of regplot and facet grid. This plot can show the linear regression line and confidence band for each conditional group. It may sound a bit obscure. Please look at this plot.

sns.set(font_scale=1.5)
sns.lmplot(x='total_bill', y='tip', data=tips,
           hue='time')
plt.gcf().set_size_inches(12, 8)
plt.xlabel("Total Bill")  # x is total_bill
plt.ylabel("Tip")         # y is tip
plt.show()

Look, there are regression lines for both lunch and dinner times.

It can be even more informative. The following lmplot is showing total bill vs tip per day.

g = sns.lmplot(x='total_bill', y='tip', col="day", hue = "day", 
          data=tips, col_wrap=2, height=4)
g.fig.set_size_inches(11, 11)
g.set_axis_labels("Total Bill", "Tip")
plt.show()

I am not going over the basic boxplot. Please have a look at my visualization tutorial with Pandas and Matplotlib mentioned at the beginning for a refresher on the basic plots. I love boxplots because they give you the distribution, median, IQR, and outliers all in the same plot. The next plot shows boxplots of the total bill per size.

sns.set(font_scale = 1.5)
sns.boxplot(x='size', y='total_bill', data=tips)
plt.gcf().set_size_inches(12, 8)
plt.xlabel("Size")
plt.ylabel("Total Bill")

If you need a reminder on how to extract all the information I mentioned before from a boxplot, please have a look at this article.
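
As a quick reminder of the numbers behind a box, here is a minimal sketch with pandas on a made-up series of nine bills. It assumes the common default of whiskers reaching 1.5 times the IQR beyond the quartiles, with anything outside those fences drawn as an outlier.

```python
import pandas as pd

# Made-up bills with one unusually large value
bills = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 30], name="total_bill")

# Quartiles: the box spans q1 to q3, and the line inside it is the median
q1, median, q3 = bills.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print(f"q1={q1}, median={median}, q3={q3}, iqr={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers.tolist()}")
```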

violinplot

Here is a basic violin plot.

ax = sns.violinplot(x=tips["total_bill"])

The violin plot shows the distribution of the data. You may think it is like a histogram then. Yes, but it can be more advanced. For example, the plot below shows the distribution of the total bill for each day, split by smokers and non-smokers.

plt.figure(figsize=(10, 7))
sns.violinplot(x='day', y='total_bill', hue="smoker",
              data=tips, palette="muted")
plt.xlabel("Day")
plt.ylabel("Total Bill")
plt.title("Total Bill per Day of the Week")
plt.show()

Instead of drawing two separate violins, the smoker and non-smoker portions can be shown on opposite sides of one violin. Look at this plot.

plt.figure(figsize=(10, 7))
sns.violinplot(x='day', y='total_bill', hue="smoker",
              data=tips, palette="muted", split=True)
plt.xlabel("Day")
plt.ylabel("Total Bill")
plt.title("Total Bill per Day of the Week")
plt.show()

Here the blue side shows the distribution of the total bill for smokers, and the yellow side is for non-smokers.

Violin plots can be combined with other types of plots. Here is an example where swarm plots are shown in the violin plots. It looks nice plus gives an idea of how much data are associated with these distributions.

plt.figure(figsize=(10, 6))
sns.violinplot(x='day', y='total_bill', inner=None,
               data=tips, palette="muted")
sns.swarmplot(x='day', y='total_bill',
              data=tips, color="k", alpha=0.9)
plt.ylabel("Total Bill")
plt.xlabel("Day")
plt.title("Total Bill per Day")
plt.show()

Heatmaps are used to show the correlation between variables. They are very useful in many areas of data science. In data storytelling projects, the heatmap is a popular element; in machine learning, it helps with choosing features.

This is a basic heatmap that shows the correlation between the total bill and tip amount.

sns.heatmap(tips[["total_bill", "tip"]].corr(), annot=True, 
            linewidths=0.9, linecolor="gray")
plt.show()
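
The heatmap only colors a table, so it can help to look at the table first. The sketch below builds a correlation matrix from a small made-up DataFrame and then draws it. Fixing vmin and vmax to -1 and 1 and using the 'coolwarm' colormap are my own choices, made so the colors mean the same thing from one heatmap to the next.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up numeric columns
meals = pd.DataFrame({
    "total_bill": [10.0, 15.0, 20.0, 25.0, 30.0, 40.0],
    "tip": [1.5, 2.5, 3.0, 3.5, 5.0, 6.0],
    "size": [1, 2, 2, 3, 4, 4],
})

# Pairwise Pearson correlations (the pandas default)
corr = meals.corr()
print(corr.round(2))

fig, ax = plt.subplots(figsize=(5, 4))
# annot=True writes each correlation value inside its cell
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
```

The matrix is symmetric with ones on the diagonal, which is why only half of the cells carry new information.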

Let’s go back to the iris dataset. It will be interesting to see the correlations between the sepal length and width, petal length, and width.

plt.figure(figsize=(8, 6))
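# Drop the text column first; newer pandas versions raise an error in corr() on non-numeric columns
sns.heatmap(iris.drop(columns='species').corr(), annot=True, linewidths=0.5, cmap='crest')
plt.show()

See the colormap. The darker the color, the stronger the correlation.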

We worked with facet-grid-style plots before but did not use the FacetGrid function directly. Here is an example of the FacetGrid function:

g = sns.FacetGrid(tips, col="time")
g.map(sns.scatterplot, "total_bill", "tip")
g.fig.set_size_inches(12, 8)
g.set_axis_labels("Total Bill", "Tip")
plt.show()

It can be further segregated by gender as well.

g = sns.FacetGrid(tips, col="time", row="sex")
g.map(sns.scatterplot, "total_bill", "tip")
g.fig.set_size_inches(12, 12)
g.set_axis_labels("Total Bill", "Tip")
plt.show()

The pair plot is another very useful plot. See an example for yourself first; I will explain after that.

df = sns.load_dataset('iris')
sns.set_style('ticks')
sns.pairplot(df, hue="species", diag_kind='kde', kind='scatter', palette='husl')
plt.show()

This plot shows the relationship between each pair of variables in a single figure. At the same time, it gives you the distribution of each continuous variable. We set hue='species' here to show different colors for different species. The information of so many individual plots is packed into this one plot.

This tutorial would not be complete without showing at least one time series heatmap.

I will use this dataset for the next plots.

Let’s import the dataset:

df = pd.read_csv("stock_data.csv", parse_dates=True, index_col = "Date")
df.head()

Our goal is to plot a heatmap of “Open” data by months and years. For that, we need to retrieve the months and years from the “Date” and make separate columns of ‘month’ and ‘year’.

df['month'] = df.index.month
df['year'] = df.index.year

If you recheck the dataset, you will find the 'month' and 'year' columns in it. Using the pandas pivot table function, build a table where the months are the index, the years are the columns, and the 'Open' data are the values.

import calendar
all_month_year_df = pd.pivot_table(df, values="Open",
                                   index=["month"],
                                   columns=["year"],
                                   fill_value=0,
                                   margins=True)
named_index = [[calendar.month_abbr[i] if isinstance(i, int) else i for i in list(all_month_year_df.index)]] # name months
all_month_year_df = all_month_year_df.set_index(named_index)
all_month_year_df
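
To see the pivot step on a scale small enough to check by hand, here is a sketch with six made-up opening prices. The dates and values are invented; only the column name "Open" and the month and year columns mirror the article. By default pivot_table averages the values that fall into the same cell.

```python
import pandas as pd

# Six made-up opening prices across two months and two years
prices = pd.DataFrame(
    {"Open": [10.0, 12.0, 20.0, 22.0, 30.0, 34.0]},
    index=pd.to_datetime(["2016-01-04", "2016-01-05", "2016-02-01",
                          "2017-01-03", "2017-02-01", "2017-02-02"]),
)
prices["month"] = prices.index.month
prices["year"] = prices.index.year

# Months become the rows, years the columns, and each cell the mean of "Open"
month_by_year = pd.pivot_table(prices, values="Open",
                               index="month", columns="year")
print(month_by_year)
```

January 2016 has two rows (10 and 12), so its cell holds their mean, 11.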

The dataset is ready to make our heatmap!

plt.figure(figsize=(10,10))
ax = sns.heatmap(all_month_year_df, cmap='GnBu', robust=True, fmt='.2f', 
                 annot=True, linewidths=.5, annot_kws={'size':11}, 
                 cbar_kws={'shrink':.8, 'label':'Open'})
ax.set_yticklabels(ax.get_yticklabels(), rotation=0, fontsize=10)
ax.set_xticklabels(ax.get_xticklabels(), rotation=0, fontsize=10)
plt.title('Average Opening', fontdict={'fontsize': 18}, pad=14)

A clustermap is also like a heatmap. By default it does not show the numbers, only the hierarchical clusters by color. Let's see an example:

sns.clustermap(all_month_year_df, linewidths=.5, cmap = "coolwarm")

Look at the x-tick and y-tick labels in this plot. They follow the hierarchy of the clusters, with 2017 the highest and 2016 the lowest. If you want to have the years in order, set 'col_cluster' equal to False.

sns.clustermap(all_month_year_df, linewidths=.5, cmap = "coolwarm", col_cluster=False)

Now, years are in order. You can get the months in order by setting ‘row_cluster’ equal to False as well. Please try that for yourself.

Please feel free to check this article to find a wide selection of time series data visualization options.

Congrats! If you really worked through all those plots today, you have come a long way! Seaborn is a huge library. Of course, these are not all of its plots, but this article covers a lot! There is much more to learn about this library. I am hoping to make more tutorials on more plots sometime in the future. Before that, please feel free to have a look at the advanced visualization in Python articles in the 'more reading' section below. They have more collections of advanced plots in Matplotlib and seaborn.

Feel free to follow me on Twitter and like my Facebook page.
