Matplotlib is one of the most widely used visualization tools in Python. It is well supported in a wide range of environments, such as web application servers, graphical user interface toolkits, Jupyter Notebook, and the IPython shell.
Matplotlib Architecture
Matplotlib has three main layers: the backend layer, the artist layer, and the scripting layer. The backend layer has three interface classes: the figure canvas, which defines the area of the plot; the renderer, which knows how to draw on the figure canvas; and the event, which handles user inputs such as clicks. The artist layer knows how to use the renderer to draw on the canvas. Everything on a Matplotlib plot is an instance of an artist: the ticks, the title, the labels, and the plot itself are all individual artists. The scripting layer is a lighter interface and is very useful for everyday purposes. In this article, I will demonstrate all the examples using the scripting layer, and I used a Jupyter Notebook environment.
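To make the three layers a bit more concrete, here is a minimal sketch that draws a small plot without the scripting layer, using the backend and artist layers directly. The data values are made up for illustration, and the Agg canvas is only one of several available backends.

```python
from matplotlib.artist import Artist
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure

# Artist layer: the Figure is the top-level container for everything drawn.
fig = Figure(figsize=(4, 3))

# Backend layer: attach a figure canvas (here the non-interactive Agg canvas).
canvas = FigureCanvasAgg(fig)

ax = fig.add_subplot(111)
(line,) = ax.plot([1, 2, 3], [2, 4, 1])  # made-up data
title = ax.set_title("Drawn without pyplot")

# The canvas asks its renderer to draw every artist in the figure.
canvas.draw()

# The line and the title are both individual artists.
print(isinstance(line, Artist), isinstance(title, Artist))
```

The scripting layer (pyplot) does this bookkeeping for you, which is why the rest of the article only needs calls such as plt.title and plt.show.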
I suggest that you run every piece of code yourself if you are reading this article to learn.
Data Preparation
Data preparation is a common task before any data visualization or data analysis project, because data rarely comes in the form you want. I am using a dataset that contains Canadian immigration information. Import the necessary packages and the dataset first.
import numpy as np
import pandas as pd
df = pd.read_excel('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx',
sheet_name='Canada by Citizenship',
skiprows=range(20),
skipfooter=2)
df.head()
I am skipping the first 20 rows and the last 2 rows because they are just text, not tabular data. The dataset is too wide to show in a screenshot. To get an idea of the dataset, see the column names:
df.columns
#Output:
Index([ 'Type', 'Coverage', 'OdName', 'AREA', 'AreaName', 'REG', 'RegName', 'DEV', 'DevName', 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013], dtype='object')
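If you want to see what skipping leading and trailing rows does without downloading the Excel file, here is a small sketch on an in-memory text file. The file contents are hypothetical, and it uses read_csv rather than read_excel only so that it runs without a spreadsheet.

```python
import io

import pandas as pd

# Hypothetical file contents: two note lines, a small table, one footer line.
raw = "\n".join([
    "Immigration report",
    "Prepared for illustration only",
    "Country,1980,1981",
    "A,10,20",
    "B,30,40",
    "Source: made-up numbers",
])

# In read_csv, skipfooter is only supported by the Python parsing engine.
small = pd.read_csv(io.StringIO(raw), skiprows=2, skipfooter=1, engine="python")
print(small)
```

One difference to keep in mind: in this CSV sketch the year column names are strings, while in the Excel dataset used in this article they are integers.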
We are not going to use all the columns for this article. So, let’s get rid of the columns that we are not using to make the dataset smaller and more manageable.
df.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)
df.head()

Look at the columns. The column ‘OdName’ is actually the country name, ‘AreaName’ is the continent, and ‘RegName’ is the region of the continent. Change the column names to something more understandable.
df.rename(columns={'OdName':'Country', 'AreaName':'Continent', 'RegName':'Region'}, inplace=True)
df.columns
#Output:
Index([ 'Country', 'Continent', 'Region', 'DevName', 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013], dtype='object')
Now the dataset is easier to understand. We have Country, Continent, Region, and DevName, which says whether the country is developing or developed. Each year column contains the number of immigrants in that particular year. Now, add a ‘Total’ column that shows the total number of immigrants who came to Canada from each country from 1980 to 2013.
df['Total'] = df.sum(axis=1, numeric_only=True)  # sum only the numeric (year) columns

Look, a new column ‘Total’ is added at the end.
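As a variation on this step, you can sum only the year columns by selecting them explicitly, so the text columns never take part in the sum. This is a small sketch on a made-up DataFrame; the names toy and year_cols are hypothetical.

```python
import pandas as pd

# Made-up data with the same shape of problem: text columns plus year columns.
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Continent": ["X", "Y"],
    1980: [10, 30],
    1981: [20, 40],
})

year_cols = [1980, 1981]

# Select the year columns first, then sum across each row.
toy["Total"] = toy[year_cols].sum(axis=1)
print(toy)
```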
Check if there are any null values in any of the columns
df.isnull().sum()
It shows zero null values in all the columns. I always like to set a meaningful column as the index instead of just some numbers.
Set the ‘Country’ column as the index
df = df.set_index('Country')

This dataset was nice and clean to start with. So this was enough cleaning for now. If we need something else we will do that as we go.
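Here is a small sketch of the same two steps, the null check and the index change, on made-up data where one value really is missing. The DataFrame toy is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in the 1980 column.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    1980: [10, np.nan, 5],
    1981: [20, 40, 15],
})

# Count the null values in each column.
missing = toy.isnull().sum()
print(missing)

# With the country as the index, rows can be looked up by name.
toy = toy.set_index("Country")
value = toy.loc["C", 1981]
print(value)
```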
Plotting Exercises
We will practice several different types of plots in this article: line plot, pie plot, box plot, scatter plot, area plot, histogram, and bar plot.
First, import necessary packages
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
Choose a style so you do not have to work too hard to style the plot. Here are the styles available:
plt.style.available
#Output:
['bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark-palette', 'seaborn-dark', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'seaborn', 'Solarize_Light2', 'tableau-colorblind10', '_classic_test']
I am using the ‘ggplot’ style. Feel free to try any other style yourself.
mpl.style.use(['ggplot'])
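If you only want to try a style on one plot without changing it for the whole notebook, a style can also be applied temporarily. This is a minimal sketch with made-up numbers; in a script, add plt.show() at the end to display the figure.

```python
import matplotlib.pyplot as plt

# Apply the style only inside the with-block.
with plt.style.context("ggplot"):
    fig, ax = plt.subplots(figsize=(5, 3))
    ax.plot([1980, 1990, 2000], [5, 9, 7])  # made-up data
    ax.set_title("Styled with ggplot")
    # The ggplot style switches the background grid on.
    styled_grid = plt.rcParams["axes.grid"]

print(styled_grid)
```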
Line plot
It will be useful to see a country’s immigration trend to Canada by year. Make a list of the years 1980 to 2013.
years = list(map(int, range(1980, 2014)))
I picked Switzerland for this demonstration. Prepare the immigration data of Switzerland and the years.
df.loc['Switzerland', years]

Here is part of the data of Switzerland. It’s time to plot. It is very simple. Just call the plot function on the data we prepared. Then add title and the labels for the x-axis and y-axis.
df.loc['Switzerland', years].plot()
plt.title('Immigration from Switzerland')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')
plt.show()

What if we want to compare the immigration trends of several countries over the years? That’s almost the same as the previous example. Plot the number of immigrants from three South Asian countries, India, Pakistan, and Bangladesh, against the years.
ind_pak_ban = df.loc[['India', 'Pakistan', 'Bangladesh'], years]
ind_pak_ban.head()

Look at the format of the data. It is different from the data for Switzerland above. If we call the plot function on this DataFrame (ind_pak_ban), it will put the countries on the x-axis and draw one line per year, because the index goes on the x-axis and each column becomes a separate line. We need to transpose the dataset:
ind_pak_ban.T

This is only a part of the dataset. See how the format of the dataset changed. Now it will plot the years on the x-axis and the number of immigrants for each country on the y-axis.
ind_pak_ban.T.plot()

We did not have to mention the kind of the plot here because by default it plots the line plot.
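The effect of the transpose is easy to check on a tiny example. This sketch uses a made-up DataFrame (toy) with countries as rows and years as columns, like ind_pak_ban, and confirms that each line in the plot corresponds to one country.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: countries in the index, years in the columns.
toy = pd.DataFrame(
    {1980: [10, 30, 5], 1981: [20, 40, 15], 1982: [25, 35, 30]},
    index=["A", "B", "C"],
)

# After transposing, the years are the index (x-axis) and
# each country is a column (one line each).
ax = toy.T.plot(title="One line per country")
ax.set_xlabel("Years")
ax.set_ylabel("Number of immigrants")

line_labels = [line.get_label() for line in ax.get_lines()]
print(line_labels)
```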
Pie Plot
To demonstrate the pie plot, we will plot the total number of immigrants for each continent. We have the data for each country, so group the rows by continent and sum the number of immigrants within each group.
cont = df.groupby('Continent').sum(numeric_only=True)
Now, we have data that shows the number of immigrants for each continent. Please feel free to print this DataFrame to see the result. I am not showing it because it is horizontally too big to present here. Let’s plot it.
cont['Total'].plot(kind='pie', figsize=(7,7),
autopct='%1.1f%%',
shadow=True)
# plt.title('Immigration By Continents')  # uncomment to add a title
plt.axis('equal')
plt.show()

Notice that I had to use the ‘kind’ parameter. Except for the line plot, every plot type needs to be specified explicitly in the plot function. I am also introducing a new parameter, ‘figsize’, which determines the size of the plot.
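The group-then-plot pattern can be sketched on a few made-up rows. The country names and totals here are hypothetical; the point is only to show how groupby collapses countries into continents before the pie is drawn.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up totals for three countries on two continents.
toy = pd.DataFrame(
    {"Continent": ["Asia", "Asia", "Europe"], "Total": [30, 70, 50]},
    index=["A", "B", "C"],
)

# One row per continent, with the totals of its countries added up.
by_continent = toy.groupby("Continent")["Total"].sum()
print(by_continent)

# autopct formats the percentage label that is written on each wedge.
ax = by_continent.plot(kind="pie", figsize=(5, 5), autopct="%1.1f%%")
ax.set_ylabel("")  # hide the default 'Total' label on the side
```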
This pie chart is understandable. But we can improve it with a little effort. This time I want to choose my own colors and a start angle.
colors = ['lightgreen', 'lightblue', 'pink', 'purple', 'grey', 'gold']
explode=[0.1, 0, 0, 0, 0.1, 0.1]
cont['Total'].plot(kind='pie', figsize=(17, 10), autopct = '%1.1f%%', startangle=90, shadow=True, labels=None, pctdistance=1.12, colors=colors, explode = explode)
plt.axis('equal')
plt.legend(labels=cont.index, loc='upper right', fontsize=14)
plt.show()

Is this pie chart better? I liked it.
Box plot
We will first make a box plot of the number of immigrants from China.
china = df.loc[['China'], years].T
Here is our data. This is the box plot.
china.plot(kind='box', figsize=(8, 6))
plt.title('Box plot of Chinese Immigrants')
plt.ylabel('Number of Immigrants')
plt.show()

If you need a refresher on boxplots, please check this article: Understanding the Data Using Histogram and Boxplot With Example (towardsdatascience.com).
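As a quick reminder of what the box and whiskers represent, here is a sketch that computes the quartiles and the outlier limits by hand on made-up numbers. It assumes the default whisker setting in Matplotlib, where the whiskers reach to the most extreme data points within 1.5 times the interquartile range of the box.

```python
import pandas as pd

# Made-up values with one obvious outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# The box spans the first to the third quartile; the line inside is the median.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these limits are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```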
We can plot several boxplots in the same plot. Use the DataFrame ‘ind_pak_ban’ and make box plots of the number of immigrants of India, Pakistan, and Bangladesh.
ind_pak_ban.T.plot(kind='box', figsize=(8, 7))
plt.title('Box plots of Indian, Pakistani and Bangladeshi Immigrants')
plt.ylabel('Number of Immigrants')
plt.show()

Scatter Plot
A scatter plot is a good way to understand the relationship between two variables. Make a scatter plot to see the trend of the number of immigrants to Canada over the years.
For this exercise, I will make a new DataFrame that will contain the years as an index and the total number of immigrants each year.
totalPerYear = pd.DataFrame(df[years].sum(axis=0))
totalPerYear.head()

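We need to convert the years to integers. I also want to polish the DataFrame a bit to make it presentable: move the years out of the index into a column and give both columns readable names.
totalPerYear.index = list(map(int, totalPerYear.index))
totalPerYear.reset_index(inplace=True)
totalPerYear.columns = ['year', 'total']  # names used in the scatter plot call
totalPerYear.head()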

For the scatter plot, we need to specify which columns go on the x-axis and the y-axis.
totalPerYear.plot(kind='scatter', x = 'year', y='total', figsize=(10, 6), color='darkred')
plt.title('Total Immigration from 1980 - 2013')
plt.xlabel('Year')
plt.ylabel('Number of Immigrants')
plt.show()

Looks like there is a linear relationship between the years and the number of immigrants. Over the years the number of immigrants shows an increasing trend.
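One way to make such a trend visible is to fit a straight line and draw it over the points. This is not part of the original example; it is a small sketch using NumPy's polyfit on made-up numbers that happen to lie exactly on a line.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up yearly totals that grow by 50 each year.
years_toy = np.array([2000, 2001, 2002, 2003, 2004])
totals_toy = np.array([1000, 1050, 1100, 1150, 1200])

# Fit a degree-1 polynomial (a straight line) to the points.
slope, intercept = np.polyfit(years_toy, totals_toy, deg=1)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(years_toy, totals_toy, color="darkred")
ax.plot(years_toy, slope * years_toy + intercept, color="black")
ax.set_xlabel("Year")
ax.set_ylabel("Number of Immigrants")

print(round(slope, 2))
```

With real data the points will scatter around the line, and the slope gives a rough estimate of the average change per year.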
Area Plot
The area plot shows the area covered under a line plot. For this plot, I want to make a DataFrame containing the data for India, China, Pakistan, and France.
top = df.loc[['India', 'China', 'Pakistan', 'France'], years]
top = top.T

The dataset is ready. Now plot.
colors = ['black', 'green', 'blue', 'red']
top.plot(kind='area', stacked=False, figsize=(20, 10), color=colors)  # the keyword is 'color', not 'colors'
plt.title('Immigration trend from India, China, Pakistan, and France')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')
plt.show()

Remember to set the ‘stacked’ parameter to False, as above, if you want to see each country’s area plot individually. If you do not set the stacked parameter to False, the plot will look like this:

When the plot is stacked, it does not show each variable’s area on its own. Each area is stacked on top of the previous one.
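The difference is easy to see side by side on two made-up columns. In this sketch, the cumulative sum across columns gives the heights at which the stacked version draws the top of each area.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data for two hypothetical countries.
toy = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[1980, 1981, 1982])

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))

# Unstacked: each area starts at zero, so the areas overlap.
toy.plot(kind="area", stacked=False, ax=left, title="stacked=False")

# Stacked: each area starts where the previous one ends.
toy.plot(kind="area", stacked=True, ax=right, title="stacked=True")

# In the stacked plot, the top of B sits at A + B.
stacked_tops = toy.cumsum(axis=1)
print(stacked_tops)
```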
Histogram
The histogram shows the distribution of a variable. Here is an example:
df[2005].plot(kind='hist', figsize=(8,5))
plt.title('Histogram of Immigration from 195 Countries in 2005') # add a title to the histogram
plt.ylabel('Number of Countries') # add y-label
plt.xlabel('Number of Immigrants') # add x-label
plt.show()

We made a histogram to show the distribution of the 2005 data. The plot shows that Canada had about 0 to 5,000 immigrants from most countries. Only a few countries contributed around 20,000, and a few more sent around 40,000 immigrants.
Let’s use the ‘top’ DataFrame from the area plot example and plot each country’s distribution of the number of immigrants in the same plot.
top.plot.hist()
plt.title('Histogram of Immigration from Some Populous Countries')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')
plt.show()

In the previous histogram, we saw that Canada had 20000 and 40000 immigrants from a few countries. Looks like China and India are amongst those few countries. In this plot, we do not see the bin edges clearly. Let’s improve this plot.
Specify the number of bins and find out the bin edges
I will use 15 bins. I am introducing a new parameter here called ‘alpha’. The alpha value determines the transparency of the colors. For these types of overlapping plots, transparency is important to see the shape of each distribution.
count, bin_edges = np.histogram(top, 15)
top.plot(kind = 'hist', figsize=(14, 6), bins=15, alpha=0.6,
xticks=bin_edges)
plt.show()

I did not specify the colors. So, the colors came out differently this time. But see the transparency. Now, you can see the shape of each distribution.
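To see what np.histogram returns, here is a sketch on a handful of made-up values with 3 bins. It returns the count in each bin and the bin edges, and there is always one more edge than there are bins.

```python
import numpy as np

# Made-up values between 1 and 10.
values = np.array([1, 2, 2, 3, 5, 8, 9, 10])

# Three equal-width bins between the smallest and the largest value.
count, bin_edges = np.histogram(values, bins=3)

print(count)      # how many values fall in each bin
print(bin_edges)  # where each bin starts and ends
```

Passing the bin edges as xticks, as in the plot above, puts the tick marks exactly on the boundaries between the bars.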
Like the area plot, you can make a stacked plot of the histogram as well.
top.plot(kind='hist', figsize=(12, 6), bins=15, xticks=bin_edges, color=colors, stacked=True)
plt.title('Histogram of Immigration from Some Populous Countries')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')
plt.show()

Bar Plot
For the bar plot, I will use the number of immigrants from France per year.
france = df.loc['France', years]
france.plot(kind='bar', figsize = (10, 6))
plt.xlabel('Year')
plt.ylabel('Number of immigrants')
plt.title('Immigrants From France')
plt.show()

You can add extra information to the bar plot. This plot shows an increasing trend from 1997 onward for over a decade, which is worth pointing out. It can be done using the annotate function.
france.plot(kind='bar', figsize = (10, 6))
plt.xlabel('Year')
plt.ylabel('Number of immigrants')
plt.title('Immigrants From France')
plt.annotate('Increasing Trend', xy = (19, 4500), rotation= 23, va = 'bottom', ha = 'left')
plt.annotate('', xy=(29, 5500), xytext=(17, 3800), xycoords='data', arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='black', lw=1.5))
plt.show()
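One detail that is easy to miss: in a pandas bar plot the bars sit at positions 0, 1, 2, and so on, not at the year values. That is why the x-coordinates in the annotations above are small numbers such as 19 and 29 rather than years. This sketch on made-up data shows the positions and places an annotation the same way.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly counts.
toy = pd.Series([3, 4, 6, 9], index=[2010, 2011, 2012, 2013])

ax = toy.plot(kind="bar", figsize=(6, 4))

# The bars are placed at 0, 1, 2, 3; the years are only tick labels.
tick_positions = list(ax.get_xticks())
print(tick_positions)

# Text placed near the first bar, in bar-position coordinates.
note = ax.annotate("Increasing trend", xy=(0, 5), rotation=30,
                   va="bottom", ha="left")

# An arrow from the first bar toward the last bar (empty text, arrow only).
arrow = ax.annotate("", xy=(3, 9.5), xytext=(0, 4),
                    arrowprops=dict(arrowstyle="->", color="black", lw=1.5))
```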

Sometimes, showing the bars horizontally makes it more understandable. Showing a label on the bars can be even better. Let’s do it.
france.plot(kind='barh', figsize=(12, 16), color='steelblue')
plt.xlabel('Number of immigrants') # add x-label to the plot
plt.ylabel('Year') # add y-label to the plot
plt.title('Immigrants From France') # add title to the plot
for index, value in enumerate(france):
    label = format(int(value), ',')
    plt.annotate(label, xy=(value-300, index-0.1), color='white')
plt.show()

Isn’t it better than the previous one?
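As an alternative to placing each label by hand, newer Matplotlib versions (3.4 and later, as far as I know) provide a bar_label helper on the axes. This is a sketch on made-up numbers, not the method used above; the labels are placed just past the end of each bar.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly counts.
toy = pd.Series([1200, 2500, 4100], index=[2011, 2012, 2013])

ax = toy.plot(kind="barh", figsize=(6, 3), color="steelblue")
ax.set_xlabel("Number of immigrants")
ax.set_ylabel("Year")

# Format each value with a thousands separator, then attach one label per bar.
formatted = [format(int(v), ",") for v in toy]
texts = ax.bar_label(ax.containers[0], labels=formatted, padding=3)

label_strings = [t.get_text() for t in texts]
print(label_strings)
```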
In this article, we learned the basics of Matplotlib. This should give you enough knowledge to start using the Matplotlib library today.