Start Using Matplotlib Today With This Basic Visualization

Matplotlib is one of the most widely used visualization tools in Python. It is well supported in a wide range of environments, such as web application servers, graphical user interface toolkits, Jupyter Notebook, and the IPython shell.

Matplotlib Architecture

Matplotlib has three main layers: the backend layer, the artist layer, and the scripting layer. The backend layer has three interface classes: FigureCanvas, which defines the area of the plot; Renderer, which knows how to draw on the figure canvas; and Event, which handles user inputs such as clicks. The artist layer knows how to use the Renderer to draw on the canvas. Everything on a Matplotlib plot is an instance of an Artist: the ticks, the title, the labels, and the plot itself are all individual artists. The scripting layer is a lighter interface that is very useful for everyday purposes. In this article, I demonstrate all the examples using the scripting layer in a Jupyter Notebook environment.

I suggest that you run every piece of code yourself if you are reading this article to learn.

Data Preparation

Data preparation is a common task before any data visualization or data analysis project, because data never comes in the form you want. I am using a dataset that contains Canadian immigration information. Import the necessary packages and the dataset first.
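As a rough sketch of this loading step: the real file is not reproduced here, so the snippet below builds a tiny made-up CSV in memory (the note lines, the column subset, and the numbers are all hypothetical) and shows how `skiprows` and `skipfooter` drop text rows around a table. `pd.read_excel` accepts the same two arguments.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: two note lines, a header row,
# three data rows, and one footer line. All values are made up.
raw = "\n".join([
    "Notes about the data",
    "More notes",
    "OdName,AreaName,1980,1981",
    "Country A,Asia,10,20",
    "Country B,Asia,30,40",
    "Country C,Europe,50,60",
    "Footer note",
])

# skipfooter requires the Python parsing engine in read_csv.
df = pd.read_csv(io.StringIO(raw), skiprows=2, skipfooter=1, engine="python")
print(df.columns)
```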

I am skipping the first 20 rows and the last 2 rows because they are just text, not tabulated data. The dataset is too big to show in a screenshot, but to get an idea of it, see the column names:

#Output:
Index([    'Type', 'Coverage',   'OdName',     'AREA', 'AreaName',      'REG',
        'RegName',      'DEV',  'DevName',       1980,       1981,       1982,
             1983,       1984,       1985,       1986,       1987,       1988,
             1989,       1990,       1991,       1992,       1993,       1994,
             1995,       1996,       1997,       1998,       1999,       2000,
             2001,       2002,       2003,       2004,       2005,       2006,
             2007,       2008,       2009,       2010,       2011,       2012,
             2013],
      dtype='object')

We are not going to use all the columns for this article. So, let’s get rid of the columns that we are not using to make the dataset smaller and more manageable.
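Comparing the column listing above with the one further down suggests that ‘Type’, ‘Coverage’, ‘AREA’, ‘REG’, and ‘DEV’ are the columns being removed. A minimal sketch of that step, using a made-up one-row DataFrame that only mimics the column layout:

```python
import pandas as pd

# Made-up one-row frame; only the column names follow the article.
df = pd.DataFrame({
    "Type": ["t"], "Coverage": ["c"], "OdName": ["Country A"],
    "AREA": [1], "AreaName": ["Asia"], "REG": [2], "RegName": ["Region X"],
    "DEV": [3], "DevName": ["Developing"], 1980: [10], 1981: [20],
})

# Drop the columns that are not used in the rest of the walkthrough.
df = df.drop(columns=["Type", "Coverage", "AREA", "REG", "DEV"])
print(df.columns.tolist())
```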


Look at the columns. The column ‘OdName’ is actually the country name, ‘AreaName’ is the continent, and ‘RegName’ is the region of the continent. Change the column names to something more understandable.
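One way to do the renaming is `DataFrame.rename` with a mapping from old to new names. The small frame below is made up for illustration:

```python
import pandas as pd

# Made-up frame with the original column names.
df = pd.DataFrame({
    "OdName": ["Country A"], "AreaName": ["Asia"],
    "RegName": ["Region X"], "DevName": ["Developing"], 1980: [10],
})

# Map the cryptic names to readable ones; other columns stay untouched.
df = df.rename(columns={"OdName": "Country",
                        "AreaName": "Continent",
                        "RegName": "Region"})
print(df.columns)
```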

#Output:
Index([  'Country', 'Continent',    'Region',   'DevName',        1980,
              1981,        1982,        1983,        1984,        1985,
              1986,        1987,        1988,        1989,        1990,
              1991,        1992,        1993,        1994,        1995,
              1996,        1997,        1998,        1999,        2000,
              2001,        2002,        2003,        2004,        2005,
              2006,        2007,        2008,        2009,        2010,
              2011,        2012,        2013],
      dtype='object')

Now the dataset is more understandable. We have Country, Continent, Region, and DevName, which says whether the country is developing or developed. Each year column contains the number of immigrants in that particular year. Now, add a ‘Total’ column that shows the total number of immigrants who came to Canada from each country from 1980 to 2013.
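A minimal sketch of the row-wise sum, assuming the year columns are integers as in the column listing. The three-year frame and its numbers are made up:

```python
import pandas as pd

years = list(range(1980, 1983))  # the real data runs from 1980 to 2013
df = pd.DataFrame({
    "Country": ["Country A", "Country B"],
    1980: [10, 1], 1981: [20, 2], 1982: [30, 3],
})

# Sum across the year columns for each row (axis=1).
df["Total"] = df[years].sum(axis=1)
print(df)
```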


Look, a new column ‘Total’ is added at the end.

Check if there are any null values in any of the columns
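A common way to do this check is to count the nulls per column, sketched here on a made-up frame:

```python
import pandas as pd

# Made-up frame with no missing values.
df = pd.DataFrame({"Country": ["Country A", "Country B"],
                   1980: [10, 1], "Total": [10, 1]})

# isnull() marks missing cells; sum() counts them per column.
null_counts = df.isnull().sum()
print(null_counts)
```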

It shows zero null values in all the columns. I always like to set a meaningful column as the index instead of just some numbers.

Set the ‘Country’ column as the index
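A short sketch with made-up rows. With the country as the index, a row can then be looked up by name with `.loc`:

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({"Country": ["Country A", "Country B"],
                   "Continent": ["Asia", "Europe"],
                   "Total": [60, 6]})

df = df.set_index("Country")
print(df.loc["Country B"])
```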


This dataset was nice and clean to start with. So this was enough cleaning for now. If we need something else we will do that as we go.

Plotting Exercises

We will practice several different types of plots in this article: line plot, pie plot, box plot, scatter plot, area plot, histogram, and bar plot.

First, import the necessary packages.

Choose a style so you do not have to work too hard to style the plot. Here are the styles available:

#Output:
['bmh',
 'classic',
 'dark_background',
 'fast',
 'fivethirtyeight',
 'ggplot',
 'grayscale',
 'seaborn-bright',
 'seaborn-colorblind',
 'seaborn-dark-palette',
 'seaborn-dark',
 'seaborn-darkgrid',
 'seaborn-deep',
 'seaborn-muted',
 'seaborn-notebook',
 'seaborn-paper',
 'seaborn-pastel',
 'seaborn-poster',
 'seaborn-talk',
 'seaborn-ticks',
 'seaborn-white',
 'seaborn-whitegrid',
 'seaborn',
 'Solarize_Light2',
 'tableau-colorblind10',
 '_classic_test']

I am using the ‘ggplot’ style. Feel free to try any other style yourself.
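The import, the style listing, and the style choice can be sketched as follows. The exact list depends on the installed Matplotlib version; recent releases list the seaborn styles under names that start with `seaborn-v0_8`. The `matplotlib.use("Agg")` line is my addition so the snippet also runs as a plain script.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt

# List the style names that ship with the installed version.
print(plt.style.available)

# Apply one style to every plot created afterwards.
plt.style.use("ggplot")
```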

Line plot

It will be useful to see a country’s immigration trend to Canada by year. Make a list of the years 1980 to 2013.

I picked Switzerland for this demonstration. Prepare the immigration data of Switzerland and the years.


Here is part of the data of Switzerland. It’s time to plot. It is very simple. Just call the plot function on the data we prepared. Then add title and the labels for the x-axis and y-axis.

plt.show()
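Since only the last line of the plotting cell is shown, here is a minimal sketch of what the whole step might look like. The Switzerland numbers are made up, and the title and label strings are my own choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

years = list(range(1980, 2014))  # 1980 through 2013

# Made-up counts standing in for the Switzerland row of the real data.
switzerland = pd.Series([100 + 5 * i for i in range(len(years))],
                        index=years, name="Switzerland")

switzerland.plot()  # a line plot is the default kind
plt.title("Immigration from Switzerland")
plt.xlabel("Years")
plt.ylabel("Number of immigrants")
ax = plt.gca()  # keep a handle on the axes for later inspection
```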

What if we want to compare the immigration trends of several countries over the years? That is almost the same as the previous example. Plot the number of immigrants from three South Asian countries, India, Pakistan, and Bangladesh, against the years.


Look at the format of the data. It is different from the Switzerland data above: here each row is a country and each column is a year. If we call the plot function on this DataFrame (ind_pak_ban), pandas will put the countries on the x-axis and draw one line per year. We need to change the format of the dataset so that the years become the rows:


This is only part of the dataset. See how the format changed. Now it will plot the years on the x-axis and the number of immigrants for each country on the y-axis.
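One way to get this format is to transpose the DataFrame. The sketch below assumes that is the reshaping step used here. The numbers are made up, and `by_year` is a name I introduce for the transposed frame.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

years = list(range(1980, 1984))

# Made-up counts: one row per country, one column per year.
ind_pak_ban = pd.DataFrame(
    [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]],
    index=["India", "Pakistan", "Bangladesh"], columns=years)

# After transposing, the years are the index and each country is a column,
# so pandas draws one line per country against the years.
by_year = ind_pak_ban.T
ax = by_year.plot()
plt.xlabel("Years")
plt.ylabel("Number of immigrants")
```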


We did not have to specify the kind of plot here because a line plot is the default.

Pie Plot

To demonstrate the pie plot, we will plot the total number of immigrants for each continent. We have the data for each country, so group the rows by continent and sum the number of immigrants to get the total for each continent.

Now, we have data that shows the number of immigrants for each continent. Please feel free to print this DataFrame to see the result. I am not showing it because it is horizontally too big to present here. Let’s plot it.


Notice, I have to use the ‘kind’ parameter. Other than the line plot, all other plots need to be mentioned explicitly in the plot function. I am introducing a new parameter ‘figsize’ that will determine the size of the plot.
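The grouping and the basic pie plot can be sketched like this. The four-country frame is made up, and I sum only the ‘Total’ column to keep the example small.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up totals per country.
df = pd.DataFrame(
    {"Continent": ["Asia", "Asia", "Europe", "Africa"],
     "Total": [60, 40, 30, 20]},
    index=["Country A", "Country B", "Country C", "Country D"])

# One row per continent, holding the sum over its countries.
cont = df.groupby("Continent")[["Total"]].sum()

# 'kind' selects the plot type; 'figsize' is (width, height) in inches.
ax = cont["Total"].plot(kind="pie", figsize=(7, 7))
```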

This pie chart is understandable. But we can improve it with a little effort. This time I want to choose my own colors and a start angle.

plt.legend(labels=cont.index, loc='upper right', fontsize=14)
plt.show()
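Only the end of this cell is shown, so the following is a sketch of how custom colors and a start angle can be passed. The continent totals and the color names are made up, and the `autopct` percentage labels are my own addition.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up continent totals.
cont = pd.Series([20, 100, 30], index=["Africa", "Asia", "Europe"], name="Total")

cont.plot(kind="pie", figsize=(7, 7),
          colors=["lightgreen", "lightblue", "pink"],  # one color per wedge
          startangle=90,       # rotate the first wedge to start at the top
          autopct="%1.1f%%",   # print the share inside each wedge
          labels=None)         # hide wedge labels; the legend carries them
plt.axis("equal")              # keep the pie circular
plt.legend(labels=cont.index, loc="upper right", fontsize=14)
ax = plt.gca()
```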

Is this pie chart better? I liked it.

Box plot

First, we will make a box plot of the number of immigrants from China.

Here is our data. This is the box plot.
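A minimal sketch of a single box plot, with made-up yearly counts standing in for the China row:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

years = list(range(1980, 1990))

# Made-up counts; the value 30 is far enough out to be drawn as an outlier.
china = pd.DataFrame({"China": [5, 7, 6, 8, 9, 12, 11, 10, 30, 9]}, index=years)

ax = china.plot(kind="box")
plt.ylabel("Number of immigrants")
```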


If you need a refresher on boxplots, please check the article “Understanding the Data Using Histogram and Boxplot With Example” on towardsdatascience.com.

We can plot several boxplots in the same plot. Use the DataFrame ‘ind_pak_ban’ and make box plots of the number of immigrants of India, Pakistan, and Bangladesh.
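With one column per country, a single call draws one box per column. The sketch assumes the years are the rows of ‘ind_pak_ban’, and the numbers are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

years = list(range(1980, 1986))

# Made-up counts: years as rows, one column per country.
ind_pak_ban = pd.DataFrame({
    "India": [10, 12, 14, 13, 15, 18],
    "Pakistan": [4, 5, 7, 6, 8, 9],
    "Bangladesh": [1, 2, 2, 3, 4, 3],
}, index=years)

ax = ind_pak_ban.plot(kind="box", figsize=(8, 6))
plt.ylabel("Number of immigrants")
```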


Scatter Plot

A scatter plot is a good way to understand the relationship between two variables. Make a scatter plot to see the trend of the number of immigrants to Canada over the years.

For this exercise, I will make a new DataFrame that will contain the years as an index and the total number of immigrants each year.


We need to convert the years to integers. I want to polish the DataFrame a bit just to make it presentable.


For the scatter plot, we need to specify which columns go on the x-axis and the y-axis.

plt.show()
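The steps for the scatter plot can be sketched end to end as follows. The two-country frame is made up, and `total_per_year`, `year`, and `total` are names I chose for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

years = list(range(1980, 1986))

# Made-up counts: one row per country, one column per year.
df = pd.DataFrame([[1, 2, 3, 4, 5, 6], [2, 2, 3, 3, 4, 4]],
                  index=["Country A", "Country B"], columns=years)

# Sum down the rows: one total per year, with the years as the index.
total_per_year = pd.DataFrame(df[years].sum(axis=0))
total_per_year.index = total_per_year.index.map(int)  # make sure years are ints

# Tidy up: turn the index into a column and give both columns clear names.
total_per_year = total_per_year.reset_index()
total_per_year.columns = ["year", "total"]

ax = total_per_year.plot(kind="scatter", x="year", y="total", figsize=(10, 6))
plt.title("Total immigration per year")
```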

Looks like there is a linear relationship between the years and the number of immigrants. Over the years the number of immigrants shows an increasing trend.

Area Plot

The area plot shows the area covered under a line plot. For this plot, I want to make a DataFrame that includes the data for India, China, Pakistan, and France.


The dataset is ready. Now plot.

plt.title('Immigration trend from India, China, Pakistan, and France')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')
plt.show()
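Here is a sketch of an unstacked area plot, with made-up numbers in a frame named ‘top’ that has the years as rows and one column per country:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

years = list(range(1980, 1985))

# Made-up counts for the four countries.
top = pd.DataFrame({
    "India": [10, 12, 15, 18, 20],
    "China": [8, 9, 14, 19, 25],
    "Pakistan": [3, 4, 6, 7, 9],
    "France": [5, 5, 6, 6, 7],
}, index=years)

# stacked=False draws every country's area from the same zero baseline.
ax = top.plot(kind="area", stacked=False, figsize=(10, 6))
plt.ylabel("Number of Immigrants")
plt.xlabel("Years")
```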

Remember to use the ‘stacked’ parameter in the plot call above if you want to see each country’s area individually. If you do not set stacked to False, the plot is stacked by default: each country’s area is drawn on top of the previous one instead of starting from the baseline, so it does not show each variable’s own area.

Histogram

The histogram shows the distribution of a variable. Here is an example:

plt.show()
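A minimal sketch of a histogram of one year column. The ten values are made up, and the title and label strings are my own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up 2005 counts for ten hypothetical countries.
df = pd.DataFrame({2005: [10, 20, 30, 40, 500, 15, 25, 35, 45, 1000]},
                  index=[f"Country {i}" for i in range(10)])

# Each bar counts how many countries fall into that range of values.
ax = df[2005].plot(kind="hist", figsize=(8, 5))
plt.title("Immigration in 2005")
plt.xlabel("Number of immigrants")
```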

We made a histogram to show the distribution of the 2005 data. The plot shows that Canada had about 0 to 5000 immigrants from most countries. Only a few countries contributed around 20000, and a few others sent around 40000 immigrants.

Let’s use the ‘top’ DataFrame from the area plot example and plot the distribution of the number of immigrants for each country in the same plot.

plt.show()

In the previous histogram, we saw that Canada had 20000 and 40000 immigrants from a few countries. Looks like China and India are amongst those few countries. In this plot, we do not see the bin edges clearly. Let’s improve this plot.

Specify the number of bins and find out the bin edges

I will use 15 bins. I am introducing a new parameter here called ‘alpha’. The alpha value determines the transparency of the colors. For these types of overlapping plots, transparency is important to see the shape of each distribution.
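One way to get the bin edges is `np.histogram`, which returns the counts and the edges. The edges can then be passed as x-ticks so that the tick marks line up with the bins. The data in this sketch is made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

years = list(range(1980, 2014))
n = len(years)

# Made-up, steadily rising counts for the four countries.
top = pd.DataFrame({
    "India": [1000 + 60 * i for i in range(n)],
    "China": [800 + 70 * i for i in range(n)],
    "Pakistan": [200 + 30 * i for i in range(n)],
    "France": [150 + 10 * i for i in range(n)],
}, index=years)

# 15 bins over all values together give 16 bin edges.
count, bin_edges = np.histogram(top.to_numpy(), bins=15)

# alpha below 1 makes the bars see-through so overlapping shapes stay visible.
ax = top.plot(kind="hist", bins=15, alpha=0.35,
              xticks=bin_edges, figsize=(14, 6))
plt.xlabel("Number of immigrants")
```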


I did not specify the colors. So, the colors came out differently this time. But see the transparency. Now, you can see the shape of each distribution.

Like the area plot, you can make a stacked plot of the histogram as well.

plt.show()
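A sketch of the stacked variant, again with made-up numbers. The only change from an overlapping histogram is `stacked=True`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

years = list(range(1980, 1990))

# Made-up counts for three countries.
top = pd.DataFrame({
    "India": [10, 12, 15, 18, 20, 22, 25, 28, 30, 33],
    "China": [8, 9, 14, 19, 25, 27, 29, 31, 34, 36],
    "France": [5, 5, 6, 6, 7, 7, 8, 8, 9, 9],
}, index=years)

# Within each bin, the bars of the countries are piled on top of each other.
ax = top.plot(kind="hist", bins=5, stacked=True, figsize=(10, 6))
plt.xlabel("Number of immigrants")
```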

Bar Plot

For the bar plot, I will use the number of immigrants from France per year.

plt.show()
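A minimal sketch of the bar plot, with made-up yearly counts in a Series named ‘france’. The title and labels are my own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

years = list(range(1980, 1990))

# Made-up counts standing in for the France row.
france = pd.Series([1700, 2000, 2200, 1900, 1400, 1300, 1500, 1800, 2100, 2400],
                   index=years, name="France")

# One vertical bar per year.
ax = france.plot(kind="bar", figsize=(10, 6))
plt.title("Immigration from France")
plt.xlabel("Years")
plt.ylabel("Number of immigrants")
```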

You can add extra information to the bar plot. This plot shows an increasing trend for over a decade starting in 1997, which could be worth pointing out. It can be done using the annotate function.

plt.annotate('Increasing Trend',
            xy = (19, 4500),
            rotation= 23,
            va = 'bottom',
            ha = 'left')
plt.annotate('',
            xy=(29, 5500),
            xytext=(17, 3800),
            xycoords='data',
            arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='black', lw=1.5))
plt.show()

Sometimes, showing the bars horizontally makes it more understandable. Showing a label on the bars can be even better. Let’s do it.

for index, value in enumerate(france):
    label = format(int(value), ',')
    plt.annotate(label, xy=(value-300, index-0.1), color='white')
plt.show()
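The loop above needs a horizontal bar plot to draw on. A self-contained sketch, with three made-up values in ‘france’, might look like this:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts for three years.
france = pd.Series([1500, 2000, 2500], index=[2011, 2012, 2013], name="France")

ax = france.plot(kind="barh", figsize=(8, 4))  # 'barh' draws horizontal bars
plt.xlabel("Number of immigrants")

# Bar i sits at y-position i, so put each label just inside the bar's end.
for index, value in enumerate(france):
    label = format(int(value), ',')  # thousands separator, e.g. 1,500
    plt.annotate(label, xy=(value - 300, index - 0.1), color='white')
```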

Isn’t it better than the previous one?

In this article, we learned the basics of Matplotlib. This should give you enough knowledge to start using the Matplotlib library today.
