How to Create and Use Multi-Index DataFrame to Scale Up Your Data Analysis

How to Create and Use Multi-Index DataFrame to Scale Up Your Data Analysis

In most of the DataFrames, we see one index that works as a row identifier. But what if there is more than one index in a DataFrame. Is multi-index DataFrames a good thing for data analysis or confusing? Actually, it can be a very good data analysis, if you know how to use them. It gives you easy access to analyze the data even more efficiently.

To really improve and upscale your data analysis skill, it is important to learn about multi-index DataFrame well. In this article we will see:

How to create a multi-index DataFrame

How to use a multi-index DataFrame for efficient data analysis.

Use of the ‘groupby’ and aggregate functions on a multi-index DataFrame.

I also included the video version of all three topic mentioned above. If you like the video tutorials more please feel free to watch the video version.

Let’s start with the creation of multi-index DataFrames. As Data Analysts or Data scientists, we not only need to know the analysis but also need to learn how to store data in an organized way too.

How to create a multi-index DataFrame?

To create a multi-index DataFrame, I will first create the indices. I will generate a two-layered index. So, the index will look like this:

import pandas as pd
import numpy as npindex = [('Bob', 2019), ('Bob', 2020),
        ('Lilly', 2019), ('Lilly', 2010),
        ('Shelly', '2019'), ('Shelly', 2020)]

This index is showing the names in the first layer and years in the second layer.

In the index above, we mentioned six people. A list with the math scores of these six people is here:

Math = [98, 93, 78, 88, 92, 81]

We have the index and data both ready. We will make a Series with this Data first:

pop = pd.Series(Math, index=index)


(Bob, 2019)       98
(Bob, 2020)       93
(Lilly, 2019)     78
(Lilly, 2010)     88
(Shelly, 2019)    92
(Shelly, 2020)    81
dtype: int64

Notice we have each person’s score twice. The first one is for 2019 and the second one is for 2020.

Here I will add one more score information and make a DataFrame that will have math score and reading score:

pop_df = pd.DataFrame({'math': pop,
                      'reading': [88, 81, 73, 81, 91, 94]})

We did it using a long route. Here is the straightforward way:

df = pd.DataFrame(data,
                 index=[['Bob', 'Bob', 'Lilly', 'Lilly',
                        'Shelly', 'Shelly'], 
                        [2019, 2020, 2019, 2020, 2019, 2020]],
                  columns = ["Math", "Reading"]

It will produce exactly the same result as before.

The next example will show how to have two levels of the index and also two levels of columns. First I have the scores list here. Notice that the list is two-dimensional.

scores = [[90, 67, 89, 76, 93, 90], 
          [90, 67, 89, 76, 93, 90], 
          [90, 67, 89, 76, 93, 90], 
          [90, 67, 89, 76, 93, 90], 
          [90, 67, 89, 76, 93, 90],
          [90, 67, 89, 76, 93, 90]]

Here I am making the DataFrame using the scores list above. We will set two levels of indices and two levels of columns as well:

             index = [['school1', 'school1', 'school2', 'school2', 'school3', 'school3'], [2019, 2020, 2019, 2020, 2019, 2020]],
             columns = [['7th grade', '7th grade', '7th grade', '8th grade', '8th grade', '8th grade'], ['math', 'reading', 'writing','math', 'reading', 'writing']]

Here we stored 7th and 8th-grade math, reading, and writing scores from three schools and two years in a very organized way.

Here is the video version of the creation of the multi-index DataFrame:

Slicing of Multi-Index DataFrame

For this demonstration, we will use a fictional Dataset that can only be used for the DataScience Training purpose. Please feel free to download the dataset to follow along from this link:

Exam Scores

This is a fictional dataset and should only be used for data science training purposes. This data set includes scores…

Importing the dataset first:

df = pd.read_csv('StudentsPerformance.csv')

This dataset only has a numeric serial number as an index now. Let’s use some meaningful columns as indices. As we will work on multi-index DataFrame, I will use five layers of the index here:

mdf = df.set_index(['race/ethnicity', 'parental level of education', 'lunch','test preparation course','gender'])

We have race/ethnicity, parental level of education, lunch, test preparation course, and gender as indices.

These layers can be used to slice the dataset into granular portions under different conditions. Some examples are coming next.

First, we want to see the data where race/ethnicity is ‘group B’ only.

mdf.xs('group B')

Here is the part of the output:

This is the portion of the dataset that shows the group B data only. As we sliced the data applying the condition on race/ethnicity, race/ethnicity is not showing as an index anymore.

Further slicing the data using more conditions here. This time we will see the score of the students where race/ethnicity is group B, parental level of education is ‘some college’, and lunch is ‘standard’:

mdf.xs(('group B', 'some college', 'standard'))

This time we can share the full output.

Finally, let’s put even finer condition. Here we will see the scores of the students where rate/ethnicity is ‘group B’, parental level of education is ‘some college’, ‘lunch is ‘standard’, and test preparation course is complete and gender is male:

mdf.xs(('group B', 'some college', 'standard', 'completed', 'male'))

When the index value is used exactly in the same order as the dataset, you do not need to mention which level of the index value is which one. You can simply use the index value you are looking for.

But suppose you want to find the data where the test preparation course is complete only. You do not care about race/ethnicity or parental level of education or lunch. In that case, it is necessary to mention the level number or the level directly.

Here we want to see the data where the test preparation course is completed. The test preparation course is the fourth level of the index. So, we have to use level =3. Consider race/ethnicity as the index level 0.

mdf.xs('completed', level = 3)

Part of the output:

Or you can also use the name of the level explicitly as follows:

mdf.xs('completed', level = 'test preparation course')

Here I am using two levels of indices:

mdf.xs(('group A', 'completed'), level = ['race/ethnicity', 'test preparation course'])

Part of the output:

Here is the video version of slicing the multi-index DataFrame:

Here you will find another video on slicing and indexing using .loc(), iloc(), and IndexSlice that I did not show in this article. Please free to check it out:

Using the ‘groupby’ function on multi-index DataFrame

I assume you remember how useful groupby function is for Data Analysis. On multi-index DataFrame, it is even more efficient.

Here I am getting the mean math, reading, and writing scores for each race/ethnicity:

mdf.groupby(level = 0).mean()

Here we will advance a little bit and find the mean scores for individual gender of each race/ethnicity. Remember, race/ethnicity is the index level 0 and gender is the index level 4.

mdf.groupby(level = [0, 4]).mean()

You can also use the name of the index level like this to get the same result:

mdf.groupby(level = ['race/ethnicity', 'gender']).mean()

Before we got all three scores. We can also mention the specific score we want to see in the resulting data. I want to get the mean math and writing scores only for each gender of each race/ethnicity:

mdf.groupby(level = ['race/ethnicity', 'gender'])['math score', 'writing score'].mean()

In the above output dataset, we see only the mean reading and writing scores. Next, I also will add the median to the output dataset:

mdf.groupby(level = ['race/ethnicity', 'gender'])['math score', 'writing score'].agg(['mean', 'median'])

What if you need different types of aggregation of different scores. For example, here I am using mean and median on math score and min, max, and std on writing score.

We are also using three levels of indices here:

mdf.groupby(level = ['race/ethnicity', 'parental level of education', 'gender'])['math score', 'writing score'].agg(
        'math score': ['mean', 'median'],
        'writing score': ['min', 'max', 'std']

Part of the output:

Here is the video version of using the groupby function on multi-index DataFrame:


The examples and videos on this page should totally prepare you to use multi-index DataFrame with efficiency and level up your data analysis. It is a lot simpler and easier to get granular level access to the data with much simple and less code.

I hope it was helpful.

Please feel free to follow me on Twitter, and the Facebook page

#DataScience #DataAnalysis #Pandas #Python

Leave a Reply

Close Menu