Groupby function in Pandas Library For Efficient Data Analysis

Pandas library in Python is an excellent tool for data analysis. It has numerous useful functions. One such useful function is ‘groupby’. It can be used in many different ways. In this article, I am going to show some use of this function with a simple example.

First, import the libraries’ pandas and NumPy. Then import the dataset. This dataset is not a real-world dataset. I made it up. This dataset shows weather data. But these codes and methods will work on any kind of data. 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_excel("weather_cities.xlsx")
df
Groupby dataset
  1. Group this dataset by cities and return the keys that are the list of the cities.
df.groupby('city').groups.keys()
'''OUtput
dict_keys(['Boston', 'Houston', 'Miami', 'Paris'])

2. Return the data of Miami only.

df.groupby('city').get_group('Miami')

3. Return the maximum temperature of each city.

df.groupby('city')['temperature'].max()
'''
output:
city
Boston 38
Houston 101
Miami 100
Paris 54
Name: temperature, dtype: int64
'''

With this same syntax, 13 other aggregation is possible. Here is the list:

sum()

mean()

size()

count()

std()

var()

sem()

first()

last()

nth()

min()

skew()

median()

4. Calculate the mean temperature for each day in each city.

s = df.groupby(['city', 'day']).temperature.mean()

5. Return the mean temperature for each day in Miami.

s.loc['Miami]
'''Output:
day
2017-01-01 90
2017-01-02 85
2017-01-03 87
2017-01-04 92
Name: temperature, dtype: int64
'''

6. Using the DataFrame s above, get the temperature of Miami in 2017–01–01.

s.loc['Miami', '2017-01-01']
'''Output is 90.'''

7. Make a table of average temperature and a median temperature of each city.

df.groupby('city')['temperature'].agg({'avg_temp': np.average, 'median_temp': np.median}

8. Present the average and median for both temperature and median for each city.

df.groupby('city')['temperature', 'windspeed'].agg({'average': np.average, 'median': np.median})

9. Build a table to present the maximum temperature and the size of the data for each city.

a = df.groupby("city")['temperature'].agg([lambda z: np.mean(z), 'size'])
a.columns = ["Maximum", 'Total']

Miami and Houston were the hottest cities. At the same time, Boston had a freezing temperature.

10. It is important to know that if we have enough data to make a decision. Make a plot to show how much data do we have for each city.

plt.clf()
df.groupby('city').size().plot(kind='bar')

11. Make a bar graph of minimum temperature in each city.

plt.clf()
df.groupby('day')['temperature'].min().plot(kind='bar')

I hope this was helpful. Please feel free to ask any questions. Please click here for the dataset.

Leave a Reply

Close Menu