Binning the data can be a very useful strategy while dealing with numeric data to understand certain trends. Sometimes, we may need an age range, not the exact age, a profit margin not profit, a grade not a score. The Binning of data is very helpful to address those. Pandas library has two useful functions cut and qcut for data binding. But sometimes they can be confusing. In this article, I will try to explain the use of both in detail.
Binning
To understand the concept of binning, we may refer to a histogram. I am going to use a student performance dataset for this tutorial. Please feel free to download the dataset from this link.
Import the necessary packages and the dataset now.
import pandas as pd import numpy as np import seaborn as sns
df = pd.read_csv('StudentsPerformance.csv')
Using the dataset above, make a histogram of the math score data:
df['math score'].plot(kind='hist')
We did not mention any number of bins here but behind the scene, there was a binning operation. Math scores have been divided into 10 bins like 20–30, 30–40. There are many scenarios where we need to define the bins discretely and use them in the data analysis.
qcut
This function tries to divide the data into equal-sized bins. The bins are defined using percentiles, based on the distribution and not on the actual numeric edges of the bins. So, you may expect the exact equal-sized bins in simple data like this one
pd.Series(pd.qcut(range(100), 4)).value_counts()
In this example, we just gave a range from 0 to 99 and asked the qcut function to divide it into 4 equal bins. It made 4 equal bins of 25 elements each. But when the data is bigger and the distribution is a bit complex, the value_counts in each bin may not be equal as the bins are defined using the percentiles.
Here are some example use cases of qcut:
Exercise 1: Generate 4 bins of equal distribution
The most simple use of qcut is, specifying the bins and let the function itself divide the data. Divide the math scores in 4 equal percentile.
pd.qcut(df['math score'], q=4)
The result is much bigger. I am showing just a part of it. Though the math score also ranges from 0 to 100, the bin edges are not as 25, 50, 75, and 100 as the previous demonstration. Because the population distribution is not so uniform. As the histogram above shows, the maximum students scored from 60 to 80, and the distribution is left-skewed.
Exercise 2. Check the value counts in each bin to examine the bin edges
pd.qcut(df['math score'], q=4).value_counts()
As per this table, 266 students fell in the 25 percentile range with a score of 0 to 57. 241 students are in 50 percentile with a score between 57 and 66. So many students in a small range of scores. The range is large at the end. The last quarter has a big range, from 77 to 100.
Exercise 3. Work on another simple qcut with a different number of bins
This time I want to add this as a column in the DataFrame and include one parameter called precision. Precision defines how many decimal points to use while calculating bin precision.
df['math_score_7'] = pd.qcut(df['math score'], q=7, precision = 0)
Look at the right of the dataset. A new column showing the bins is added. Let’s check the value_counts of the bins
These bin labels are automatically calculated by pandas behind the scene. So, they can be harder to explain to a client. This is also harder to use when we need a specific division. But there is a solution to that. qcut allows explicit labels for the bins.
As the dataset has too many columns that we do not need for this tutorial, I am deleting some columns. Because we will add some columns while working on the exercises.
df = df.drop(columns=['test preparation course', 'lunch', 'writing score', 'parental level of education'])
Exercise 4. Grade the students based on the distribution.
Make a list of grades that we may want to use. In the previous examples, we put the number of bins as the value of q. This time, we have to explicitly add the range for the grades. Look at the code below. We have five different grades.
labels_5 = ['C', "B-", 'B', 'A-', 'A']
df['math score_labels'] = pd.qcut(df['math score'], q=[0, 0.4, 0.5, 0.65, 0.80, 1], labels=labels_5)
One thing to remember here is that the proportions or percentages I added in the q list are not the math score. They are based on population distribution as I mentioned earlier. The grades can be assigned based on the scores easily using the cut method. I will show that in later examples.
5. Repeat the same exercise as the previous one without labels.
df['math score_no_labels'] = pd.qcut(df['math score'], q=[0, 0.4, 0.5, 0.65, 0.80, 1], labels=False, precision=1)
At the right, we have the new column ‘math score_no_labels’ that we just created. As I specified the precision as 1, it took one decimal point. 0.0, 1.0, 2.0,and 4.0 are automatically generated serials. It will look even clearer if we do the value_counts for the bins.
df['math score_no_labels'].value_counts()
When we put no_labels, we kind of get grades according to the distribution of the population.
6. Return the labels for each of the grades.
For this example, use the parameter retbins to return the threshold or the upper limit for each bin.
results, bin_edges = pd.qcut(df['math score'], q=[0, 0.4, 0.5, 0.65, 0.80, 1], labels=labels_5, retbins=True)
results_table = pd.DataFrame(zip(bin_edges, labels_5), columns=['Limits', 'Tier'])
cut
Another method for binning. But the concept is different. In qcut, when you pass q=4, it will try to divide the population equally and calculate the bin edges accordingly. But in the cut method, it divides the range of the data in equal 4 and the population will follow accordingly.
In exercise two above, when we passed q=4, the first bin was, (-.001, 57.0]. But if we use the cut method and pass bins=4, the bins thresholds will be 25, 50, 75, 100. Because the total score was 100. After that, it will automatically calculate the population that falls in those bins.
pd.cut(df['math score'], bins=4).value_counts()
It may be important to know the bin edges more precisely. If you look at the table above, one bin is (50, 75] and the next one is (75, 100]. A student needs to know if his or her score is 75, which grade he or she gets. It is important to know that ‘(‘ parenthesis means ‘not included’ and ‘]’ means included. So 75 is included in the (50, 75] bin and not included in the (75, 100].
Exercise: Define the grades of the students according to their grades.
Just for the sake of this exercise, assume that there is no failing grade. Define the grades like this: 0–40 is ‘C’, 40 -55 is ‘B-’, 55–65 is ‘B’, 65–75 is ‘A-’, and 75–100 is ‘A’.
cut_grades = ['C', "B-", 'B', 'A-', 'A']
cut_bins =[0, 40, 55, 65, 75, 100]
df['grades'] = pd.cut(df['math score'], bins=cut_bins, labels = cut_grades)
Now, compare this grading with the grading in qcut method. In the qcut method, you do not get to decide the scores to define the grades. You only get to decide the population proportion. The grades are decided based on the percentile. In the cut method, you decide the scores for a certain grade.
In this example, it may look like the cut method more appropriate. Because it makes more sense to grade based on the scores. But there are many other cases in the real world where binning based on the distribution is required. Hope there is no confusion in the concepts of cut and qcut anymore.