30 Very Useful Pandas Functions for Everyday Data Analysis Tasks

Python’s Pandas library is the most widely used data manipulation library in Python, and it is necessary for almost every aspect of data analysis or machine learning. Even if you are working mainly on data visualization or machine learning, some data manipulation will be involved anyway. In this article, I will list the Pandas functions that are necessary for everyday use and arguably will be enough to perform regular data manipulation tasks.

For this article, I will use a public dataset from Kaggle called the FIFA dataset.

The user license is mentioned here.

Please feel free to download the dataset from here.

Here’s where the fun begins!

I am importing the necessary packages and the dataset:

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)

Let’s start talking about the functions:

1. pd.read_csv, pd.read_excel

The first functions to mention are read_csv and read_excel. So far, I have used at least one of these functions in every project. The names are self-explanatory: they read a CSV or an Excel file into a pandas DataFrame. Here I am using the read_csv function to read the FIFA dataset:

df = pd.read_csv("fifa.csv")

I have a detailed video on the read_csv function. It has several great parameters that help to clean up the dataset a little bit as you read it.
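
For instance, here is a minimal sketch of a read that uses a few of those parameters (the column names come from this FIFA dataset; the exact choices and the df_small name are just for illustration):

# Read only a handful of columns and only the first 1,000 rows of the file
df_small = pd.read_csv("fifa.csv",
                       usecols=['short_name', 'age', 'nationality', 'value_eur'],
                       nrows=1000)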

A .txt file can also be read with the .read_csv() function using the following syntax:

data = pd.read_csv("file.txt", sep=" ")

If you have an Excel file instead of a CSV file, you will use pd.read_excel.
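
For example, reading a hypothetical fifa.xlsx workbook (the file name and sheet are assumptions for illustration; read_excel needs an Excel engine such as openpyxl installed):

# Read the first sheet of an Excel workbook into a DataFrame
df_excel = pd.read_excel("fifa.xlsx", sheet_name=0)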

It is also common to use the .head() function after read_csv or read_excel to take a look at the DataFrame. By default, it shows the first 5 rows of the DataFrame. Here I am showing the first five rows of the DataFrame df above:

df.head()

If you want a specific number of rows instead of five, you can specify that. If I wanted 7 rows, I would pass 7 to .head() as a parameter:

df.head(7)

2. df.columns

When you have a big dataset like this, it can be hard to see all the columns. Using the .columns attribute, you can print out all the columns of the dataset:

df.columns

Output:

Index(['Unnamed: 0', 'sofifa_id', 'player_url', 'short_name', 'long_name', 'age', 'dob', 'height_cm', 'weight_kg', 'nationality', 'club_name', 'league_name', 'league_rank', 'overall', 'potential', 'value_eur', 'wage_eur', 'player_positions', 'preferred_foot', 'international_reputation', 'weak_foot', 'skill_moves', 'work_rate', 'body_type', 'real_face', 'release_clause_eur', 'player_tags', 'team_position', 'team_jersey_number', 'loaned_from', 'joined', 'contract_valid_until', 'nation_position', 'nation_jersey_number', 'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic', 'gk_diving', 'gk_handling', 'gk_kicking', 'gk_reflexes', 'gk_speed', 'gk_positioning', 'player_traits', 'attacking_crossing', 'attacking_finishing', 'attacking_heading_accuracy', 'attacking_short_passing', 'attacking_volleys', 'skill_dribbling', 'skill_curve', 'skill_fk_accuracy', 'skill_long_passing', 'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed', 'movement_agility', 'movement_reactions', 'movement_balance', 'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength', 'power_long_shots', 'mentality_aggression', 'mentality_interceptions', 'mentality_positioning', 'mentality_vision', 'mentality_penalties',
'mentality_composure', 'defending_marking', 'defending_standing_tackle', 'defending_sliding_tackle', 'goalkeeping_diving', 'goalkeeping_handling', 'goalkeeping_kicking', 'goalkeeping_positioning', 'goalkeeping_reflexes'], dtype='object')

3. df.drop()

You can drop unnecessary columns using df.drop(). This dataset has so many columns, and we are not going to use all of them for this tutorial, so we can easily drop some:

df = df.drop(columns=['Unnamed: 0', 'weak_foot', 'real_face'])

I just dropped these three columns: ‘Unnamed: 0’, ‘weak_foot’, ‘real_face’.

4. len()

Python’s built-in len() provides the length of the DataFrame, that is, its number of rows. Let’s see an example:

len(df)

Output:

16155

This DataFrame has 16155 rows of data.

5. df.query()

You can filter or query a DataFrame using a boolean expression. I will use the ‘shooting’ and ‘passing’ columns for this example. Here I am checking for which rows ‘shooting’ is bigger than ‘passing’:

df.query("shooting > passing")

This will return only the rows where shooting is greater than passing.
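
You can also combine several conditions in one query string. A small sketch (the ‘age’ column exists in this dataset; the cutoff of 25 is arbitrary):

# Rows where shooting beats passing AND the player is younger than 25
df.query("shooting > passing and age < 25")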

6. df.iloc()

It takes row and column positions and gives you the corresponding subset of the DataFrame. Here I am taking the first 10 rows of data and the columns at positions 5 through 9 (the end of a slice is exclusive):

df.iloc[:10, 5:10]

7. df.loc()

This does almost the same operation as .iloc(), but here we can specify exactly which row indices we want, as well as the names of the columns we want in our subset. Here is an example:

df.loc[[3, 10, 14, 23], ['nationality', 'weight_kg', "height_cm"]]

Look at the row indices: we only have rows 3, 10, 14, and 23. Likewise, for columns, we only have the specified columns.
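
A handy related pattern is passing a boolean mask to .loc for the rows while still naming the columns. A small sketch using columns from this dataset:

# All Argentinian players, keeping only a few columns
df.loc[df['nationality'] == 'Argentina', ['short_name', 'age', 'club_name']]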

8. df[‘’].dtypes

Another very basic and widely used function, because it is necessary to know the data types of the variables before we dive into analysis, visualization, or predictive modeling. I am getting the data type of the ‘height_cm’ column using .dtypes here:

df.height_cm.dtypes

Output:

dtype('int64')

You have the option to get the data type of each and every column as well using this syntax:

df.dtypes

Output:

height_cm                     int64
weight_kg                     int64
nationality                  object
random_col                    int32
club_name                    object
league_name                  object
league_rank                 float64
overall                       int64
potential                     int64
value_eur                     int64
wage_eur                      int64
player_positions             object
preferred_foot               object
international_reputation      int64
skill_moves                   int64
work_rate                    object
body_type                    object
team_position                object
team_jersey_number          float64
nation_position              object
nation_jersey_number        float64
pace                        float64
shooting                    float64
passing                     float64
dribbling                   float64
defending                   float64
physic                      float64
cumsum_2                      int64
rank_calc                   float64
dtype: object

9. df.select_dtypes()

You can select the variables or columns of a certain data type using this function. For example, I want to select the columns with data types ‘int64’ only. Here is how to do that:

df.select_dtypes(include='int64')

We got all the columns that have the data type ‘int64’. If we use ‘exclude’ instead of ‘include’ in the ‘select_dtypes’ function, we will get the columns that do not have the data type ‘int64’:

df.select_dtypes(exclude='int64')

Here is part of the output. Look, these variables are not int64. You may think that the ‘random_col’ column holds integers, but if you check its data type, you will see that it looks like integers while its type is actually int32, not int64. Please feel free to check.
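
You can also pass a list of data types. A small sketch that keeps every numeric column, whether int64 or float64:

# Keep all int64 and float64 columns in one call
df.select_dtypes(include=['int64', 'float64'])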

10. df.insert()

As the name of the function suggests, it inserts a column at the specified position. To demonstrate, I will first create an array of random numbers with the same length as our DataFrame:

random_col = np.random.randint(100, size=len(df))

I will insert this array as a column in the DataFrame df at column position 3. Remember, the column index starts from zero.

df.insert(3, 'random_col', random_col)

Here is part of the DataFrame again:

df.head()

Look, the column ‘random_col’ is inserted at position three.

11. df[‘’].cumsum()

It provides you with the cumulative sum. Let me explain with an example. I am going to use the ‘value_eur’ and ‘wage_eur’ columns for this example. Here is the code:

df[['value_eur', 'wage_eur']].cumsum()

Output:

As you can see, each row holds the cumulative sum of that row and all the rows before it.
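
The dtypes output earlier showed a ‘cumsum_2’ column. One plausible way such a column could be created (an assumption on my part, not necessarily how that column was made) is to assign a cumulative sum back to the DataFrame:

# Add a running total of value_eur as a new column
df['cumsum_2'] = df['value_eur'].cumsum()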

12. df.sample()

When the size of the dataset is too big, you can take a representative sample from it to perform analysis and predictive modeling. That may save you some time. Also, too much data may sometimes ruin a visualization. We can use this function to get a certain number of data points or a certain fraction of the data. Here I am taking a random sample of 200 data points from the FIFA dataset:

df.sample(n = 200)

I am taking 25% of the FIFA dataset here:

df.sample(frac = 0.25)
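
If you need the same sample every time you rerun the code, you can fix the random seed (the seed value 42 here is arbitrary):

# A reproducible 25% sample
df.sample(frac = 0.25, random_state = 42)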

13. df[‘’].where()

This function helps you query a dataset based on a boolean condition. For example, the random_col we made before has values ranging from 0 to 99. Here is how we make a series showing which of them are bigger than 50:

df['random_col'].where(df['random_col'] > 50)

Output:

0         NaN
1         NaN
2        56.0
3         NaN
4         NaN
         ... 
16150    65.0
16151     NaN
16152     NaN
16153    57.0
16154     NaN
Name: random_col, Length: 16155, dtype: float64

Look, where the values do not meet the condition, that is, where the value is not greater than 50, it returns NaN. We can replace NaN with 0 or any other value using this syntax:

df['random_col'].where(df['random_col'] > 50, 0)

Output:

0         0
1         0
2        56
3         0
4         0
         ..
16150    65
16151     0
16152     0
16153    57
16154     0
Name: random_col, Length: 16155, dtype: int32

14. df[‘’].unique()

This is very useful when we have categorical variables. It is used to find the unique values of a column. Let’s see what the unique values of the ‘skill_moves’ column in our FIFA dataset are:

df.skill_moves.unique()

Output:

array([4, 5, 1, 3, 2], dtype=int64)

So, we have five unique values in the skill_moves column. If you only print out the head of the dataset to check the values of a column, you may not see all of its unique values. That is where the .unique() function comes in really handy.

15. df[‘’].nunique()

Another popular function. This function tells you how many unique values you have in a column. As an example, if you want to see how many different nationalities there are in this dataset, you can use this simple line of code:

df.nationality.nunique()

Output:

149

The great thing is, this function can be used on the total dataset as well to know the number of unique values in each column:

df.nunique()

Output:

height_cm                      48
weight_kg                      54
nationality                   149
random_col                    100
club_name                     577
league_name                    37
league_rank                     4
overall                        53
potential                      49
value_eur                     161
wage_eur                       41
player_positions              907
preferred_foot                  2
international_reputation        5
skill_moves                     5
work_rate                       9
body_type                       3
team_position                  29
team_jersey_number             99
nation_position                28
nation_jersey_number           26
pace                           74
shooting                       70
passing                        67
dribbling                      67
defending                      69
physic                         63
cumsum_2                    14859
rank_calc                     161
dtype: int64

Here we have the number of unique values in each column.

16. df[‘’].rank()

This function provides you with the rank based on a certain column. In the FIFA dataset, if we want to rank the players based on the ‘value_eur’ column, here is the syntax for that:

df['rank_calc'] = df["value_eur"].rank()

Using the line of code above, I created a new column named ‘rank_calc’. This new column will give you the ranks of each player based on the ‘value_eur’. The column will be added at the end by default. Please run the line of code by yourself to check.
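
By default, .rank() gives rank 1 to the smallest value and averages ties. If you would rather have the most valuable player ranked 1, a hedged variant is (the column name rank_desc is hypothetical):

# Rank in descending order of value_eur; ties share the lowest rank
df['rank_desc'] = df['value_eur'].rank(ascending=False, method='min')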


17. .isin()

I am going to make a subset of the dataset that contains only players of a few nationalities, using the .isin() function.

nationality = ["Argentina", "Portugal", "Sweden", "England"]
df[df.nationality.isin(nationality)]

If you run this code, you will see that the resulting dataset contains only those few countries mentioned in the list above.
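
To get the opposite subset, that is, every player who is not from those countries, you can negate the mask with ~ (a small sketch):

# Players whose nationality is NOT in the list above
df[~df.nationality.isin(nationality)]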

18. df.replace()

It does exactly what it sounds like: it replaces values. When we need to replace only one value, we simply pass the old value and the new value. Imagine we just found out that the ‘league_rank’ value 1.0 needs to be replaced by 1.1. Here is how to do that:

df.replace(1.0, 1.1)

Look at the league_rank column in the dataset now: 1.0 is replaced by 1.1. Note that, called on the whole DataFrame like this, replace swaps 1.0 for 1.1 in every column, not only in league_rank. If we need to change more than one value, we can pass a dictionary to replace, where the keys are the original values and the values are the replacements:

df.replace({1.0: 1.1,  4.0: 4.1, 3.0: 3.1})
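
If you want the replacement to touch only a single column such as league_rank, one way (a sketch, assigning the result back) is:

# Replace values in league_rank only, leaving other columns untouched
df['league_rank'] = df['league_rank'].replace({1.0: 1.1, 4.0: 4.1, 3.0: 3.1})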

19. df.rename()

It is used to rename one or more columns. Here I am changing the ‘weight_kg’ and ‘height_cm’ columns to “Weight (kg)” and “Height (cm)”:

df.rename(columns = {"weight_kg": "Weight (kg)", "height_cm": "Height (cm)"})

Very simple and useful!

20. .fillna()

Whenever you receive a big dataset in real life, there will be some null values in most cases. It is really hard to get a perfect dataset, so filling in the null values is part of your daily work if you are a data analyst or a data scientist. The .fillna() function replaces the null values with some other value of your choice. Here are some of the columns towards the end of the FIFA dataset:

Look, there are some null values in shooting, passing, defending, and some other columns. We really need to replace those null values with values of compatible data types before we start doing any predictive modeling or other data science tasks; otherwise, we may get errors. For example, in the ‘pace’ column the values should be numeric, but here and there you will see NaN values. The most generic but not very efficient way is to replace those NaN values with zeros. Here is the way to change all the NaN values of the ‘pace’ column to zeros:

df['pace'].fillna(0, inplace=True)

If you check, the NaN values in the pace column are now zeros. Any other NaN values in the pace column have been replaced by zeros as well.

As I mentioned before, replacing with zero may not be the most efficient way. You can replace the nulls with some other value of your choice. It is also common to replace them with the mean or the median. If we wanted to replace the NaN values of the pace column with the mean of the pace column, we would use this line of code instead:

df['pace'].fillna(df['pace'].mean(), inplace = True)
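
If several numeric columns need the same treatment, you can fill them in one go. A small sketch that fills a few of the rating columns with their own medians (the choice of columns and of the median is illustrative):

# Fill each listed column's NaNs with that column's median
cols = ['shooting', 'passing', 'dribbling', 'defending']
df[cols] = df[cols].fillna(df[cols].median())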

21. df.groupby()

This is the most popular function for data summarization. You can group the data by a certain variable and find useful information about those groups. For example, here I am grouping the data by nationality and calculating the total ‘value_eur’ for each nationality:

df.groupby("nationality")['value_eur'].sum()

Output:

nationality
Albania                25860000
Algeria                70560000
Angola                  6070000
Antigua & Barbuda       1450000
Argentina            1281372000
                        ...    
Uzbekistan              7495000
Venezuela              41495000
Wales                 113340000
Zambia                  4375000
Zimbabwe                6000000
Name: value_eur, Length: 149, dtype: int64

The sum of ‘value_eur’ for all the players of Albania is 25860000.

It is also possible to group by several variables and use several aggregate functions. Here we compute, for each nationality and league rank, the mean and median of value_eur and wage_eur:

df.groupby(['nationality', 'league_rank'])[['value_eur', 'wage_eur']].agg(['mean', 'median'])

Output:

22. .pct_change()

You can get the percent change from the previous value of a variable. For this demonstration, I will use the value_eur column and get the percent change from the previous row for each row of data. The first row is NaN because there is no previous value to compare with.

df.value_eur.pct_change()

Output:

0             NaN
1       -0.213930
2       -0.310127
3       -0.036697
4        0.209524
           ...   
16150    0.000000
16151    0.500000
16152   -0.500000
16153    0.000000
16154   -1.000000
Name: value_eur, Length: 16155, dtype: float64

You may not find this very important for this dataset. But think of financial data, especially daily stock market values: it is very handy to see the percent change from one day to the next.
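
As a tiny illustration with made-up daily closing prices (hypothetical numbers, not from the FIFA data):

# Day-over-day percent change of a small price series
prices = pd.Series([100, 102, 99, 105], name='close')
prices.pct_change()
# 0         NaN
# 1    0.020000
# 2   -0.029412
# 3    0.060606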


23. df.count()

It provides the number of non-null values in the DataFrame along the specified axis. When the axis is 0, it gives the count for each column:

df.count(0)

Output:

Unnamed: 0                 16155
sofifa_id                  16155
player_url                 16155
short_name                 16155
long_name                  16155
                           ...  
goalkeeping_diving         16155
goalkeeping_handling       16155
goalkeeping_kicking        16155
goalkeeping_positioning    16155
goalkeeping_reflexes       16155
Length: 81, dtype: int64

You can see the number of values in each column.

When the axis is 1, it gives the count for each row:

df.count(1)

Output:

0        72
1        72
2        72
3        72
4        71
         ..
16150    68
16151    68
16152    68
16153    68
16154    69
Length: 16155, dtype: int64

As you can see, the rows do not all have the same number of values. If you observe the dataset carefully, you will see that it has a lot of null values in several columns, and count() does not include those.

24. df[‘’].value_counts()

We can get the count of each category using this function. Here I am getting how many rows there are for each league_rank value:

df['league_rank'].value_counts()

Output:

1.0    11738
2.0     2936
3.0      639
4.0      603
Name: league_rank, dtype: int64

It returns the result sorted in descending order by default. If you want the result in ascending order, simply set ascending=True:

df['league_rank'].value_counts(ascending=True)

Output:

4.0      603
3.0      639
2.0     2936
1.0    11738
Name: league_rank, dtype: int64
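
If you want proportions instead of raw counts, value_counts also takes normalize=True (a small sketch):

# Share of rows in each league_rank instead of absolute counts
df['league_rank'].value_counts(normalize=True)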

25. pd.crosstab()

It gives you a frequency table, that is, a cross-tabulation of two variables. I am making a cross-tabulation of league_rank and international_reputation here:

pd.crosstab(df['league_rank'], df['international_reputation'])

So, we got the count of every combination of league_rank and international_reputation. We can see that the majority of players have both international_reputation and league_rank equal to 1.

It can be improved further. We can add margins in both directions, which give the totals, and we can also get normalized values if necessary:

pd.crosstab(df['league_rank'], df['international_reputation'], 
            margins = True,
            margins_name="Total",
            normalize = True)
 

26. pd.qcut()

This function bins the data into segments based on quantiles of the distribution, so each bin holds roughly the same number of rows. Here I am going to segment value_eur into 5 portions and see which portion each player falls into:

pd.qcut(df['value_eur'], q = 5)

Output:

0        (1100000.0, 100500000.0]
1        (1100000.0, 100500000.0]
2        (1100000.0, 100500000.0]
3        (1100000.0, 100500000.0]
4        (1100000.0, 100500000.0]
                   ...           
16150          (-0.001, 100000.0]
16151          (-0.001, 100000.0]
16152          (-0.001, 100000.0]
16153          (-0.001, 100000.0]
16154          (-0.001, 100000.0]
Name: value_eur, Length: 16155, dtype: category
Categories (5, interval[float64]): [(-0.001, 100000.0] < (100000.0, 230000.0] < (230000.0, 500000.0] < (500000.0, 1100000.0] < (1100000.0, 100500000.0]]

You can use value_counts on the line of code above to see how many players fall in each range:

pd.qcut(df['value_eur'], q = 5).value_counts()

Output:

(-0.001, 100000.0]          3462
(230000.0, 500000.0]        3305
(100000.0, 230000.0]        3184
(500000.0, 1100000.0]       3154
(1100000.0, 100500000.0]    3050
Name: value_eur, dtype: int64

As you can see, the counts are pretty close. By default, qcut tries to make the bins equally populated, although the counts are not always exactly equal, because repeated values cannot be split across bins.
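
You can also give the bins readable names with the labels parameter. A hedged sketch (the label names and the value_band column are arbitrary choices of mine):

# Same quantile bins, but labelled instead of shown as intervals
df['value_band'] = pd.qcut(df['value_eur'], q = 5,
                           labels = ['very low', 'low', 'mid', 'high', 'very high'])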

27. pd.cut()

Another method for binning. If we want to make 5 bins using cut, it will divide the entire value_eur range into five equal-width portions, and the population of each bin follows from that:

pd.cut(df['value_eur'], bins = 5).value_counts()

Output:

(-100500.0, 20100000.0]      16102 
(20100000.0, 40200000.0]        40 
(40200000.0, 60300000.0]        10 
(60300000.0, 80400000.0]         2 
(80400000.0, 100500000.0]        1 
Name: value_eur, dtype: int64

The interval covered by each bin is equal, but the population of each bin is very different.

28. df[‘’].describe()

This is a great function that provides some basic statistical measures. Here I am using the describe function on the wage_eur column:

df['wage_eur'].describe()

Output:

count     16155.000000
mean      13056.453110
std       23488.182571
min           0.000000
25%        2000.000000
50%        5000.000000
75%       10000.000000
max      550000.000000
Name: wage_eur, dtype: float64

As the output shows, we have eight different measures. Each of them is very significant.

29. nlargest and nsmallest

These give you the n rows with the largest or smallest values of a specified variable. As an example, I want to get the rows with the top 5 wage_eur:

df.nlargest(5, "wage_eur")

In the same way, I can make a subset of the dataset with the 5 smallest wage_eur values:

df.nsmallest(5, "wage_eur")

30. df.explode()

Explode can be useful when some rows contain a list of values. It is hard to analyze, visualize, or do predictive modeling when some columns hold integers and others hold lists. Explode breaks those lists down into separate rows. For example, look at this DataFrame:

df1 = pd.DataFrame({"city": ['A', 'B', 'C'],
                   "day1": [22, 25, 21],
                   'day2':[31, 12, 67],
                   'day3': [27, 20, 15],
                   'day4': [34, 37, [41, 45, 67, 90, 21]],
                   'day5': [23, 54, 36]})
df1

Let’s explode column day4:

df1.explode('day4').reset_index(drop=True)

Conclusion

Python’s Pandas library is huge, and there are many more functions. I chose some important ones for everyday work. If you know these well, you will be able to perform most analysis tasks successfully. Pandas has one more very useful function that I didn’t mention here: the .plot() function. You can plot using Pandas alone; it uses Matplotlib in the backend and returns the plot for you. I have a detailed tutorial on that here.
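
As a quick illustrative sketch of that (assuming Matplotlib is installed; the column and bin count are arbitrary):

# A histogram of wages, drawn by pandas through Matplotlib
df['wage_eur'].plot(kind='hist', bins=50, title='Wage distribution')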

Hopefully, this article was helpful.

Please feel free to follow me on Twitter and the Facebook page, and check out my new YouTube channel.

#DataScience #DataAnalytics #Pandas #Python #DataAnalysis
