Pandas is the most widely used Python library for data manipulation, and some data manipulation shows up in nearly every data analysis or machine learning project, even when the end goal is visualization or modeling. In this article, I will list the Pandas functions I use every day, which are arguably enough to handle most regular data manipulation tasks.
For this article, I will use a public dataset from Kaggle called the FIFA dataset.
The user license is mentioned here.
Please feel free to download the dataset from here.
Here’s where the fun begins!
I am importing the necessary packages and the dataset:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)
Let’s start talking about the functions:
1. pd.read_csv, pd.read_excel
The first functions to mention are read_csv and read_excel. So far, I have used at least one of them in every project. They are self-explanatory: they read a CSV or an Excel file into a pandas DataFrame. Here I am using the read_csv function to read the FIFA dataset:
df = pd.read_csv("fifa.csv")
I have a detailed video on the read_csv function. It has several useful parameters that help you clean up the dataset a bit while you are reading it.
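For example, here is a minimal sketch of a few of those parameters (the specific choice of columns, row count, and NA strings below is just illustrative):

df_small = pd.read_csv(
    "fifa.csv",
    usecols=["short_name", "age", "nationality", "value_eur"],  # load only these columns
    nrows=1000,              # read just the first 1,000 rows
    na_values=["", "N/A"],   # treat these strings as missing values
)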
A .txt file can also be read with the read_csv() function using the following syntax:
data = pd.read_csv("file.txt", sep=" ")
If you have an Excel file instead of a CSV file, you will use pd.read_excel.
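If I had an Excel version of this data, the call would look like the sketch below (the file name and sheet are hypothetical; reading .xlsx files typically requires an engine such as openpyxl to be installed):

df_xl = pd.read_excel("fifa.xlsx", sheet_name=0)  # sheet_name=0 reads the first sheet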
It is also common to call .head() right after read_csv or read_excel to peek at the DataFrame. By default, it shows the first 5 rows of the DataFrame. Here I am showing the first five rows of the DataFrame df above:
df.head()

If you want a specific number of rows instead of five, you can pass that number to .head() as a parameter. For example, to get 7 rows:
df.head(7)
2. df.columns
When you have a big dataset like this, it can be hard to see all the columns. Using the .columns attribute, you can print out all the columns of the dataset:
df.columns
Output:
Index(['Unnamed: 0', 'sofifa_id', 'player_url', 'short_name', 'long_name', 'age', 'dob', 'height_cm', 'weight_kg', 'nationality', 'club_name', 'league_name', 'league_rank', 'overall', 'potential', 'value_eur', 'wage_eur', 'player_positions', 'preferred_foot', 'international_reputation', 'weak_foot', 'skill_moves', 'work_rate', 'body_type', 'real_face', 'release_clause_eur', 'player_tags', 'team_position', 'team_jersey_number', 'loaned_from', 'joined', 'contract_valid_until', 'nation_position', 'nation_jersey_number', 'pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic', 'gk_diving', 'gk_handling', 'gk_kicking', 'gk_reflexes', 'gk_speed', 'gk_positioning', 'player_traits', 'attacking_crossing', 'attacking_finishing', 'attacking_heading_accuracy', 'attacking_short_passing', 'attacking_volleys', 'skill_dribbling', 'skill_curve', 'skill_fk_accuracy', 'skill_long_passing', 'skill_ball_control', 'movement_acceleration', 'movement_sprint_speed', 'movement_agility', 'movement_reactions', 'movement_balance', 'power_shot_power', 'power_jumping', 'power_stamina', 'power_strength', 'power_long_shots', 'mentality_aggression', 'mentality_interceptions', 'mentality_positioning', 'mentality_vision', 'mentality_penalties',
'mentality_composure', 'defending_marking', 'defending_standing_tackle', 'defending_sliding_tackle', 'goalkeeping_diving', 'goalkeeping_handling', 'goalkeeping_kicking', 'goalkeeping_positioning', 'goalkeeping_reflexes'], dtype='object')
3. df.drop()
You can drop unnecessary columns using df.drop(). This dataset has a lot of columns, and we are not going to use all of them in this tutorial, so we can easily drop a few:
df = df.drop(columns=['Unnamed: 0', 'weak_foot', 'real_face'])
I just dropped these three columns: ‘Unnamed: 0’, ‘weak_foot’, ‘real_face’.
4. len()
The built-in Python len() function gives you the length (the number of rows) of a DataFrame. Let’s see an example:
len(df)
Output:
16155
This DataFrame has 16155 rows of data.
5. df.query()
You can filter or query rows using a boolean expression. I will use the ‘shooting’ and ‘passing’ columns for this example. Here I am checking which rows have ‘shooting’ bigger than ‘passing’:
df.query("shooting > passing")
This returns only the rows where shooting is bigger than passing.
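query() also accepts compound conditions, so you can combine several column comparisons in one string. A small sketch using the existing shooting, passing, and age columns:

df.query("shooting > passing and age < 25")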
6. df.iloc[]
This indexer takes integer row and column positions and gives you the corresponding subset of the DataFrame. Here I am taking the first 10 rows of data and the columns at positions 5 through 9:
df.iloc[:10, 5:10]

7. df.loc[]
This indexer does almost the same thing as .iloc[], but here we can specify exactly which row index labels we want and the names of the columns for our subset. Here is an example:
df.loc[[3, 10, 14, 23], ['nationality', 'weight_kg', "height_cm"]]

Look at the row indices: we only have rows 3, 10, 14, and 23. Likewise, we only have the specified columns.
8. df[‘’].dtypes
Another very basic and widely used attribute, because it is necessary to know the data types of the variables before diving into analysis, visualization, or predictive modeling. I am getting the data type of the ‘height_cm’ column using .dtypes here:
df.height_cm.dtypes
Output:
dtype('int64')
You have the option to get the data type of each and every column as well using this syntax:
df.dtypes
Output:
height_cm int64
weight_kg int64
nationality object
random_col int32
club_name object
league_name object
league_rank float64
overall int64
potential int64
value_eur int64
wage_eur int64
player_positions object
preferred_foot object
international_reputation int64
skill_moves int64
work_rate object
body_type object
team_position object
team_jersey_number float64
nation_position object
nation_jersey_number float64
pace float64
shooting float64
passing float64
dribbling float64
defending float64
physic float64
cumsum_2 int64
rank_calc float64
dtype: object
9. df.select_dtypes()
You can select the columns of a certain data type using this function. For example, to select only the columns with data type ‘int64’:
df.select_dtypes(include='int64')

We got all the columns that have the data type ‘int64’. If we use ‘exclude’ instead of ‘include’ in select_dtypes, we get the columns that do not have the data type ‘int64’:
df.select_dtypes(exclude='int64')

Here is part of the output. Notice that none of these columns are int64. You may think that the ‘random_col’ column contains integers, and it does look like it, but its data type is actually int32, not int64. Please feel free to check.
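If you want several data types at once, select_dtypes accepts a list; there is also the general 'number' selector, which picks up any numeric dtype, including that int32 column:

df.select_dtypes(include=['int64', 'float64'])   # all int64 and float64 columns
df.select_dtypes(include='number')               # any numeric dtype, including int32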
10. df.insert()
As the name of the function suggests, it inserts a column at the specified position. To demonstrate, I will first create an array of random numbers with the same length as our DataFrame:
random_col = np.random.randint(100, size=len(df))
I will insert this array as a column in the DataFrame df at column position 3. Remember, the column index starts from zero.
df.insert(3, 'random_col', random_col)
Here is the part of the DataFrame again:
df.head()

Look, the column ‘random_col’ is inserted at position three.
11. df[‘’].cumsum()
It provides you with the cumulative sum. Let me explain with an example using the ‘value_eur’ and ‘wage_eur’ columns. Here is the code:
df[['value_eur', 'wage_eur']].cumsum()
Output:

As you can see, every row shows the cumulative sum of all the values up to and including that row.
12. df.sample()
When the dataset is too big, you can take a representative sample from it for analysis and predictive modeling, which may save you some time. Too much data can also clutter a visualization. We can use this function to get a certain number of data points or a certain fraction of the data, sampled at random. Here I am taking a random sample of 200 data points from the FIFA dataset:
df.sample(n = 200)
I am taking 25% of the FIFA dataset here:
df.sample(frac = 0.25)
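One small but useful option: passing random_state makes the sample reproducible, so you get the same rows every time you rerun the code. A quick sketch:

df.sample(n=200, random_state=42)
df.sample(frac=0.25, random_state=42)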
13. df[‘’].where()
This function helps you filter a Series based on a boolean condition. For example, the random_col we made before has values ranging from 0 to 99. Here is how we make a series showing which of them are bigger than 50:
df['random_col'].where(df['random_col'] > 50)
Output:
0 NaN
1 NaN
2 56.0
3 NaN
4 NaN
...
16150 65.0
16151 NaN
16152 NaN
16153 57.0
16154 NaN
Name: random_col, Length: 16155, dtype: float64
Look, where a value does not meet the condition (that is, it is not greater than 50), NaN is returned. We can replace NaN with 0 or any other value using this syntax:
df['random_col'].where(df['random_col'] > 50, 0)
Output:
0 0
1 0
2 56
3 0
4 0
..
16150 65
16151 0
16152 0
16153 57
16154 0
Name: random_col, Length: 16155, dtype: int32
14. df[‘’].unique()
This is very useful when we have categorical variables. It finds the unique values of a column. Let’s see the unique values of the ‘skill_moves’ column in our FIFA dataset:
df.skill_moves.unique()
Output:
array([4, 5, 1, 3, 2], dtype=int64)
So, we have five unique values in the skill_moves column. If we print out the head of the dataset, we may not see all the unique values. To find all of them, the .unique() function comes in really handy.
15. df[‘’].nunique()
Another popular function. It tells you how many unique values a column has. For example, if you want to see how many different nationalities are in this dataset, you can use this simple line of code:
df.nationality.nunique()
Output:
149
The great thing is, this function can also be used on the whole DataFrame to get the number of unique values in each column:
df.nunique()
Output:
height_cm 48
weight_kg 54
nationality 149
random_col 100
club_name 577
league_name 37
league_rank 4
overall 53
potential 49
value_eur 161
wage_eur 41
player_positions 907
preferred_foot 2
international_reputation 5
skill_moves 5
work_rate 9
body_type 3
team_position 29
team_jersey_number 99
nation_position 28
nation_jersey_number 26
pace 74
shooting 70
passing 67
dribbling 67
defending 69
physic 63
cumsum_2 14859
rank_calc 161
dtype: int64
Here we have the number of unique values in each column.
16. df[‘’].rank()
This function provides you with the rank based on a certain column. In the FIFA dataset, if we want to rank the players based on the ‘value_eur’ column, here is the syntax for that:
df['rank_calc'] = df["value_eur"].rank()
Using the line of code above, I created a new column named ‘rank_calc’ that gives each player’s rank based on ‘value_eur’. The new column is added at the end by default. Please run the line of code yourself to check.
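rank() also has a few options worth knowing. For instance, here is a sketch that ranks from highest to lowest value and uses dense ranking so tied players share a rank with no gaps (the ‘rank_desc’ column name is my own choice):

df['rank_desc'] = df['value_eur'].rank(ascending=False, method='dense')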
17. .isin()
I am going to make a subset of the dataset that will contain only a few nationalities of players using .isin() function.
nationality = ["Argentina", "Portugal", "Sweden", "England"]
df[df.nationality.isin(nationality)]
If you run this code, you will see that the resulting dataset contains only the few countries mentioned in the list above. You can see part of the dataset here:

18. df.replace()
It does exactly what it sounds like: it replaces values. When we need to replace a single value, we simply pass the old value and the new value. Note that called on the whole DataFrame, replace scans every column; a sketch after this section shows how to target just one column. Imagine we just found out that the ‘league_rank’ value 1.0 needs to be replaced by 1.1. Here is how to do that:
df.replace(1.0, 1.1)

Look at the league_rank column in the dataset now: 1.0 is replaced by 1.1. If we need to change more than one value, we can pass a dictionary to the replace function, where the keys are the original values and the values are the replacements.
df.replace({1.0: 1.1, 4.0: 4.1, 3.0: 3.1})

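As promised, here is a sketch of how to restrict the replacement to a single column (so 1.0 values elsewhere in the DataFrame are left alone), by passing a nested dictionary keyed by column name:

df.replace({'league_rank': {1.0: 1.1}})  # only touches the league_rank column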
19. df.rename()
It is used to rename columns. Here I am renaming the ‘weight_kg’ and ‘height_cm’ columns to “Weight (kg)” and “Height (cm)”:
df.rename(columns = {"weight_kg": "Weight (kg)", "height_cm": "Height (cm)"})

Very simple and useful!
20. .fillna()
Whenever you receive a big dataset in real life, it will have some null values in most cases; it is really hard to get a perfect dataset. So filling in null values is part of your daily work as a data analyst or data scientist. The .fillna() function replaces the null values with some other value of your choice. Here are some of the columns towards the end of the FIFA dataset:

Look, there are some null values in shooting, passing, defending, and some other columns. We really need to replace those null values with values of compatible data types before doing any predictive modeling or other data science tasks; otherwise, we may get errors. For example, the ‘pace’ column should be numeric, but here and there you will see NaN values. The most generic, but not the most efficient, way is to replace those NaN values with zeros. Here is how to change all the NaN values of the ‘pace’ column to zeros:
df['pace'].fillna(0, inplace=True)

If you notice, the NaN in the pace column is zero now. Any other NaN values in the pace column have been replaced by zeros as well.
As I mentioned before, replacing with zero may not be the most efficient way. You can replace with some other value of your choice; it is also common to use the mean or median. If we wanted to replace the NaN values of the pace column with the mean of the pace column, we would have used this line of code instead:
df['pace'].fillna(df['pace'].mean(), inplace = True)
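fillna() also accepts a dictionary, so you can fill several columns at once with different values. A minimal sketch (the choice of zero for pace and the column median for shooting is just illustrative):

df.fillna({'pace': 0, 'shooting': df['shooting'].median()})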
21. df.groupby()
This is the most popular function for data summarizing. You can group the data by a certain variable and find useful information about those groups. For example, here I am grouping the data by nationality and calculating the total ‘value_eur’ for each nationality:
df.groupby("nationality")['value_eur'].sum()
Output:
nationality
Albania 25860000
Algeria 70560000
Angola 6070000
Antigua & Barbuda 1450000
Argentina 1281372000
...
Uzbekistan 7495000
Venezuela 41495000
Wales 113340000
Zambia 4375000
Zimbabwe 6000000
Name: value_eur, Length: 149, dtype: int64
The sum of ‘value_eur’ for all the players of Albania is 25860000.
It is also possible to group by several variables and apply several aggregate functions. Here we get, for each combination of nationality and league rank, the mean and median value_eur and wage_eur:
df.groupby(['nationality', 'league_rank'])[['value_eur', 'wage_eur']].agg(['mean', 'median'])
Output:

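As a side note, pandas also supports named aggregation, which lets you give the resulting columns readable names. A small sketch (the output column names below are my own choices):

df.groupby('nationality').agg(
    total_value=('value_eur', 'sum'),
    mean_wage=('wage_eur', 'mean'),
)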
22. .pct_change()
You can get the percent change from the previous value of a variable. For this demonstration, I will use the value_eur column and get the percent change from the previous row for each row of data. The first row is NaN because there is no previous value to compare with.
df.value_eur.pct_change()
Output:
0 NaN
1 -0.213930
2 -0.310127
3 -0.036697
4 0.209524
...
16150 0.000000
16151 0.500000
16152 -0.500000
16153 0.000000
16154 -1.000000
Name: value_eur, Length: 16155, dtype: float64
You may not find this particularly important in this dataset. But think of financial data, especially daily stock market values: it is very useful to see the percent change from one day to the next.
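Here is a tiny sketch with made-up closing prices to show what pct_change looks like on that kind of data:

prices = pd.Series([100, 102, 99, 105])  # hypothetical daily closing prices
prices.pct_change()
# 0         NaN
# 1    0.020000
# 2   -0.029412
# 3    0.060606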
23. df.count()
It gives you the number of non-null values along the specified axis. When the axis is 0, it counts the non-null values in each column:
df.count(0)
Output:
Unnamed: 0 16155
sofifa_id 16155
player_url 16155
short_name 16155
long_name 16155
...
goalkeeping_diving 16155
goalkeeping_handling 16155
goalkeeping_kicking 16155
goalkeeping_positioning 16155
goalkeeping_reflexes 16155
Length: 81, dtype: int64
You can see the count for each column.
When the axis is 1, it counts the non-null values in each row:
df.count(1)
Output:
0 72
1 72
2 72
3 72
4 71
..
16150 68
16151 68
16152 68
16153 68
16154 69
Length: 16155, dtype: int64
As you can see, the rows do not all have the same number of values. If you observe the dataset carefully, you will see that it has a lot of null values in several columns.
24. df[‘’].value_counts()
We can get the count of each category using this function. Here I am getting how many rows there are for each league_rank:
df['league_rank'].value_counts()
Output:
1.0 11738
2.0 2936
3.0 639
4.0 603
Name: league_rank, dtype: int64
It returns the result sorted in descending order by default. If you want the result in ascending order, simply set ascending=True:
df['league_rank'].value_counts(ascending=True)
Output:
4.0 603
3.0 639
2.0 2936
1.0 11738
Name: league_rank, dtype: int64
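value_counts can also return proportions instead of raw counts if you pass normalize=True:

df['league_rank'].value_counts(normalize=True)  # share of rows per league_rank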
25. pd.crosstab()
It gives you a frequency table that cross-tabulates two variables. I am making a cross-tabulation of league_rank and international_reputation here:
pd.crosstab(df['league_rank'], df['international_reputation'])

So, we got the counts of all the combinations of league_rank and international_reputation. We can see that the majority of players have both international_reputation and league_rank equal to 1.
It can be improved further. We can add margins in both directions to get totals, and we can also normalize the values if necessary:
pd.crosstab(df['league_rank'], df['international_reputation'],
            margins=True,
            margins_name="Total",
            normalize=True)

26. pd.qcut()
This function bins the data into quantile-based segments, so each bin holds roughly the same number of rows based on the distribution of the data, and we get the value range for each player. Here I am going to segment value_eur into 5 portions and see which player falls in which portion:
pd.qcut(df['value_eur'], q = 5)
Output:
0 (1100000.0, 100500000.0]
1 (1100000.0, 100500000.0]
2 (1100000.0, 100500000.0]
3 (1100000.0, 100500000.0]
4 (1100000.0, 100500000.0]
...
16150 (-0.001, 100000.0]
16151 (-0.001, 100000.0]
16152 (-0.001, 100000.0]
16153 (-0.001, 100000.0]
16154 (-0.001, 100000.0]
Name: value_eur, Length: 16155, dtype: category
Categories (5, interval[float64]): [(-0.001, 100000.0] < (100000.0, 230000.0] < (230000.0, 500000.0] < (500000.0, 1100000.0] < (1100000.0, 100500000.0]]
You can call value_counts on the above line of code to see how many players fall in each range:
pd.qcut(df['value_eur'], q = 5).value_counts()
Output:
(-0.001, 100000.0] 3462
(230000.0, 500000.0] 3305
(100000.0, 230000.0] 3184
(500000.0, 1100000.0] 3154
(1100000.0, 100500000.0] 3050
Name: value_eur, dtype: int64
As you can see, the numbers are pretty close. By default, qcut tries to divide the rows equally among the bins, but the counts are rarely exactly equal because the distribution is usually not uniform.
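If the interval notation is hard to read, qcut can also attach labels to the bins. A small sketch (the label names here are my own choices):

pd.qcut(df['value_eur'], q=5,
        labels=['very low', 'low', 'mid', 'high', 'very high'])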
27. pd.cut()
Another method for binning. If we make 5 bins using cut, it divides the entire value_eur range into five equal-width portions, and the population of each bin follows from that:
pd.cut(df['value_eur'], bins = 5).value_counts()
Output:
(-100500.0, 20100000.0] 16102
(20100000.0, 40200000.0] 40
(40200000.0, 60300000.0] 10
(60300000.0, 80400000.0] 2
(80400000.0, 100500000.0] 1
Name: value_eur, dtype: int64
The interval width of each range is equal, but the population of each group is very different.
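You can also pass your own bin edges to cut() instead of a number of bins. A sketch with illustrative edges (include_lowest=True keeps any zero values in the first bin):

pd.cut(df['value_eur'],
       bins=[0, 100_000, 1_000_000, 10_000_000, 110_000_000],
       include_lowest=True).value_counts()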
28. df[‘’].describe()
This is a great function that provides some basic statistical measures. Here I am using the describe function on the wage_eur column:
df['wage_eur'].describe()
Output:
count 16155.000000
mean 13056.453110
std 23488.182571
min 0.000000
25% 2000.000000
50% 5000.000000
75% 10000.000000
max 550000.000000
Name: wage_eur, dtype: float64
As the output shows, we have eight different measures. Each of them is very significant.
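describe() also works on a whole DataFrame, and include='object' switches the summary to the text columns (count, unique, top, freq) instead of the numeric ones:

df.describe()                   # numeric columns by default
df.describe(include='object')   # summary of the text columns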
29. nlargest and nsmallest
This gives you the rows with the n largest or smallest values of a specified variable. As an example, I want to get the rows with the top 5 wage_eur:
df.nlargest(5, "wage_eur")

In the same way, I can make a subset of the dataset with the 5 smallest wage_eur values:
df.nsmallest(5, "wage_eur")

30. df.explode()
Explode can be useful when some rows contain lists of data. It is hard to analyze, visualize, or do predictive modeling when some columns contain integers and others contain lists. Explode breaks those lists down into separate rows. For example, look at this DataFrame:
df1 = pd.DataFrame({"city": ['A', 'B', 'C'],
"day1": [22, 25, 21],
'day2':[31, 12, 67],
'day3': [27, 20, 15],
'day4': [34, 37, [41, 45, 67, 90, 21]],
'day5': [23, 54, 36]})
df1

Let’s explode the column ‘day4’:
df1.explode('day4').reset_index(drop=True)

Conclusion
Python’s Pandas library is huge, with a great many functions. I chose some that are important in everyday work. If you know these well, you will be able to perform most analysis tasks successfully. Pandas has one more very useful function that I didn’t mention here: the .plot() function. You can plot using Pandas alone; it uses Matplotlib in the backend and returns the plot for you. I have a detailed tutorial on that here.
Hopefully, this article was helpful.
Please feel free to follow me on Twitter and the Facebook page, and check out my new YouTube channel.
#DataScience #DataAnalytics #Pandas #Python #DataAnalysis