Data cleaning is an essential part of life for any data scientist, data analyst, or machine learning engineer. In the real world, it is very hard to find data that is totally clean and ready to use right away. Before diving into analysis, visualization, or machine learning, data cleaning is a necessary step in almost every real-world project.
This article focuses on general, very common data cleaning tasks. We will use a few different datasets to demonstrate different data cleaning processes.
No more introduction. Let’s get into the action!
First, I will start with the Olympics dataset (olympics.csv). It is a really common dataset for practice; I found it in a Coursera course and had to do some cleaning before use. Here is the link for the dataset.
Here I am importing the necessary packages, reading the dataset into a DataFrame with the ‘read_csv’ function, and displaying the first five rows with the .head() function.
import pandas as pd
import numpy as np

olympics = pd.read_csv("olympics.csv")
olympics.head()

Finding the right header
In the DataFrame above, the top row is clearly not the header. The actual header is the next row, at row index 1. The extra parameter ‘header=1’ in the read_csv method takes care of that. Here I am reading the dataset again:
olympics = pd.read_csv("olympics.csv", header=1)
olympics.head()

Renaming the columns
The column names are not very clean or understandable here. The first column contains all the country names, but it has a strange name! And where it says ‘01 !’, ‘02 !’, or ‘03 !’, it should say ‘Gold’, ‘Silver’, and ‘Bronze’. So, let’s rename the columns using the ‘rename’ function. The ‘rename’ function takes a dictionary where each key is an original column name and the corresponding value is the new column name.
col_names = {'Unnamed: 0': 'Country',
             '? Summer': 'Summer Olympics',
             '01 !': 'Gold',
             '02 !': 'Silver',
             '03 !': 'Bronze',
             '? Winter': 'Winter Olympics',
             '01 !.1': 'Gold.1',
             '02 !.1': 'Silver.1',
             '03 !.1': 'Bronze.1',
             '? Games': '# Games',
             '01 !.2': 'Gold.2',
             '02 !.2': 'Silver.2',
             '03 !.2': 'Bronze.2'}
olympics.rename(columns=col_names, inplace=True)

# Checking the olympics DataFrame with the new column names
olympics.head()

Column names are fixed!
Getting rid of extra tails in the text
Look at the country names in the first column. After each country name there is an abbreviation in brackets, and sometimes it even appears twice. That is unnecessary and will not look good in a visualization or any analysis, so it needs to be removed.
We will use the ‘applymap’ function for that. The ‘applymap’ function takes a function and applies it to every element of the column or columns we want to change. Here we define a function that takes a string as a parameter and removes everything from the first opening bracket ‘ (’ onward.
def get_country(st):
    if ' (' in st:
        return st[:st.find(' (')]
    else:
        return st
We will use this function as a parameter in the ‘applymap’ function.
pd.DataFrame(olympics['Country']).applymap(get_country)

As you can see, the country column now contains only the country names. If we had more than one column with the same cleaning requirement, we could use ‘applymap’ on all of them at the same time.
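For example, a minimal sketch, assuming a hypothetical second text column ‘Host City’ with the same bracketed tails (this column does not exist in the Olympics dataset; it is for illustration only):
cols = ['Country', 'Host City']  # 'Host City' is hypothetical
olympics[cols] = olympics[cols].applymap(get_country)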
This dataset is now ready for analysis or visualization!
My next dataset will be an arthritis dataset. I got this dataset as an assignment in one of my courses during my master’s program. It is in ‘arff’ format. Here is the link to this dataset.
If you are not familiar with this format, don’t worry about it. Our focus is on data cleaning. Here is how to load this ‘arff’ formatted dataset into a pandas DataFrame:
from scipy.io import arff

d = arff.loadarff("project-2018-BRFSS-arthritis.arff")
df = pd.DataFrame(d[0])

This dataset has 108 columns and 11933 rows. So, quite a big dataset!
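If you want to verify that yourself, the DataFrame’s shape attribute gives (rows, columns):
df.shape
# (11933, 108)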
Converting byte-type data to numeric data
The first enemy in this dataset is values like b'2'. These are bytes-type data, which are not friendly for any data-related project. We want simple numeric data; even when we get text data, we usually convert it to numeric somehow, right? So, this byte-type data needs to be converted to a more manageable format.
First I will decode them into a string form.
The following function takes a DataFrame as a parameter, checks if there are columns with byte type data, and then decodes them into a string type.
def convert_str(df):
    for i in df.columns:
        if type(df[i][0]) == bytes:
            df[i] = df[i].str.decode("utf-8")
    return df

df = convert_str(df)
df

The byte-type data is now string type. But there is still a problem. Look at the last value of the column above: it is a ‘?’. There may be more ‘?’ values in other locations, and in a big dataset we cannot check them one by one. So, using a loop, we will check each string column and replace any ‘?’ with an empty string.
for i in df.columns:
    if type(df[i][0]) == str:
        # regex=False so the '?' is treated as a literal character, not a regex
        df[i] = df[i].str.replace('?', '', regex=False)
df

Look at the last value of the first column again, that ‘?’ is gone!
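As a side note, assuming those question marks always stand alone in a cell (rather than being part of a longer string), a single whole-DataFrame call would do the same job:
# replace cells whose entire value is '?' with an empty string
df = df.replace('?', '')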
But that still does not fully solve the problem, because all those byte values were only converted into strings. Ideally, we need numeric data. Let’s now convert them to numeric data.
col = []
for i in df.columns:
    if type(df[i][0]) == str:
        col.append(i)

df[col] = df[col].apply(pd.to_numeric, errors='coerce')

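The errors='coerce' part is what turns the leftover empty strings into proper missing values instead of raising an error. A tiny standalone sketch of that behavior:
pd.to_numeric(pd.Series(['1', '', '2.5']), errors='coerce')
# 0    1.0
# 1    NaN
# 2    2.5
# dtype: float64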
The last row shows several null values, and the whole dataset may contain many more. We should check the count of null values in each column:
df.isnull().sum()
Output:
x.aidtst3 805
employ1 34
income2 114
weight2 180
height3 198
...
x.michd 114
x.ltasth1 0
x.casthm1 0
x.state 0
havarth3 0
Length: 108, dtype: int64
We have a lot of null values to deal with.
Dealing with the null values
There are a few commonly used ways to deal with null values. One way is to drop them all with this simple piece of code, which deletes every row that contains any null value:
df.dropna()
The problem is that this way you may be left with only a few rows of data. If the dataset is really large and you still have enough data to analyze after deleting the null values, it might be OK. But in general I do not encourage deleting whole rows of data like that; you may lose important information.
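If you do want to drop rows, a middle ground is the thresh parameter, which keeps a row only if it has at least that many non-null values. A minimal sketch:
# keep only the rows that have at least 100 of the 108 values present
df.dropna(thresh=100)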
Another way is to fill up all the null values with zeros like this:
df.fillna(0)
This will fill up all the null values with zeros. If that serves your purpose, go with it.
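The fillna function also accepts a dictionary, so different columns can get different fill values. A minimal sketch using two columns from this dataset:
df.fillna({"employ1": 0, "income2": df["income2"].median()})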
The null values can also be filled with the last valid value before them (a forward fill):
df.fillna(method="ffill")
In the same way, you can fill the null values with the next valid value after them (a backward fill):
df.fillna(method="bfill")
Those are the most common options. Lastly, I will show the way I like the most: I usually fill the null values with the median. It is also common to fill them with the mean or the mode. So, we will loop through the columns and fill the null values of each column with its median.
for i in df.columns:
    # assigning back instead of using inplace=True avoids
    # chained-assignment warnings in newer pandas versions
    df[i] = df[i].fillna(df[i].median())
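Since every column is numeric at this point, the same loop can also be written as a single call. A minimal equivalent sketch:
df = df.fillna(df.median(numeric_only=True))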
Let’s check the number of null values in each column again:
df.isnull().sum()
Output:
x.aidtst3 0
employ1 0
income2 0
weight2 0
height3 0
...
x.michd 0
x.ltasth1 0
x.casthm1 0
x.state 0
havarth3 0
Length: 108, dtype: int64
No more null values!
This dataset should be good to go now.
Next, I will use a dataset where every column needs some cleaning. This dataset was prepared specially for this tutorial; I added some very common types of problems to solve.
Here is the link to the dataset.
Here is what it looks like:
people = pd.read_csv("people.csv")
people

You can see almost every column has something to work on.
Removing the duplicate data
If you notice, some people appear twice: the ‘Alice’ and ‘Rumy’ rows are duplicates. That is common; the same entry may have been typed twice, or maybe two different people recorded the data. Let’s get rid of the duplicate data before any other data cleaning:
people = people.drop_duplicates(subset="Name")
people

Duplicates are gone! Let’s work on the columns one by one.
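By default, drop_duplicates keeps the first occurrence of each name. If you would rather keep the last entry (say, the most recent one), pass the keep parameter:
people.drop_duplicates(subset="Name", keep="last")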
Getting simple four-digit year data
Starting with the ‘Year’ column: some of the values are not simple 4-digit years. They carry extra characters that should be removed before analysis.
We will use a regular expression to get only the 4 digit year value:
year_ex = people["Year"].str.extract(r'^(\d{4})',
expand=False)
year_ex
Output:
0 1980
1 1978
2 1982
3 1992
4 1987
Name: Year, dtype: object
We got simple 4 digit years!
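Note that the dtype is still object. Assuming every row matched the pattern, you can write the result back and make it numeric:
people["Year"] = pd.to_numeric(year_ex)  # use errors='coerce' if some rows may not match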
Evaluating the income as plain numeric values
Now comes the ‘Income’ column, where we have string values that contain ‘K’ and ‘M’. For analysis or visualization of income values, numeric values will be more useful. So the ‘K’ and ‘M’ need to be replaced by three zeros and six zeros, respectively:
people["Income"].replace({"K": "*1e3", "M": "*1e6"}, regex=True).map(pd.eval).astype(int)
Output:
0 2000000
1 40000
2 120000
3 10000000
4 6000000
Name: Income, dtype: int32
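As with the year column, the expression above only displays the result. Assign it back if you want to keep it:
people["Income"] = people["Income"].replace({"K": "*1e3", "M": "*1e6"}, regex=True).map(pd.eval).astype(int)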
Standardizing the university names
In the education column, the words Harvard, Stanford, or Oxford come with some extra words or characters. If you use this data in an analysis, two different ‘Harvard’ strings will be treated as two different values. So, we need to get rid of those extra characters and keep exactly the same simple string for each university. Here is how I did it.
First, make three Boolean Series, one for each university:
edu = people["Education"]
harvard = edu.str.contains("Harvard")
stanford = edu.str.contains("Stanford")
oxford = edu.str.contains("Oxford")
Now, using nested np.where() functions, we will put ‘Harvard’ where the string contains “Harvard”, ‘Stanford’ where the string contains “Stanford”, and so on:
people["Education"] = np.where(harvard, "Harvard",
np.where(stanford, "Stanford",
np.where(oxford, "Oxford", people["Education"])))
Output:
0 Harvard
1 Oxford
2 Oxford
3 Stanford
4 Harvard
Name: Education, dtype: object
Look! We have very simple one-word names now. No other confusion!
Converting the time series data to DateTime format
Now, the “Graduation” column. To analyze date or time data in pandas, it is useful to have it in ‘datetime’ format, which I also find clearer. This is a common data cleaning task whenever there is time-series data, because it often does not arrive in a friendly format.
It is very simple to convert it to a ‘datetime’ format:
pd.to_datetime(people['Graduation'])
Output:
0 2020-03-12
1 2020-04-14
2 2020-08-01
3 2020-05-18
4 2021-05-10
Name: Graduation, dtype: datetime64[ns]
Done!
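If some entries might not be parseable, to_datetime also takes errors='coerce', which turns them into NaT (missing timestamps) instead of raising an error:
people['Graduation'] = pd.to_datetime(people['Graduation'], errors='coerce')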
Cleaning out the numbers and special characters to keep only pure text
Lastly, this ‘Standing’ column. Text data almost always requires some cleaning: typing mistakes, unnecessary numbers, and stray characters come in all the time, and you almost never get text data that is already clean. To start, we will make all the strings lowercase so that ‘Low’ and ‘low’ are not treated as different values.
people["Standing"] = people["Standing"].str.lower()
I will get rid of all the numbers and other characters using a simple regular expression (this needs Python’s re module):
import re

people["Standing"].map(lambda x: re.sub('([^a-z]+)', '', x))
Output:
0 medium
1 low
2 excellant
3 low
4 excellant
Name: Standing, dtype: object
It’s clean and simple!
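The same cleanup can also be done with pandas’ vectorized string method instead of map and re; an equivalent sketch:
people["Standing"] = people["Standing"].str.replace('[^a-z]+', '', regex=True)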
Conclusion
Data comes in many different forms, and there are many data cleaning options out there. In this article, I wanted to list some common tasks for everyday data cleaning. Having a list like this can save a lot of time. I hope to add more data cleaning options in the future.
Please feel free to follow me on Twitter and the Facebook page, and check out my new YouTube channel.
#DataScience #DataAnalytics #MachineLearning #python #pandas #DataCleaning