Null values are a big problem in machine learning and deep learning. If you are using sklearn, TensorFlow, or any other machine learning or deep learning packages, it is required to clean up null values before you pass your data to the machine learning or deep learning framework. Otherwise, it will give you a long and ugly error message.
In this article, we will work on how to handle null values. First, there will be some very simple methods and slowly we will move toward some more complex and more efficient ways.
To demonstrate the handling of null values, We will use the famous titanic dataset.
import pandas as pd import numpy as np import seaborn as sns titanic = sns.load_dataset("titanic") titanic
The preview is already showing some null values. Let’s check how many null values are there in each column:
survived 0 pclass 0 sex 0 age 177 sibsp 0 parch 0 fare 0 embarked 2 class 0 who 0 adult_male 0 deck 688 embark_town 2 alive 0 alone 0 dtype: int64
The age column has 177 and the embark_town column has 2 null values. But the deck column has the most null values 688 out of 891 rows of data. I would like to delete this column completely for machine learning or any other type of data analytics purpose.
We will focus on the age and embark_town columns and deal with the null values of those columns.
I will start with the most simple strategy and slowly move towards the more complicated tricks.
1. Simply Drop
The most simple strategy is to drop the rows with null values if you have a good enough size of data using this simple code:
But the titanic dataset is not too big. In fact in the real world also we do not have the luxury of deleting the data lots of time because we do not have enough data left after deleting all rows with null values.
2. Filling with Zeros
Another very easy and simple way. You can fill up all the null values with zeros to make the process really simple. We can fill up the null values in the age column with zeros like this:
0 22.0 1 38.0 2 26.0 3 35.0 4 35.0 ... 886 27.0 887 19.0 888 0.0 889 26.0 890 32.0 Name: age, Length: 891, dtype: float64
Look at row 888. It was null before and now it is zero. This is also a very naive approach. Especially, in this case, age cannot be zero.
3. Forward and Backward Fill
This is also a common technique to fill up the null values. Forward fill means, the null value is filled up using the previous value in the series and backward fill means the null value is filled up with the next value in the series.
0 22.0 1 38.0 2 26.0 3 35.0 4 35.0 ... 886 27.0 887 19.0 888 19.0 889 26.0 890 32.0 Name: age, Length: 891, dtype: float64
Please notice, that row 888 is now 19, the same as the 887.
0 22.0 1 38.0 2 26.0 3 35.0 4 35.0 ... 886 27.0 887 19.0 888 26.0 889 26.0 890 32.0 Name: age, Length: 891, dtype: float64
This row 888 has taken the value from row 889.
4. Mean and Median Fill
I prefer this one in most cases, filling up the null values by mean and median. Here I am using the median:
titanic['age'].fillna(titanic['age'].median(), inplace = True) titanic['age']
0 22.0 1 38.0 2 26.0 3 35.0 4 35.0 ... 886 27.0 887 19.0 888 28.0 889 26.0 890 32.0 Name: age, Length: 891, dtype: float64
Now the null value in row 888 is 28, which is the median of the column age.
5. Mean and Median Fill with Groupby
Putting the median or mean of the whole column was the simple approach. But I like a bit more specific approach to the median and mean. Instead of taking the median of the whole age column and filling up all the null values, filling up the null values using the mean age of each pclass and ‘alive’ will be more accurate.
Let’s see the mean age of ‘pclass’ and ‘alive’ if they vary at all:
pclass alive 1 no 43.695312 yes 35.368197 2 no 33.544444 yes 25.901566 3 no 26.555556 yes 20.646118 Name: age, dtype: float64
Yes, they do vary a lot.
It will be more accurate if we fill up the null values of each of these groups with the corresponding means.
0 22.00 1 38.00 2 26.00 3 35.00 4 35.00 ... 886 27.00 887 19.00 888 21.75 889 26.00 890 32.00 Name: age, Length: 891, dtype: float64
Now, row 888 became 21. Compared to 28
Categorical Null Values Imputation
The embark_town column is a categorical column and we can still apply the same process as the previous example with it.
But before that the values of the ‘embark_town’ column need to be expressed as numeric values:
titanic['embark_town'] = titanic['embark_town'].astype('category') titanic['embark_town'] = titanic['embark_town'].cat.codes titanic['embark_town']
0 2 1 0 2 2 3 2 4 2 .. 886 2 887 2 888 2 889 0 890 1 Name: embark_town, Length: 891, dtype: int8
I will group the data by ‘pclass’ and ‘alive’ and fill up the null values by the median.
titanic['embark_town'] = titanic['embark_town'].fillna(titanic.groupby(['pclass', 'alive'])['embark_town'].transform('median'))
6. Iterative Imputation With a Machine Learning Model
This is a very good and efficient way of imputing the null values. In this process, null values in each column get filled up. Each column is used as the label of a specified machine learning model one by one.
Then the rows with non-null values are used to train a machine learning model and the rows with null values are predicted using that trained model.
For this demonstration, I am using only a few columns. Remember in the beginning we checked which columns have null values and we saw age, embark_town, and deck columns have null values.
The column deck had too many null values though and we wanted to avoid that column. But for the demonstration purpose, I am adding the deck column also in the titanic1 dataset.
I am using RandomForestRegressor here. You can use any other regression model.
from sklearn.experimental import enable_iterative_imputer from sklearn.impute import IterativeImputer from sklearn.ensemble import RandomForestRegressor titanic1 = titanic[['survived', 'pclass', 'age', 'sibsp', 'fare', 'embark_town']]
Let’s do the imputation now and I will save the imputed dataset as titanic2.
imptr = IterativeImputer(RandomForestRegressor(), max_iter=10, random_state=0) titanic2 = pd.DataFrame(imptr.fit_transform(titanic1), columns = titanic1.columns) titanic2
The age in 888 row is 32 now!
We should check if there is any null values anymore in titanic2 dataset:
survived 0 pclass 0 age 0 sibsp 0 fare 0 deck 0 embark_town 0 dtype: int64
There are no null values in any of the columns anymore.
These are all the tips I wanted to share about null values today.
If you want you can choose individual techniques for individual columns as well. Please feel free to share if you find any other techniques to deal with null values more efficient.