Decision Trees in Python Sklearn —  Classification

The decision tree is a very popular machine learning algorithm because it often performs well in predictive modeling and is easy to explain to stakeholders. It is also the basis for other tree-based algorithms. If you are interested in the details of how the decision tree algorithm works, please check this blog post:

Simple Explanation on How Decision Tree Algorithm Makes Decisions — Regenerative (regenerativetoday.com)

This tutorial will focus on the implementation of the decision tree algorithm for classification problems.

We will also learn to use the GridSearchCV method to find the right hyperparameters.

Data Preparation

We will use a clean, well-organized dataset from Kaggle for this tutorial, so not much data processing is necessary. Please feel free to download the dataset from here:

Heart Failure Prediction (kaggle.com)

Still, we need to work through some common data preparation tasks.

First, read the dataset into a pandas DataFrame:

import pandas as pd 
import numpy as np
df = pd.read_csv('heart_failure_clinical_records_dataset.csv')

The dataset contains health parameters such as age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction, high_blood_pressure, platelets, serum_creatinine, serum_sodium, sex, smoking, and time, along with the outcome column DEATH_EVENT.

The dataset is pretty clean, but I still prefer to check for null values at the very least, because null values can cause errors during training.

df.isna().sum()

Output:

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

There are zero null values in all the columns of the dataset.
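For a dataset that is less clean, the same check shows which columns need attention. Here is a minimal sketch on a made-up three-row frame. The values, and the choice of dropping rows or filling with the median, are assumptions for illustration and are not part of the heart failure data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value; not taken from the real dataset.
toy = pd.DataFrame({
    "age": [60.0, np.nan, 45.0],
    "smoking": [0, 1, 0],
})

# Count missing values per column, the same check used above.
null_counts = toy.isna().sum()
print(null_counts)

# Two common remedies: drop incomplete rows, or fill with a column statistic.
dropped = toy.dropna()
filled = toy.fillna({"age": toy["age"].median()})
print(len(dropped), filled["age"].tolist())
```

Which remedy is appropriate depends on the data, so treat both lines as options rather than a recommendation.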

Checking the Target Variable

The purpose of this model is to predict DEATH_EVENT using all the other features. So, the target variable is DEATH_EVENT. Let’s check the values in the DEATH_EVENT column.

df.DEATH_EVENT.value_counts()

Output:

0    203
1     96
Name: DEATH_EVENT, dtype: int64

There are two values in DEATH_EVENT: 0 and 1. So this is going to be a binary classification. 
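The two classes are not equally frequent, which gives a useful reference point for the accuracy scores later on. The short sketch below uses only the counts printed above and computes how accurate a trivial model would be if it always predicted the more common class. The variable names are my own:

```python
# Class counts copied from the value_counts() output above.
counts = {0: 203, 1: 96}
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(total, majority_label, round(baseline_accuracy, 3))
```

Any model worth keeping should beat this baseline of roughly 68% on unseen data.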

Define the training features and labels first. Since DEATH_EVENT holds the labels, we will use the rest of the columns as training features. Dropping DEATH_EVENT from df gives us those features.

X = df.drop(columns=['DEATH_EVENT'])
y = df['DEATH_EVENT']

We cannot use all the data for training, because once training is done we need unseen data to check the model’s performance. So the dataset has to be split into a training set and a test set. Here, 40% of the rows are held out for testing.

from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=5)
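To see what this split produces, here is a small sketch on stand-in data with the same number of rows and the same class counts as our dataset. The feature values are arbitrary placeholders. The second call adds stratify, an optional argument that the tutorial code does not use, to keep the class ratio similar in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 299 rows, 203 zeros and 96 ones, as in the dataset above.
# The single feature is just a row index, not real clinical data.
X_demo = np.arange(299).reshape(-1, 1)
y_demo = np.array([0] * 203 + [1] * 96)

# Same call as above: 40% of the rows are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=5
)
print(len(X_tr), len(X_te))

# Optional variant: stratify keeps the share of class 1 similar in both parts.
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=5, stratify=y_demo
)
print(round(y_tr_s.mean(), 3), round(y_te_s.mean(), 3))
```

With 299 rows, the split leaves 179 rows for training and 120 for testing.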

Now comes the model training part, which is pretty simple.

Model Development

First, we need to import DecisionTreeClassifier from sklearn.tree. Then we create the model, passing any necessary parameters, and fit it to the training data. To begin with, it will be interesting to see how the model performs with the default parameters. After that, we will try passing different parameters.

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

The model training is done. Notice that we didn’t pass any parameters to DecisionTreeClassifier, which means we are accepting all the default parameters.
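If you are curious what those defaults are, get_params lists them. The sketch below prints the few that matter most for tree size. The selection of names is my own:

```python
from sklearn.tree import DecisionTreeClassifier

# Inspect the defaults that an argument-free constructor call accepts.
defaults = DecisionTreeClassifier().get_params()

for name in ("criterion", "max_depth", "max_leaf_nodes",
             "min_samples_split", "min_samples_leaf"):
    print(name, "=", defaults[name])
```

With max_depth and max_leaf_nodes both left at None, nothing limits the size of the tree, so it keeps splitting until the leaves are pure or too small to split further.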

At this point, we should check the performance of the model. I will check the accuracy score on both training and test data. 

dt.score(X_train, y_train)

Output:

1.0

The accuracy score on the training data is 1.0, or 100%.

dt.score(X_test, y_test)

Output:

0.708

On the test data, the accuracy score is 0.708, or 70.8%.
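For a classifier, score returns the mean accuracy, that is, the fraction of rows whose predicted label matches the true label. The sketch below checks this on synthetic data from make_classification, which stands in for the real dataset. The data and the max_depth value are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the heart failure dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)
pred = clf.predict(X_demo)

# Three equivalent ways to compute the accuracy.
by_score = clf.score(X_demo, y_demo)
by_metric = accuracy_score(y_demo, pred)
by_hand = np.mean(pred == y_demo)

print(by_score, by_metric, by_hand)
```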

The model predicts with 100% accuracy on the training data but only 70.8% accuracy on the test data. This is a serious overfitting problem: the tree has memorized the training set and does not generalize well to new data.
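The same effect is easy to reproduce on synthetic data. The sketch below uses make_classification with deliberately noisy labels as a stand-in for the real dataset and compares an unconstrained tree with a depth-limited one. All values here are assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data: flip_y randomly reassigns a share of the labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0
)

# Unconstrained tree: free to memorize the training rows, noise included.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Constrained tree: the depth limit forces it to keep only coarse patterns.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Print depth, leaf count, training accuracy, and test accuracy for each tree.
for name, model in (("full", full), ("shallow", shallow)):
    print(name, model.get_depth(), model.get_n_leaves(),
          round(model.score(X_tr, y_tr), 3), round(model.score(X_te, y_te), 3))
```

The unconstrained tree reaches a training accuracy of 1.0 by growing many more leaves than the shallow one. How the two compare on the test rows depends on the data, which is why it is worth printing both.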

So, the default parameters didn’t serve us that well. We should try some different parameters. Please check the documentation for details about the parameters of the decision tree classifier.

I will tune two core parameters: max_depth with values 3, 4, 5, 6, and 7, and max_leaf_nodes with values 2, 3, 4, 5, and 6. Checking every combination of max_depth and max_leaf_nodes by hand would be tedious. To avoid that, it is more efficient to use GridSearchCV from the sklearn library.

GridSearchCV accepts multiple values for each parameter, evaluates every combination with cross-validation, and reports the best-performing combination.

First, import GridSearchCV and make a dictionary of the parameters, where the keys are the parameter names and the values are lists of the values to try.

from sklearn.model_selection import GridSearchCV 
parameters = {'max_depth': [3, 4, 5, 6, 7], 'max_leaf_nodes': [2, 3, 4, 5, 6]}
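To get a feel for how many candidates this dictionary describes, the sketch below expands it by hand with itertools.product. This is only an illustration of the idea of a grid, not how GridSearchCV is implemented internally:

```python
from itertools import product

parameters = {'max_depth': [3, 4, 5, 6, 7], 'max_leaf_nodes': [2, 3, 4, 5, 6]}

# Build every (max_depth, max_leaf_nodes) pair described by the dictionary.
names = list(parameters)
combinations = [
    dict(zip(names, values)) for values in product(*parameters.values())
]

print(len(combinations))
print(combinations[0], combinations[-1])
```

Five values for each of the two parameters give 25 candidate settings.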

Now, create the model again and pass the model and the parameters dictionary to GridSearchCV.

dt2 = DecisionTreeClassifier()
dt2 = GridSearchCV(dt2, parameters)

Fitting the training data to this new model:

dt2.fit(X_train, y_train)
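It helps to know what fit does here. Each parameter combination is scored with cross-validation on the training data, and the best combination is then refit on all of the training data. The sketch below shows the relevant attributes on synthetic stand-in data. It spells out cv=5, which recent scikit-learn versions use as the default, and sets random_state only to make the sketch repeatable:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the heart failure dataset.
X_demo, y_demo = make_classification(n_samples=180, n_features=12, random_state=0)

parameters = {'max_depth': [3, 4, 5, 6, 7], 'max_leaf_nodes': [2, 3, 4, 5, 6]}

# cv=5 is written out explicitly for clarity.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), parameters, cv=5)
search.fit(X_demo, y_demo)

# One cross-validated mean score per parameter combination.
mean_scores = search.cv_results_["mean_test_score"]
print(len(mean_scores), search.n_splits_)

# The best mean score, and the tree refit with the winning settings.
print(round(search.best_score_, 3))
print(search.best_estimator_)
```

Because the search object is refit with the best settings, calling predict or score on it afterwards uses that best tree.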

Checking the prediction accuracy on both training and testing data:

print("training data score: " + str(dt2.score(X_train, y_train)))
print('test data score ' + str(dt2.score(X_test, y_test)))

Output:

training data score: 0.8603351955307262
test data score 0.825

As you can see, this time the accuracy scores on the training and test data are much closer, 86% and 82.5%, which is a big improvement.

We passed several values for max_depth and max_leaf_nodes. It will be nice to see which combination GridSearchCV found to be the best among them:

dt2.best_params_

Output:

{'max_depth': 3, 'max_leaf_nodes': 2}
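One way to read this result: a tree with only two leaf nodes can make exactly one split, so the depth limit of 3 never comes into play. The sketch below fits a tree with these settings on synthetic stand-in data and prints its shape. The data is an assumption for illustration, so the chosen feature and threshold will differ from what you get on the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data, not the heart failure dataset.
X_demo, y_demo = make_classification(n_samples=180, n_features=12, random_state=0)

# Same settings as the best_params_ reported above.
stump = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=2, random_state=0)
stump.fit(X_demo, y_demo)

# Two leaves leave room for a single split, so the tree has depth 1.
print(stump.get_depth(), stump.get_n_leaves())

# Text rendering of the single decision rule.
print(export_text(stump))
```

In other words, the best model found by this grid is a single yes/no rule on one feature.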

That’s pretty much the whole workflow for developing a decision tree model for a classification task.
