Decision Tree Classifier - sklearn

Decision Trees in Python Sklearn —  Classification

Decision Tree is a very popular machine learning algorithm because it performs well on many predictive modeling tasks and is easy to explain to stakeholders. It is also the basis for other tree-based algorithms. If you are interested in the details of how the decision tree algorithm works, please check this blog post:

Simple Explanation on How Decision Tree Algorithm Makes Decisions — Regenerative (regenerativetoday.com)

This tutorial will focus on implementing the decision tree algorithm for classification problems.

We will also learn to use GridSearchCV to find the right hyperparameters.

Data Preparation

We will use a very clean and organized dataset from Kaggle for this tutorial, so not much data processing is necessary. Please feel free to download the dataset from here:

Heart Failure Prediction (kaggle.com)

But still, we need to work on some common data preparation tasks.

First, read the dataset into a DataFrame format in Pandas:

`import pandas as pd`
`import numpy as np`
`df = pd.read_csv('heart_failure_clinical_records_dataset.csv')`

The dataset has different health parameters like age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction, high_blood_pressure, platelets, serum_creatinine, serum_sodium, sex, smoking, and time, plus the DEATH_EVENT outcome.

The dataset is pretty clean, but I still prefer to check for null values at least, because null values can cause errors during the training process.

`df.isna().sum()`

Output:

`age                         0`
`anaemia                     0`
`creatinine_phosphokinase    0`
`diabetes                    0`
`ejection_fraction           0`
`high_blood_pressure         0`
`platelets                   0`
`serum_creatinine            0`
`serum_sodium                0`
`sex                         0`
`smoking                     0`
`time                        0`
`DEATH_EVENT                 0`
`dtype: int64`

There are zero null values in all the columns of the dataset.

Checking the Target Variable

The purpose of this model is to predict DEATH_EVENT using all the other features. So, the target variable is DEATH_EVENT. Let’s check the values in the DEATH_EVENT column.

`df.DEATH_EVENT.value_counts()`

Output:

`0    203`
`1     96`
`Name: DEATH_EVENT, dtype: int64`

There are two values in DEATH_EVENT: 0 and 1. So this is going to be a binary classification.
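The two classes are not the same size, so it helps to know what accuracy a model would get by always predicting the larger class. Here is a minimal sketch of that calculation. It rebuilds the counts by hand from the output above instead of reading the CSV, and the names `counts`, `proportions` and `baseline_accuracy` are only illustrative.

```python
import pandas as pd

# Class counts copied by hand from the value_counts() output above.
counts = pd.Series({0: 203, 1: 96}, name="DEATH_EVENT")

# Share of each class in the dataset.
proportions = counts / counts.sum()

# A model that always predicts the larger class reaches this accuracy.
baseline_accuracy = proportions.max()

print(proportions.round(3))
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

About 68% of the rows belong to class 0, so any accuracy score later in the tutorial is best read against that baseline.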

Define the training features and labels first. Since DEATH_EVENT holds the labels, we will use the rest of the columns as training features. Dropping DEATH_EVENT from df gives us the training features.

`X = df.drop(columns=['DEATH_EVENT'])`
`y = df['DEATH_EVENT']`

We cannot use all the data for training, because once training is done we also need data to check the performance of the model. So the dataset needs to be split into training and testing sets.

`from sklearn.model_selection import train_test_split`
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=5)`
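As a side note, train_test_split also has a `stratify` argument that keeps the class proportions roughly equal in both parts. The tutorial itself does not use it, so treat the sketch below as an optional variation. It runs on a placeholder array with the same class counts as DEATH_EVENT, not on the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labels with the same class counts as DEATH_EVENT (203 zeros, 96 ones).
y_demo = np.array([0] * 203 + [1] * 96)
# Placeholder single feature, only here so there is something to split.
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Same test_size and random_state as in the tutorial, plus stratify (my addition).
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=5, stratify=y_demo
)

print("rows:", len(Xs_train), "train /", len(Xs_test), "test")
print("share of class 1:", round(ys_train.mean(), 3), "train /", round(ys_test.mean(), 3), "test")
```

With 299 rows and test_size=0.4, the split gives 179 training rows and 120 test rows.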

Now comes the model training part, which is pretty simple.

Model Development

First, we need to import DecisionTreeClassifier from sklearn.tree. Then create the model and pass the necessary parameters. In the beginning, it will be interesting to see how the model performs with the default parameters. Then we will try passing different parameters.

Then fit the model to the training data.

`from sklearn.tree import DecisionTreeClassifier`
`dt = DecisionTreeClassifier()`
`dt.fit(X_train, y_train)`

The model training is done. Notice that we didn’t pass any parameters to DecisionTreeClassifier. That means we are accepting all the default parameters.
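To see what accepting the defaults means in practice, a quick sketch like this one prints a few of them. It relies only on get_params(), which every sklearn estimator provides. The variable name `default_tree` and the selection of parameters printed are my own choices for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# An untrained classifier created with no arguments, as in the tutorial.
default_tree = DecisionTreeClassifier()
params = default_tree.get_params()

# Print the settings that control how large the tree is allowed to grow.
for name in ["criterion", "max_depth", "max_leaf_nodes",
             "min_samples_split", "min_samples_leaf"]:
    print(name, "=", params[name])
```

Both max_depth and max_leaf_nodes default to None, so the tree is allowed to keep splitting until its leaves are pure. That is worth keeping in mind when reading the training score.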

At this point, we should check the performance of the model. I will check the accuracy score on both training and test data.

`dt.score(X_train, y_train)`

Output:

`1.0`

The accuracy score on the training data is 1.0, or 100%.

`dt.score(X_test, y_test)`

Output:

`0.708`

On the test data, the accuracy score is 0.708, or 70.8%.

The model predicts with 100% accuracy on the training data and only 70.8% accuracy on the test data. This is a serious overfitting problem.
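The same pattern can be reproduced without the dataset at hand. The sketch below compares an unconstrained tree with one limited to a depth of 3. It uses synthetic data from make_classification as a stand-in, so the numbers it prints are not the tutorial's numbers, and all the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the heart failure dataset): 299 rows, 12 features,
# with about 10% of the labels flipped so a perfect fit means memorizing noise.
X_demo, y_demo = make_classification(
    n_samples=299, n_features=12, n_informative=4, flip_y=0.1, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=5
)

# An unconstrained tree versus a tree limited to depth 3.
deep_tree = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xd_train, yd_train)

for label, model in [("unconstrained", deep_tree), ("max_depth=3", shallow_tree)]:
    print(label,
          "| depth:", model.get_depth(),
          "| train:", round(model.score(Xd_train, yd_train), 3),
          "| test:", round(model.score(Xd_test, yd_test), 3))
```

The unconstrained tree fits its training rows perfectly because it can keep splitting until every leaf is pure. The gap between its training and test scores is the overfitting described above.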

So, the default parameters didn’t serve us well. We should try some different parameters. Please check the documentation for details about the parameters of the decision tree classifier.

I will tune two core parameters: max_depth with values 3, 4, 5, 6, 7 and max_leaf_nodes with values 2, 3, 4, 5, 6. Checking every combination of max_depth and max_leaf_nodes one by one would be tedious. To avoid that, it is more efficient to use GridSearchCV from the sklearn library.

GridSearchCV accepts multiple values for each parameter and finds the best combination among them.

First import GridSearchCV and make a dictionary of the parameters, where the keys are the parameter names and the values are lists of the values to try.

`from sklearn.model_selection import GridSearchCV`
`parameters = {'max_depth': [3, 4, 5, 6, 7], 'max_leaf_nodes': [2, 3, 4, 5, 6]}`
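To get a feel for how much work this grid represents, here is a small sketch that expands the dictionary with sklearn's ParameterGrid helper. ParameterGrid is not used in the tutorial itself; I bring it in only to list the candidates. The fold count of 5 assumes GridSearchCV is left at its default cv setting.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'max_depth': [3, 4, 5, 6, 7], 'max_leaf_nodes': [2, 3, 4, 5, 6]}

# Expand the dictionary into every combination of the listed values.
grid = ParameterGrid(parameters)
print("combinations:", len(grid))

# Show the first few candidates.
for candidate in list(grid)[:3]:
    print(candidate)

# Assumption: GridSearchCV's default 5-fold cross-validation is used,
# so each candidate is fitted once per fold.
n_folds = 5
print("cross-validation fits:", len(grid) * n_folds)
```

Five values for each of two parameters gives 25 combinations, which means 125 fits under 5-fold cross-validation. That is the tedious part GridSearchCV takes care of.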

Now, create the model again and pass the model and the parameters to GridSearchCV.

`dt2 = DecisionTreeClassifier()`
`dt2 = GridSearchCV(dt2, parameters)`

Fitting the training data to this new model:

`dt2.fit(X_train, y_train)`

Checking the prediction accuracy on both training and testing data:

`print("training data score: " + str(dt2.score(X_train, y_train)))`
`print('test data score ' + str(dt2.score(X_test, y_test)))`

Output:

`training data score: 0.8603351955307262`
`test data score 0.825`

As you can see, this time the accuracy scores on the training and test data are much closer, 86% and 82.5%, which is a big improvement.

We passed several values for max_depth and max_leaf_nodes. It will be nice to see which combination GridSearchCV found to be the best among them:

`dt2.best_params_`

Output:

`{'max_depth': 3, 'max_leaf_nodes': 2}`
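Besides best_params_, a fitted GridSearchCV object exposes a few more attributes that are handy for inspecting the search. The sketch below is one way to look at them. It runs on synthetic data from make_classification, not on the heart failure dataset, so its best parameters can differ from the output above. The `feature_names` list is made up for the printout.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (NOT the heart failure dataset).
X_demo, y_demo = make_classification(n_samples=299, n_features=12, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

parameters = {'max_depth': [3, 4, 5, 6, 7], 'max_leaf_nodes': [2, 3, 4, 5, 6]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), parameters)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("mean cross-validated accuracy of the best candidate:",
      round(search.best_score_, 3))
print("candidates evaluated:", len(search.cv_results_["params"]))

# best_estimator_ is the winning tree, refitted on all the data passed to fit().
print(export_text(search.best_estimator_, feature_names=feature_names))
```

One thing worth noticing in the tutorial's result: a tree with max_leaf_nodes of 2 has exactly one split, so the best model found there makes its prediction from a single rule on one feature.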

That’s pretty much it for developing a decision tree model for a classification task.
