An Overview of Performance Evaluation Metrics of Machine Learning(Classification) Algorithms in Python
Performance evaluation for machine learning models

In my opinion, performance evaluation is the most important part of machine learning. Building a model has become fairly easy thanks to the many libraries and packages available, and anyone can train one without knowing much about what goes on behind the scenes. Evaluating it is where the challenge lies: how do you measure the performance of that machine learning model?

Software like Weka provides many performance evaluation metrics automatically as you build the model. In other tools, such as sklearn or R packages, they do not come automatically with the model. You have to choose the metrics on which you want to evaluate your model.

So I decided to write this article, which summarizes the popular performance evaluation metrics for classification models, to save you some time.

In this article, I will briefly explain the performance evaluation metrics for classification models with formulas, simple explanations, and calculations on a practical example. I will not dive too deep into them, as this is an overview or a cheat sheet.

We need a machine learning model on which we will try all our performance evaluation metrics. So, first, we will develop a model and then work on the performance evaluation metrics one by one.

This article covers:

  1. Selection of the features
  2. Model Development
  3. Performance Evaluation Methods

I will use this dataset about arthritis. Besides the arthritis label, it contains other parameters, which we will use to predict whether a person has arthritis. Please feel free to download the dataset from this link.

Let’s focus on model development. But before that, we need to select the features.

Feature Selection

Here is the dataset for this project:

import pandas as pd
import numpy as np
df = pd.read_csv('arthritis.csv')

This DataFrame is so big that I cannot show a screenshot here. It has a total of 108 columns. These are the columns:



Index(['x.aidtst3', 'employ1', 'income2', 'weight2', 'height3', 'children', 'veteran3', 'blind', 'renthom1', 'sex1',
'x.denvst3', 'x.prace1', 'x.mrace1', 'x.exteth3', 'x.asthms1',
'x.michd', 'x.ltasth1', 'x.casthm1', 'x.state', 'havarth3'],
dtype='object', length=108)

The last column ‘havarth3’ is the target variable that we want to predict using the classifier. This column tells us if a person has arthritis or not. It has two values. The value 1 means the person has arthritis and value 2 means the person does not have arthritis.

The rest of the features are the input parameters.

X= df.drop(columns=["havarth3"])
y= df['havarth3']

Using the train_test_split method from the scikit-learn library, split the dataset into a training set and test set:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state = 35)

We have the training and test set now. But do we need all 108 training features? Maybe not.

I chose 9 of these features using the SelectKBest function available in the sklearn library.

I did not choose the number 9 arbitrarily. I compared several other feature selection methods, and tried different numbers of features with this method as well, before settling on 9.

Please feel free to check my article on feature selection. I provided the link at the end.

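from sklearn.feature_selection import SelectKBest, f_classif
uni = SelectKBest(score_func=f_classif, k=9)
fit = uni.fit(X, y)  # assumed arguments: this call is incomplete in the source
reduced_training = fit.transform(x_train)  # assumed definition of the name used below
reduced_test = fit.transform(x_test)  # assumed definition of the name used below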

These are the columns chosen by the feature selection method above:




These 9 features will be used for the classifier.

Model Development

There are many classifiers available. I chose a Random Forest classifier here. It is pretty simple with the sklearn library: import the classifier, pass the hyperparameters, and fit it to the training data.

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=6, random_state=0).fit(reduced_training, y_train)

The classifier is done! Now we move on to the main purpose of all this: learning the performance evaluation metrics.

Performance Evaluation Metrics for Machine Learning


Accuracy

The first metric anyone thinks of is the accuracy rate. For this project, we want to know what share of people, with or without arthritis, were classified correctly. Let’s check the test set first.

clf.score(reduced_test, y_test)



It’s 74.17%. Not bad! Now let’s compare it with the accuracy on the training set:

clf.score(reduced_training, y_train)



It is 74.67%. Training and test set accuracy are very similar, so there is no sign of overfitting here.


Confusion Matrix

Often we need to deal with very sensitive data. For example, when working on a dataset used to diagnose cancer patients, it is very important to diagnose correctly.

Accuracy can be deceiving in those cases because the dataset can be very skewed. Maybe 98% or more of the records are negative, meaning those patients do not have cancer, and only a small fraction are positive. In that case, if our classifier's accuracy is 98%, what does that mean?

It could mean that the classifier labeled every cancer-negative patient correctly and failed to diagnose a single cancer-positive patient. The accuracy still shows 98%. But is the classifier effective in that case? Not at all.

So it is important to know both the percentage of cancer-positive patients classified correctly and the percentage of cancer-negative patients classified correctly.
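To make the point concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (1 = positive, 0 = negative, with 2% positives), and the "classifier" simply predicts negative for everyone.

```python
# Hypothetical labels: 1 = positive (has the disease), 0 = negative.
y_true = [1] * 20 + [0] * 980          # 2% positive, 98% negative

# A "classifier" that always predicts the negative class.
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that match the actual label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Share of the actual positives that the classifier found.
positives_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
positive_rate = positives_found / sum(y_true)

print(accuracy, positive_rate)  # prints: 0.98 0.0
```

The accuracy looks excellent, yet not a single positive case is detected, which is why accuracy alone is not enough on skewed data.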

A confusion matrix is a 2×2 matrix of four numbers that breaks the results down into true positives, false positives, true negatives, and false negatives. Here are the definitions:

True Positives: The number of positive samples that are correctly predicted as positive.

False Positives: The number of samples that are actually negative but that the classifier predicted as positive.

True Negatives: The number of negative samples that are correctly predicted as negative.

False Negatives: The number of positive samples that the classifier incorrectly predicted as negative.
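The four counts defined above can be obtained by directly comparing actual and predicted labels. The sketch below does that in plain Python; confusion_counts is a hypothetical helper written for illustration, and the toy labels follow this article's coding (1 = arthritis, 2 = no arthritis).

```python
def confusion_counts(y_true, y_pred, positive):
    """Count tp, fp, tn, fn by comparing actual and predicted labels."""
    counts = {'tp': 0, 'fp': 0, 'tn': 0, 'fn': 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct if the actual label is positive too.
            counts['tp' if actual == positive else 'fp'] += 1
        else:
            # Predicted negative: a miss if the actual label is positive.
            counts['fn' if actual == positive else 'tn'] += 1
    return counts

# Toy labels: 1 = arthritis (positive), 2 = no arthritis (negative).
y_true = [1, 1, 1, 2, 2, 2, 2, 1]
y_pred = [1, 2, 1, 2, 1, 2, 2, 2]
counts = confusion_counts(y_true, y_pred, positive=1)
print(counts)  # prints: {'tp': 2, 'fp': 1, 'tn': 3, 'fn': 2}
```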

You can compare the actual labels and the predicted labels to find all of those. But here I will import and use the confusion_matrix function.

Below is a function that calls confusion_matrix and labels the outputs as ‘tp’ (true positive), ‘fp’ (false positive), ‘fn’ (false negative), and ‘tn’ (true negative) using a dictionary.

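def confusion_matrix_score(clf, X, y):
    y_pred = clf.predict(X)
    # Rows are actual labels and columns are predicted labels, in sorted label
    # order, so index 0 is label 1 (arthritis, the positive class here).
    cm = confusion_matrix(y, y_pred)
    return {'tp': cm[0, 0], 'fn': cm[0, 1],
            'fp': cm[1, 0], 'tn': cm[1, 1]}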

Next, simply import the confusion_matrix function and use the function above:

from sklearn.metrics import confusion_matrix
cm = confusion_matrix_score(clf, reduced_test, y_test)


{'tp': 692, 'fn': 651, 'fp': 397, 'tn': 2318}

These four parameters will be used to find several other performance evaluation metrics.

True Positive Rate(TPR) and False Positive Rate(FPR)

The true positive rate is the true positives (TP) divided by the total positives (P). The confusion matrix already gives us the true positives. The total positives are the sum of the true positives and the false negatives.

true positive rate(TPR) = true positive / (true positive + false negative)

True positive rate is also known as Sensitivity.

In the same way the false positive rate is the false positives divided by the sum of false positives and true negatives.

false-positive rate(FPR) = false positives / (false positives + true negatives)

Both TPR and FPR are important metrics. In this project, they tell us what share of the people with arthritis are detected correctly and what share of the people without arthritis are wrongly flagged as arthritis patients. To calculate them, let’s extract tn, fp, fn, and tp from the confusion matrix dictionary ‘cm’ we calculated above.

tn, fp, fn, tp = cm['tn'], cm['fp'], cm['fn'], cm['tp']
tpr = tp/(tp+fn)  # true positives over all actual positives
fpr = fp/(fp+tn)  # false positives over all actual negatives
(round(tpr, 4), round(fpr, 4))


(0.5153, 0.1462)

The true positive rate is 51.53% and the false positive rate is 14.62%.

ROC Curve and Area Under the Curve(AUC)

The ROC curve (Receiver Operating Characteristic curve) shows the trade-off between the true positive rate (TPR) and the false positive rate (FPR). The TPR and FPR are calculated for different classification thresholds, and this series of TPR and FPR values is plotted to get the ROC curve. The closer the area under this curve (AUC) is to 1, the more skillful the model is considered, while an AUC of 0.5 corresponds to random guessing.

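So, here is how to plot the ROC curve and find the area under the curve:

from sklearn.metrics import RocCurveDisplay
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
RocCurveDisplay.from_estimator(clf, reduced_test, y_test)

The plot shows that the area under the curve (AUC) is 0.8.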
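To show what the threshold sweep behind a ROC curve looks like, here is a minimal sketch in plain Python. The labels and scores are made up, and roc_points and trapezoid_auc are hypothetical helpers written for this illustration, not scikit-learn functions.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per threshold, ordered for plotting.

    y_true uses 1 for the positive class and 0 for the negative class.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Use each distinct score as a threshold, from strictest to loosest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy data: predicted probability of the positive class for six samples.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
roc_auc = trapezoid_auc(points)
print(round(roc_auc, 4))
```

Lowering the threshold moves the curve from (0, 0) toward (1, 1): more positives are caught, at the cost of more false alarms.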

Precision and Recall

In this project, precision is the fraction of the people predicted to have arthritis who actually have it. The formula can be written as:

precision = true positives / (true positives + false positives)

You can see from the formula that higher precision means more true positives and fewer false positives.

Recall, on the other hand, is the fraction of all the people who actually have arthritis that the classifier detects as arthritis patients. If you are seeing this for the first time, the definition may look confusing. Here is the formula:

recall = true positives / (true positives + false negatives)

So higher recall means more true positives and fewer false negatives. Recall is also called sensitivity.

Let’s calculate precision and recall:

precision = tp/(tp+fp)  # share of predicted positives that are correct
recall = tp/(tp+fn)  # share of actual positives that are found
(round(precision, 4), round(recall, 4))


(0.6354, 0.5153)

As you can see, higher precision means fewer false positives and higher recall means fewer false negatives. When you optimize a machine learning model, you need to choose which direction to go: fewer false positives or fewer false negatives. That depends on the project requirements.

F1 Score

The F1 score is the harmonic mean of precision and recall. The formula looks like this:

F1 = 2 * precision * recall / (precision + recall)

Notice the formula. When precision and recall are both perfect, the F-score is 1.

A more general formula for the F-score is:

F-beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)

This formula is used when either precision or recall needs to weigh more than the other. Three commonly used values of beta are 1, 2, and 0.5: 1 when precision and recall weigh the same, 2 when recall weighs more than precision, and 0.5 when recall weighs less than precision.

I will use betas of 1 and 2 and calculate the F1 and F2 values for this demonstration:

f1_score = 2*precision*recall/(precision + recall)
f2_score = 5*precision*recall/(4*precision + recall)
(round(f1_score, 4), round(f2_score, 4))


(0.5691, 0.5355)
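The general formula with beta can be wrapped in a small function. This is only a sketch: f_beta is a hypothetical helper, and the precision and recall values are made up to show how beta shifts the score.

```python
def f_beta(precision, recall, beta=1.0):
    """General F-score: beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when both are zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up example where recall is much better than precision.
p, r = 0.5, 1.0
scores = {beta: f_beta(p, r, beta) for beta in (0.5, 1, 2)}
for beta, value in scores.items():
    print(beta, round(value, 4))
```

Because recall is the stronger of the two numbers here, the score rises as beta grows and recall gets more weight.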

Precision-Recall Curve

The precision-recall curve shows the trade-off between precision and recall. A high area under the precision-recall curve means both high precision and high recall. To draw it, precision and recall are calculated at different classification thresholds.
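As a rough illustration of what happens at each threshold, the sketch below computes precision and recall by hand for two thresholds. The labels and scores are made up, and precision_recall_at is a hypothetical helper, not a library function.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall

# Toy data: 1 = positive, 0 = negative, with a predicted score per sample.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

strict = precision_recall_at(y_true, scores, 0.75)  # few, confident positives
loose = precision_recall_at(y_true, scores, 0.5)    # more predicted positives
print(strict, loose)
```

The strict threshold gives perfect precision but misses a positive, while the loose one finds every positive at the cost of a false positive. Each threshold contributes one point to the curve.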

But no problem. We don’t have to calculate precision and recall for different thresholds manually. We can simply use the functions available in the sklearn library, which give us the curve and also the area under the curve.

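from sklearn.metrics import PrecisionRecallDisplay
PrecisionRecallDisplay.from_estimator(clf, reduced_test, y_test)  # replaces plot_precision_recall_curve


The curve shows that the Average Precision (AP) is 0.89. By default, scikit-learn treats clf.classes_[1] (label 2 here, no arthritis) as the positive class for this curve. Pass pos_label=1 to plot it for the arthritis class instead.

Here is the area under the precision-recall curve:

from sklearn.metrics import auc, precision_recall_curve
scores = clf.predict_proba(reduced_test)[:, 1]  # probability of clf.classes_[1]
pr, rc, _ = precision_recall_curve(y_test, scores, pos_label=clf.classes_[1])
auc(rc, pr)  # auc needs arrays of points, not the scalar precision and recall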



Matthews Correlation Coefficient(MCC)

MCC is another great performance evaluation metric for binary classification. It takes into account all four values of the confusion matrix: true positives, true negatives, false positives, and false negatives. It returns a value between -1 and 1. A value of 1 means a perfect classifier, 0 means it is no better than random guessing, and -1 means total disagreement between the original labels and the predicted labels. Here is the formula:

MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

The calculation for this project using Python:

mcc = (tn*tp - fn*fp)/np.sqrt((tn+fn)*(tn+fp)*(tp+fn)*(tp+fp))



It can also be calculated with the following function, which takes the original y and the predicted y as parameters:

from sklearn.metrics import matthews_corrcoef
y_pred = clf.predict(reduced_test)  # y_pred was only defined inside confusion_matrix_score
matthews_corrcoef(y_test, y_pred)
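For reference, the same formula can be written without NumPy or scikit-learn. In this sketch, mcc_from_counts is a hypothetical helper; the first call uses the four counts from the confusion matrix reported earlier in this article, and the second uses made-up counts for a classifier that labels everything negative on a 98% negative dataset.

```python
import math

def mcc_from_counts(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when a whole row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Counts from the confusion matrix shown earlier in this article.
mcc_article = mcc_from_counts(tp=692, fp=397, tn=2318, fn=651)

# Made-up counts: everything predicted negative, 20 positives all missed.
mcc_all_negative = mcc_from_counts(tp=0, fp=0, tn=980, fn=20)

print(round(mcc_article, 4), mcc_all_negative)
```

The all-negative classifier scores 98% accuracy but an MCC of 0, which is what makes MCC useful on skewed data.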




All the metrics above are shown in a binary classification setting. But the same metrics can be used on multi-class classification problems as well, with an approach called one-vs-all. Say you are calculating precision: you set one of the classes as the positive class and the rest of the classes as the negative class. That way the problem becomes binary.
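Here is a minimal sketch of that one-vs-all idea for precision, in plain Python. The three-class labels are made up, and one_vs_rest_precision is a hypothetical helper written for illustration.

```python
def one_vs_rest_precision(y_true, y_pred, positive):
    """Precision for one class, treating every other class as negative."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if tp + fp else 0.0

# Toy three-class problem.
y_true = ['a', 'a', 'b', 'b', 'c', 'c']
y_pred = ['a', 'b', 'b', 'b', 'c', 'a']

# Take each class in turn as the positive class.
per_class = {label: one_vs_rest_precision(y_true, y_pred, label)
             for label in sorted(set(y_true))}
for label, value in per_class.items():
    print(label, round(value, 4))
```

Each class gets its own binary precision, and the same pattern works for recall or any of the other metrics above.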

Hope this discussion on performance evaluation helped. Please feel free to follow me on Twitter.
