Developing an effective machine learning algorithm with a skewed dataset can be tricky. Take fraud detection at a bank or cancer detection as examples. In such a dataset, about 99% of the records show no fraudulent activity or no cancer. You can easily cheat and predict 0 all the time (where 1 means cancer and 0 means no cancer) to get 99% accuracy. That gives us a 99% accurate machine learning algorithm that never detects cancer. If someone has cancer, they will never get treatment. In the bank, there will be no action against fraudulent activities. So, for a skewed dataset like that, accuracy alone cannot tell us whether the algorithm is working well.
Background
There are other evaluation metrics that can help with these types of datasets. They are called precision and recall.
To learn precision and recall, you need to understand the following four terms. Consider a binary classifier. It returns either 0 or 1. For a given training example, if the actual class is 1 and the predicted class is also 1, that is called a true positive (TP). If the actual class is 0 and the predicted class comes out to be 1, that is a false positive (FP). If the actual class is 1 but the predicted class is 0, it is called a false negative (FN). If both the actual class and the predicted class are 0, that is a true negative (TN). Put together as a 2 x 2 table:
Actual \ Predicted | Predicted 1 | Predicted 0
Actual 1 | True positive (TP) | False negative (FN)
Actual 0 | False positive (FP) | True negative (TN)
Using all these, we are going to calculate the precision and recall.
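Here is a minimal sketch of how these four counts could be tallied in plain Python. The function name confusion_counts and the small label lists are made up for illustration; they are not from any particular library or dataset.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # true positive
        elif actual == 0 and predicted == 1:
            fp += 1  # false positive
        elif actual == 1 and predicted == 0:
            fn += 1  # false negative
        else:
            tn += 1  # true negative
    return tp, fp, fn, tn


# Made-up labels for illustration only
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)
```

The four counts always add up to the number of examples, which is a handy sanity check.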
Precision
Precision tells us what fraction of the transactions we predicted as fraudulent (predicted class 1) are actually fraudulent. Precision can be calculated using the following formula:
Precision = True Positives / Total Predicted Positives
Breaking it down further, this formula can be written as:
Precision = True Positives / (True Positives + False Positives)
You can see from the formula that higher precision is good, because higher precision means fewer of our positive predictions are false positives. That means when we say that a transaction is fraudulent, it is likely to be true.
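As a quick sketch, precision can be computed from the true positive and false positive counts like this. The function name and the counts are hypothetical, and returning 0 when nothing is predicted positive is an assumption I am making, since the ratio is undefined in that case.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    predicted_positives = tp + fp
    if predicted_positives == 0:
        # Assumption: report 0 when the classifier never predicts 1
        return 0.0
    return tp / predicted_positives


# Hypothetical counts: 30 transactions flagged, 24 of them truly fraudulent
p = precision(tp=24, fp=6)
print(p)
```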
Recall
Recall tells us what fraction of all the transactions that are actually fraudulent are detected as fraudulent. In other words, when a transaction is actually fraudulent, did we flag it so that the proper authority at the bank could take action? When I first read these definitions of precision and recall, it took me some time to really understand the difference. I hope you are getting it faster. If not, then don’t worry. You are not alone.
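Recall can be calculated by the following formula:
Recall = True Positives / Total Actual Positives
Expressed in the terms defined in the 2 x 2 table above:
Recall = True Positives / (True Positives + False Negatives)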
Making Decisions From Precision And Recall
Precision and recall give a better sense of how an algorithm is actually doing, especially when we have a highly skewed dataset. If we predict 0 all the time and get 99% accuracy, there are no true positives. So the recall will be 0, and the precision, which is undefined when nothing is predicted as positive, is conventionally reported as 0 too. So, you know that classifier is not a good classifier. When precision and recall are both high, that is an indication that the algorithm is doing very well.
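To make this concrete, here is a small sketch of the always-predict-0 classifier on a made-up skewed dataset (990 negatives, 10 positives). The numbers are invented for illustration. Recall is computed as true positives divided by true positives plus false negatives.

```python
# Hypothetical skewed dataset: 990 negatives and 10 positives
y_true = [0] * 990 + [1] * 10
y_pred = [0] * len(y_true)  # the "cheating" classifier: always predict 0

pairs = list(zip(y_true, y_pred))
tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)
fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 0)
correct = sum(1 for actual, predicted in pairs if actual == predicted)

accuracy = correct / len(y_true)
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

print("accuracy:", accuracy)
print("recall:", recall)
```

The accuracy looks excellent, yet none of the 10 positive cases is caught.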
Suppose we want to predict y = 1 only when we are highly confident. There are cases where this is very important, especially when we are dealing with medical data. Assume that we are detecting if someone has heart disease or cancer. A false positive can bring a lot of pain into a person’s life. As a reminder, logistic regression generally predicts 1 if the hypothesis is greater than or equal to 0.5 and predicts 0 if the hypothesis is less than 0.5.
Predict 1 if hypothesis ≥ 0.5
Predict 0 if hypothesis < 0.5
But when we deal with a sensitive situation like the ones mentioned and want to be more sure about our result, we predict 1 if hypothesis ≥ 0.7 and predict 0 if hypothesis < 0.7. If you want to be even more confident about your result, you can set the threshold to a value like 0.9. Then you predict 1 only when the model estimates at least a 90% probability that somebody has cancer.
Now, have a look at the precision and recall formulas. With a higher threshold we predict fewer positives, so both true positives and false positives will be lower. False positives usually fall by a larger share, so precision will usually be higher. But on the other hand, false negatives will be higher because we will predict more negatives now. In that case, the recall will be lower. Too many false negatives are also not good. If someone actually has cancer or an account has fraudulent activity but we are telling them that they do not have cancer or that the account does not have fraudulent activity, that could lead to a disaster.
To avoid false negatives and achieve higher recall, we need to change the threshold to something like this:
Predict 1 if hypothesis ≥ 0.3
Predict 0 if hypothesis < 0.3
As opposed to the previous case, we will have higher recall and lower precision.
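The trade-off can be sketched with a few lines of Python. The labels and scores below are made up, and precision_recall_at is a hypothetical helper, not a library function. The scores stand in for the hypothesis output of a model such as logistic regression.

```python
def precision_recall_at(threshold, y_true, scores):
    """Predict 1 when score >= threshold, then return (precision, recall)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    # Assumption: report 0 when a ratio would be undefined
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Made-up labels and model scores for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.75, 0.55, 0.35, 0.65, 0.45, 0.40, 0.20, 0.10, 0.05]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(threshold, y_true, scores)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 moves precision up and recall down, which is the pattern described above. Real data will not always be this tidy.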
How to decide the threshold then? It will depend on what your requirement is. Based on your dataset, you have to decide if you need higher precision or higher recall. Here is the precision-recall curve:

The precision-recall curve can be of any shape. So, I am showing three different shapes here. If you cannot decide for yourself if you need higher precision or higher recall, you can use the F1 score.
F1 Score
The F1 score is an average of precision and recall, but not the regular average. It is the harmonic mean of the two. The regular average formula does not work here. Look at the regular average formula:
(Precision + Recall) / 2
Even if the precision is close to zero, the regular average can still be about 0.5, as long as the recall is 1. Remember from our previous discussion how that can happen: we can always predict y = 1, which gives a recall of 1 and a very low precision. That should not be acceptable, because the whole precision-recall idea is to avoid that. The formula for the F1 score is:
F1 = 2 * P * R / (P + R)
Here, P is precision and R is the recall. If the precision is zero or recall is zero, the F1 score will be zero. So, you will know that the classifier is not working as we wanted. When the precision and recall both are perfect, that means precision is 1 and recall is also 1, the F1 score will be 1 also. So, the perfect F1 score is 1. It is a good idea to try with different thresholds and calculate the precision, recall, and F1 score to find out the optimum threshold for your machine learning algorithm.
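Here is a rough sketch of that last step, assuming the F1 score is the harmonic mean of precision and recall, that is, 2 * P * R / (P + R). The precision and recall values per threshold are hypothetical numbers, not results from a real model.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Always predicting 1 on a skewed dataset: recall is 1, precision is tiny
always_one = (0.01, 1.0)
regular_average = sum(always_one) / 2
print("regular average:", regular_average)
print("F1:", f1_score(*always_one))

# Hypothetical (precision, recall) pairs measured at three thresholds
results = {0.3: (0.57, 1.00), 0.5: (0.75, 0.75), 0.7: (1.00, 0.50)}
best_threshold = max(results, key=lambda t: f1_score(*results[t]))
print("best threshold by F1:", best_threshold)
```

The regular average rewards the always-predict-1 classifier with about 0.5, while its F1 score stays close to zero. Picking the threshold with the highest F1 is one reasonable default when you have no strong preference between precision and recall.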
Conclusion
In this article, you learned how to evaluate a classifier on a skewed dataset and how to balance precision and recall using the F1 score. I hope it was helpful.