Model evaluation metrics

--

source: byjus.com/chemistry/accuracy-and-precision-difference/

In this article I will cover the following:

  1. Why accuracy can’t be used to indicate a model’s performance in all scenarios
  2. What is a confusion matrix
  3. What is precision
  4. What is recall
  5. How do precision and recall differ from accuracy and among themselves

Consider a situation where you have to identify fraudulent bank transactions: out of 100,000 transactions, only 2 or 3 will be fraudulent. If our model classifies all of the transactions as legit, the accuracy of the model will be above 99%, yet it won't be a good model.
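To make this concrete, here is a minimal sketch of that situation (it uses scikit-learn and NumPy, which are my additions and not part of the article): an "always legit" classifier scores above 99% accuracy while catching zero frauds.

```python
# A minimal sketch (scikit-learn assumed) of how an "always legit" classifier
# reaches very high accuracy on an imbalanced dataset.
import numpy as np
from sklearn.metrics import accuracy_score

# 100,000 transactions, only 3 of them fraudulent (1 = fraud, 0 = legit)
y_true = np.zeros(100_000, dtype=int)
y_true[:3] = 1

# A useless "model" that labels every transaction as legit
y_pred = np.zeros(100_000, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.99997 -> 99.997% accuracy, yet 0 frauds caught
```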

Now consider the problem of classifying images of cats and dogs, where there are 4,000 images of cats and 3,000 images of dogs. If a model classifies them with an accuracy of 95%, the model's performance is good.

This doesn't mean that accuracy is never an indication of a model's performance. In scenarios where the classes are nearly evenly distributed, like the cat and dog classification where both classes have nearly the same number of examples, accuracy is a good measure of the model's performance. In scenarios where the number of examples per class differs greatly, accuracy is not a good measure. These scenarios are known as class imbalance.

There are other measures, such as recall, precision, and the F1 score, which can be used along with accuracy to measure the performance of a model. These measures can also be used in scenarios where the classes are balanced. Before looking into them, we have to get familiar with some terminology.

source: www.dataschool.io/simple-guide-to-confusion-matrix-terminology/

In the figure above, "prediction" stands for whether the output of the model is yes or no, and "actual" stands for whether the real answer is yes or no. This gives four different situations based on the model's output.

If we consider the fraudulent transaction classifier:

  1. The input is non-fraudulent and the model predicts it as non-fraudulent: TrueNegative
  2. The input is non-fraudulent and the model predicts it as fraudulent: FalsePositive
  3. The input is fraudulent and the model predicts it as non-fraudulent: FalseNegative
  4. The input is fraudulent and the model predicts it as fraudulent: TruePositive

For ease of remembering, you can read TrueNegative as "true, predicted negative": the word before "predicted" tells you whether the prediction is correct, and the word after it tells you whether the prediction is positive or negative. The structure shown in the figure above is known as the confusion matrix. The confusion matrix gives us the number of records for each of the four cases.
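As an illustration, here is a short sketch (scikit-learn assumed, with made-up toy labels that are not from the article) of how the four counts can be read off a confusion matrix:

```python
# A small sketch (scikit-learn assumed) of reading TN/FP/FN/TP off a confusion matrix.
from sklearn.metrics import confusion_matrix

# 1 = fraudulent, 0 = legit (toy data for illustration)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]

# With labels=[0, 1], rows are the actual class and columns the predicted class:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)  # 5 1 1 3
```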

Using the confusion matrix, we can measure the performance of the model more precisely. Consider the bank transactions again: if the model works well, it should be able to pick out the fraudulent transactions from the legit ones, i.e. it should identify most of the fraudulent transactions. That means the following ratio should be high:

number of correctly identified fraudulent transactions / total number of fraudulent transactions

If we translate the above expression into the newly learned terminology:

TruePositive / (TruePositive + FalseNegative)

This is defined as recall in machine learning. Recall is the ratio between the correctly identified positives and the total positives. In the bank transaction scenario, it is the ratio between the number of correctly classified fraudulent records and the total number of fraudulent records. If the recall is high, the model's ability to correctly identify fraudulent transactions is also high.
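For example, recall can be computed directly from the counts or with scikit-learn's recall_score (again, the library and the toy labels below are my additions, reusing the data from the previous snippet):

```python
# A short sketch (scikit-learn assumed) computing recall two ways on toy data.
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 3
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 1

print(tp / (tp + fn))                # 0.75
print(recall_score(y_true, y_pred))  # 0.75, same value
```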

Having a good recall alone will not guarantee good model performance. In the fraudulent transaction scenario, a higher recall is good, but there is a catch: consider the situation where all of the bank transactions are classified as fraudulent. Then, as per the recall equation, there are no false negatives and the recall will be at its maximum. This approach clearly won't give us a better model.

We also have to pay attention to the records being wrongly classified, because the model can falsely classify every record as positive just to increase the recall. So attention should be paid to the records which are wrongly classified as positive, in other words the FalsePositives. For the fraud classifier, the following ratio captures this:

number of correctly classified fraudulent transactions / total number of transactions predicted as fraudulent

This measure will decrease if the model wrongly classifies legit transactions as fraudulent. In terms of the confusion matrix, it can be written as:

TruePositive / (TruePositive + FalsePositive)

This is defined as precision in machine learning. Precision measures the percentage of correctly labeled positives out of the total predicted positives. In the scenario above, if the model classifies all the transactions as fraudulent to keep the recall high, the precision will go down.
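Here is a small sketch of that trade-off (scikit-learn assumed, toy data made up): predicting every transaction as fraudulent pushes recall to 1.0 while precision collapses.

```python
# A sketch (scikit-learn assumed) of the recall/precision trade-off described above.
from sklearn.metrics import precision_score, recall_score

# 1 = fraudulent, 0 = legit; only 3 of 20 transactions are fraudulent
y_true = [0] * 17 + [1] * 3

# Degenerate model: everything is predicted as fraud
y_pred = [1] * 20

print(recall_score(y_true, y_pred))     # 1.0  -> looks perfect
print(precision_score(y_true, y_pred))  # 0.15 -> only 3 of the 20 flagged are real frauds
```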

The relationships between precision, recall, accuracy, and other measures such as the F1 score will be covered in the next article.
