Confusion matrix

How to use confusion matrices, ROC curves, and other methods to evaluate binary classifiers

Published: October 2, 2023

Intro

Evaluating the performance of a classifier is an important step in creating and training classification models. In this post, we will look at a range of evaluation methods for binary classifiers, many of which can be extended to the multiclass problem.

The tools include: the confusion matrix, precision and recall, the precision/recall (PR) curve, the ROC curve, the AUC score, the classification report, and the F1 score.

Confusion matrix

What is it?

A confusion matrix is a matrix of counts of the true and false positives and negatives for a given binary or multi-class classification problem. For example, it counts the number of times the classifier confused instances of class A with class B.

When would you use it?

Typically, you would use the confusion matrix to evaluate predictions on a validation set of data. The test set should be reserved for a final check of your classifier at the end of the project, before launch.

Why would you use it?

It gives a quick and informative view of a classifier's performance, and it helps inform the hyperparameter tuning you would do to improve the model.

An example

In this example, a very high count (9,025) of observations that are actually negative have been predicted as positive (false positives). Realistically, you would expect the large majority of counts to fall on the diagonal (true negatives and true positives), with much smaller counts in the upper-right and lower-left quadrants.

image of a confusion matrix

Python code

from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, y_train_pred)

(Source: Aurélien Géron, 2019)
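As a quick illustration, here is a minimal sketch with made-up labels showing how scikit-learn lays out the matrix (rows are the actual classes, columns are the predicted classes):

from sklearn.metrics import confusion_matrix

# Toy labels, made up purely for illustration
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# For binary 0/1 labels the layout is:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))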

Precision and Recall

The individual counts in the confusion matrix are useful; however, there are more concise measures, such as precision and recall.

precision and recall formulae

Python code

from sklearn.metrics import precision_score, recall_score
precision_score(y_train, y_train_pred)
recall_score(y_train, y_train_pred)

(Source: Aurélien Géron, 2019)
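To connect these scores back to the confusion matrix: precision is TP / (TP + FP) and recall is TP / (TP + FN). A minimal sketch, assuming y_train and y_train_pred are the same arrays used above:

from sklearn.metrics import confusion_matrix

# Unpack the binary confusion matrix into its four counts
tn, fp, fn, tp = confusion_matrix(y_train, y_train_pred).ravel()

precision = tp / (tp + fp)  # of all positive predictions, how many were correct
recall = tp / (tp + fn)     # of all actual positives, how many were found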

In the context of a breast cancer example where we are predicting whether a tumour is malignant or benign, precision answers the questions:

"when the model predicts a benign tumour, how often is the tumour actually benign?" or,

"when it predicts malignant, how often is the tumour actually malignant?"

and recall answers the questions:

"of the tumours that are actually benign, how many does the model correctly identify?" or,

"of the tumours that are actually malignant, how many does the model correctly identify?"

The precision/recall trade-off

Plotting precision and recall against the decision threshold is helpful when deciding on a threshold value. First, you get the decision scores from the classifier, and then you plot precision and recall as functions of the threshold.
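One way to obtain the decision scores is with cross_val_predict and method="decision_function"; the classifier name sgd_clf and the training arrays below follow the book's set-up and are otherwise assumptions:

from sklearn.model_selection import cross_val_predict

# Return out-of-fold decision scores (not class predictions)
# from 3-fold cross-validation
y_scores = cross_val_predict(sgd_clf, X_train, y_train, cv=3,
                             method="decision_function")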

image of a precision recall curve

(Source: Aurélien Géron, 2019, p. 97)

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], 'b--', label='Precision')
    plt.plot(thresholds, recalls[:-1], 'g-', label='Recall')
    [...] # highlight the threshold, add the legend, axis label and grid

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

(Source: Aurélien Géron, 2019)

Precision vs recall (PR) curve

The PR curve is particularly useful for imbalanced problems. In this graph, precision starts to drop sharply at around 80% recall, so you would probably want to choose a precision/recall trade-off just before that drop, perhaps around the 60% recall mark.

image of a precision recall curve

(Source: Aurélien Géron, 2019, p. 98)
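As a sketch of one way to pick the operating point, you can select the lowest threshold that achieves some target precision (the 90% target below is an arbitrary choice), reusing the precisions, thresholds, and y_scores computed above:

import numpy as np

# Lowest threshold that reaches at least 90% precision
threshold_90 = thresholds[np.argmax(precisions >= 0.90)]

# Classify by comparing the decision scores to that threshold
y_train_pred_90 = (y_scores >= threshold_90)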

Receiver operating characteristic (ROC) curve

The ROC curve is used with binary classifiers. Unlike the precision vs recall curve, it plots the true positive rate (recall/sensitivity) against the false positive rate (1 – specificity) for all possible thresholds.

ROC curve

(Source: Aurélien Géron, 2019, p. 100)

The dotted line in the middle represents a classifier that is no better than a random guess. The closer the curve is to the top-left corner, the better the classifier.

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train, y_scores)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # dashed diagonal
    [...] # Add axis labels and grid

plot_roc_curve(fpr, tpr)
plt.show()

(Source: Aurélien Géron, 2019)

PR vs ROC curves

The PR curve should be preferred whenever the positive class is rare or when you care more about the false positives than the false negatives. The ROC curve can be used for all other situations.

Area under the curve (AUC) score

The AUC score summarises the ROC curve as a single number that measures how good a classifier is. A perfect classifier has an AUC of 1, while a purely random classifier has an AUC of 0.5, so the closer the score is to 1, the better.

from sklearn.metrics import roc_auc_score
roc_auc_score(y_train, y_scores_forest)

(Source: Aurélien Géron, 2019)
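The y_scores_forest values can come from a classifier that exposes predicted probabilities rather than decision scores. A sketch, assuming a RandomForestClassifier called forest_clf and the same training arrays as before (the names follow the book's example and are otherwise assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

forest_clf = RandomForestClassifier(random_state=42)

# Out-of-fold class probabilities; column 1 is the positive class
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train, cv=3,
                                    method="predict_proba")
y_scores_forest = y_probas_forest[:, 1]

roc_auc_score(y_train, y_scores_forest)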

Classification report

The classification report includes the per-class precision, recall, and F1 scores, along with support counts and averages across classes.

annotated classification report
from sklearn.metrics import classification_report
print(classification_report(y, y_pred))

F1 score

The F1 score combines precision and recall into a single measure and is useful for comparing classifiers. It is the harmonic mean of the two, which gives more weight to low values. Consequently, a classifier will only get a high F1 score if both recall and precision are high.

f1 score formula
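In scikit-learn the F1 score is available directly; a minimal sketch using the same training labels and predictions as above:

from sklearn.metrics import f1_score

f1_score(y_train, y_train_pred)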

Keep in mind that the F1 score favours classifiers that have similar precision and recall. Depending on the context of your problem, this is not always what you want: sometimes you mostly care about precision, and at other times you mostly care about recall.

To illustrate this point, suppose you have a classifier that detects videos that are safe for children. You might prefer a model that rejects many good videos (low recall) but keeps only safe ones (high precision) over a model with higher recall that lets a few bad videos show up in your product.

As another example, consider a classifier that detects shoplifters in surveillance footage. Low precision with very high recall is probably acceptable: the security guards will get a few false alerts, but almost all shoplifters will be caught.

References

Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (2nd ed.). O'Reilly Media.