Published: October 2, 2023
Evaluating the performance of a classifier is an important step in the process of creating and training classification models. In this post, we will consider a range of methods which can be extended to the multiclass problem.
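The tools include the confusion matrix, precision and recall, the precision-recall curve, the ROC curve, the AUC score, the classification report, and the F1 score.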
A confusion matrix is a table of counts of the true and false positives and negatives for a given binary or multi-class classification problem. For example, you could count the number of times the classifier confused instances of class A with class B.
Typically, you would use the confusion matrix to evaluate the performance of predictions on a validation set of data. The test set should be reserved to check your classifier right at the end of the project before launch.
The confusion matrix is a useful tool for getting a quick and informative view of a classifier's performance, and it helps to inform the hyperparameter tuning you would do to improve the model.
In this example, we can see that a very large number (9025) of observations that are actually negative are predicted as positive, that is, false positives. Realistically, you would expect most of the counts to fall on the main diagonal, with smaller counts in the upper right and lower left quadrants.
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, y_train_pred)
(Source: Aurélien Géron, 2019)
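To make the layout of the matrix concrete, here is a minimal sketch on a tiny, made-up set of labels (the `y_true` and `y_pred` lists are hypothetical and not from the book). For a binary problem, scikit-learn puts the actual classes in the rows and the predicted classes in the columns.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration: 1 = positive class, 0 = negative class
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=4, FP=2, FN=1, TP=3
```

The correct predictions (TN and TP) sit on the main diagonal, and the two kinds of error sit off it.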
The individual counts in the confusion matrix are useful; however, there are more concise measures, such as precision and recall.
from sklearn.metrics import precision_score, recall_score
precision_score(y_train, y_train_pred)
recall_score(y_train, y_train_pred)
(Source: Aurélien Géron, 2019)
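Both scores can also be worked out by hand from the confusion matrix counts, which is a handy way to check your understanding. The sketch below uses hypothetical counts, and the helper functions `precision` and `recall` are illustrative names rather than part of scikit-learn.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are actually positive."""
    if tp + fp == 0:
        return 0.0  # no positive predictions were made
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that the classifier found."""
    if tp + fn == 0:
        return 0.0  # there are no actual positives
    return tp / (tp + fn)


# Hypothetical counts taken from a confusion matrix
tp, fp, fn = 80, 20, 10

print(f"precision = {precision(tp, fp):.3f}")  # 80 / (80 + 20) = 0.800
print(f"recall    = {recall(tp, fn):.3f}")     # 80 / (80 + 10) = 0.889
```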
In the context of the breast cancer example, where we are predicting whether a tumour is malignant or benign, precision answers the questions:
"when the model predicts benign, how often is the patient actually benign?" or,
"when it predicts malignant, how often is the tumour actually malignant?"
and recall answers the questions:
"how many of the patients who are actually benign does the model capture?" or,
"how many of the patients who are actually malignant does it capture?"
Plotting precision and recall against the decision threshold is helpful when choosing a threshold value. First, you get the decision scores for each instance, and then you plot precision and recall as functions of the threshold.
(Source: Aurélien Géron, 2019, p. 97)
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# y_scores holds the decision scores for the training instances
precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    # precisions and recalls have one more element than thresholds, so drop the last value
    plt.plot(thresholds, precisions[:-1], 'b--', label='Precision')
    plt.plot(thresholds, recalls[:-1], 'g-', label='Recall')
    # [...] highlight the threshold, add the legend, axis label and grid

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()
(Source: Aurélien Géron, 2019)
The precision vs recall curve is well suited to imbalanced problems. This graph shows a precision of approximately 90%. Precision starts to drop sharply at around 80% recall, so you would want to choose a trade-off before that drop, for example at around the 60% recall mark.
(Source: Aurélien Géron, 2019, p. 98)
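Once you have decided on the trade-off you want, you can look up the matching threshold in the arrays returned by `precision_recall_curve`. The sketch below assumes a target precision of 80% and uses small, made-up labels and scores; `target_precision`, `y_true` and `y_scores` are illustrative assumptions, not values from the book.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

# Hypothetical labels and decision scores for illustration
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
y_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.65, 0.7, 0.8, 0.9])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

target_precision = 0.80  # assumed target for this sketch

# precisions has one more element than thresholds, so drop the last value
meets_target = precisions[:-1] >= target_precision
if not meets_target.any():
    raise ValueError("no threshold reaches the target precision")

# thresholds are sorted in increasing order, so argmax returns the
# lowest threshold that reaches the target
threshold = thresholds[np.argmax(meets_target)]

# Apply the chosen threshold to turn scores into predictions
y_pred = (y_scores >= threshold).astype(int)

print(f"threshold = {threshold}")
print(f"precision = {precision_score(y_true, y_pred):.3f}")
print(f"recall    = {recall_score(y_true, y_pred):.3f}")
```

Picking the lowest qualifying threshold keeps recall as high as possible for the precision you asked for. Note that precision does not always rise smoothly with the threshold, so it is worth checking the resulting scores rather than assuming them.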
The ROC curve is used with binary classifiers. Unlike the precision vs recall curve, it plots the true positive rate (recall/sensitivity) against the false positive rate (1 – specificity) for all possible thresholds.
(Source: Aurélien Géron, 2019, p. 100)
The dashed line in the middle represents a classifier that is no better than a random guess. The closer the curve is to the top-left corner, the better the classifier.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train, y_scores)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # dashed diagonal
    # [...] add axis labels and grid

plot_roc_curve(fpr, tpr)
plt.show()
(Source: Aurélien Géron, 2019)
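Each point on the ROC curve is just a (false positive rate, true positive rate) pair computed at one threshold. As a rough sketch of what happens behind the plot, the hypothetical helper `tpr_fpr` below computes that pair by hand for a tiny, made-up data set.

```python
import numpy as np


def tpr_fpr(y_true, y_scores, threshold):
    """Return (true positive rate, false positive rate) at one threshold."""
    actual = np.asarray(y_true).astype(bool)
    predicted = np.asarray(y_scores) >= threshold

    tp = np.sum(predicted & actual)
    fn = np.sum(~predicted & actual)
    fp = np.sum(predicted & ~actual)
    tn = np.sum(~predicted & ~actual)

    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical labels and decision scores for illustration
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# Lowering the threshold moves the point up and to the right on the ROC curve
for threshold in (0.8, 0.4, 0.35, 0.1):
    tpr, fpr = tpr_fpr(y_true, y_scores, threshold)
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

A lower threshold catches more of the actual positives (higher TPR) but also lets through more false alarms (higher FPR), which is the trade-off the curve visualises.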
The precision-recall (PR) curve should be preferred whenever the positive class is rare or when you care more about the false positives than the false negatives. The ROC curve can be used for all other situations.
The AUC score (the area under the ROC curve) is used to measure how good a classifier is. A perfect classifier has an AUC of 1, whereas a purely random classifier has an AUC of 0.5, so a score closer to 1 indicates a better classifier.
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train, y_scores_forest)
(Source: Aurélien Géron, 2019)
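To see how the score relates to the curve, the sketch below computes the AUC in two ways on made-up data: directly with `roc_auc_score`, and by integrating the points returned by `roc_curve` with `sklearn.metrics.auc`. The labels and scores are hypothetical.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and decision scores for illustration
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# Directly from labels and scores
auc_direct = roc_auc_score(y_true, y_scores)

# Equivalent: area under the ROC curve points (trapezoidal rule)
fpr, tpr, _ = roc_curve(y_true, y_scores)
auc_from_curve = auc(fpr, tpr)

# Scores that rank every positive above every negative give a perfect AUC
auc_perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])

print(auc_direct, auc_from_curve, auc_perfect)  # 0.75 0.75 1.0
```

One way to read the 0.75 here: of the four possible (positive, negative) pairs, the classifier scores the positive higher in three of them.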
The classification report includes the precision and recall scores for each class, alongside other calculations such as the F1 score and averages across classes (for example, the macro and weighted averages).
from sklearn.metrics import classification_report
# classification_report returns a formatted string, so print it to display the table
print(classification_report(y, y_pred))
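The report also works for multiclass problems, where it gives one row per class. As a minimal sketch on made-up labels (the animal classes are purely illustrative), passing `output_dict=True` returns the same numbers as a dictionary, which is easier to work with in code.

```python
from sklearn.metrics import classification_report

# Hypothetical multiclass labels for illustration
y_true = ["cat", "dog", "cat", "bird", "dog", "cat"]
y_pred = ["cat", "dog", "dog", "bird", "dog", "cat"]

# Human-readable table
print(classification_report(y_true, y_pred))

# Same numbers as a nested dictionary keyed by class label
report = classification_report(y_true, y_pred, output_dict=True)

print(report["cat"]["recall"])     # 2 of the 3 actual cats were found
print(report["dog"]["precision"])  # 2 of the 3 "dog" predictions were right
print(report["accuracy"])          # 5 of the 6 predictions were correct
```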
The F1 score combines precision and recall into a single measure and is useful for comparing classifiers. It is the harmonic mean of precision and recall, which gives more weight to low values than the regular mean does. Consequently, a classifier will only get a high F1 score if both recall and precision are high.
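A quick way to see the effect of the harmonic mean is to compare it with the ordinary (arithmetic) mean. The sketch below uses a hypothetical helper `f1` and made-up precision and recall values.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Balanced classifier: both scores are reasonably high
balanced = f1(0.8, 0.8)

# Lopsided classifier: perfect precision but very poor recall
lopsided = f1(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print(f"balanced F1     = {balanced:.3f}")         # 0.800
print(f"lopsided F1     = {lopsided:.3f}")         # 0.182
print(f"arithmetic mean = {arithmetic_mean:.3f}")  # 0.550
```

The arithmetic mean makes the lopsided classifier look mediocre, while the F1 score makes it clear that the low recall drags the whole score down.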
Keep in mind that the F1 score favours classifiers that have similar precision and recall. Depending on the context of your problem, this is not always ideal: in some cases you mostly care about precision, and in others you mostly care about recall.
To illustrate this point, suppose you have a classifier that detects videos that are safe for children. You would probably prefer a model that rejects many good videos (low recall) but keeps only the safe ones (high precision) over a model that has a high recall but lets a few bad videos show up in your product.
As another example, suppose you have a classifier that detects shoplifters in surveillance footage. It is probably fine for it to have a low precision as long as it has a very high recall: the security guards will get a few false alerts, but almost all shoplifters will be caught.