Taking the Confusion out of the Confusion Matrix
July 26, 2016
[Revised 21 September 2017, 24 December 2019]
Contents
Introduction
Bayes’ theorem
The prior
The likelihood
The model evidence
The posterior
Summary
Joint probabilities
Likelihood ratios
When is a classifier useful?
Performance metric estimation
Effect of prevalence on classifier training and testing
Introduction
I always thought that the confusion matrix was rather aptly named, a reference not so much to the mixed performance of a classifier as to my own bewilderment at the number of measures of that performance. Recently however, I encountered a brief mention of the possibility of a Bayesian interpretation of performance measures, and this inspired me to explore the idea a little further. It’s not that Bayes’ theorem is needed to understand or apply the performance measures, but it acts as an organizing and connecting principle, which I find helpful. New concepts are easier to remember if they fit inside a good story.
Another advantage of taking the Bayesian route is that this forces us to view performance measures as probabilities, which are estimated from the confusion matrix. Elementary presentations tend to define performance metrics in terms of ratios of confusion matrix elements, thereby ignoring the effect of statistical fluctuations.
Bayes’ theorem is not the only way to generate performance metrics. One can also start from joint probabilities or likelihood ratios. The next three subsections describe these approaches one by one. This is followed by a subsection discussing a minimal condition for a classifier to be useful, a subsection on estimating performance measures, and one on the effect of changing the prevalence in the training and testing data sets of a classifier.
Bayes’ theorem
Let’s start from Bayes’ theorem in its general form:

$$P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}$$

Here $x$ represents the observed data and $\theta$ a parameter of interest. On the left-hand side, $P(\theta \mid x)$ is the posterior probability density of $\theta$ given $x$. On the right-hand side, $P(x \mid \theta)$ is the likelihood, or conditional probability density of $x$ given $\theta$, and $P(\theta)$ is the prior probability density of $\theta$. The denominator $P(x)$ is called the model evidence, or marginal likelihood. One way to think about Bayes’ theorem is that it uses the data $x$ to update the prior information $P(\theta)$ about $\theta$, and returns the posterior $P(\theta \mid x)$.
For a binary classifier, $\theta$ is the true class $C$ to which instance $x$ belongs, and $C$ can take only two values, say 0 and 1. Bayes’ theorem then simplifies to:

$$P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x \mid C=0)\, P(C=0) + P(x \mid C=1)\, P(C=1)}$$

To describe the performance of a classifier, the instance $x$ could be summarized either by the class label $L$ assigned to it by the classifier, or by the score $s$ (a number between 0 and 1) assigned by the classifier to the hypothesis that $x$ belongs to class 1. Thus, for example, $P(C=1 \mid L=0)$ represents the posterior probability that the true class is 1 given a class label of 0, and $P(C=1 \mid s > t)$ is the posterior probability that the true class is 1 given that the score is above a prespecified threshold $t$.
Next we’ll take a look at each component of Bayes’ theorem in turn: the prior, the likelihood, the model evidence, and finally the posterior. Each of these maps to a classifier performance measure or a population characteristic.
The prior
To simplify the notation, let’s write $C_i$ to signify $C = i$, for $i = 0, 1$, and similarly with other variables. The first ingredient in the computation of the posterior is the prior $P(C_1)$. To fix ideas, let’s assume that $C_1$ is the class of interest, generically labeled “positive”; it indicates “effect”, “signal”, “disease”, “fraud”, or “spam”, depending on the context. Then $C_0$ indicates the lack of these things and is generically labeled “negative”. To fully specify the prior we just need $P(C_1)$, since $P(C_0) = 1 - P(C_1)$. The prior probability $P(C_1)$ of drawing a positive instance from the population is called the prevalence. It is a property of the population, not the classifier.

To simplify the notation even further, let’s use $\pi$ for $P(C_1)$ and $\bar{\pi} = 1 - \pi$ for $P(C_0)$.
The likelihood
The next ingredient is the conditional probability $P(x \mid C)$. For now let’s work with classifier labels and replace $x$ by $L$. This leads to the following four definitions:

- The true positive rate, also known as sensitivity or recall: $P(L_1 \mid C_1)$,
- The true negative rate, also known as specificity: $P(L_0 \mid C_0)$,
- The false-positive or Type-I-error rate: $P(L_1 \mid C_0)$, and
- The false-negative or Type-II-error rate: $P(L_0 \mid C_1)$.
For example, $P(L_1 \mid C_0)$ is the conditional probability that the classifier assigns a label of 1 to an instance actually belonging to class 0. Note that the above four rates are independent of the prevalence and are therefore intrinsic properties of the classifier.

When viewed as a function of the true class $C$, for fixed label $L$, the conditional probability $P(L \mid C)$ is known as the likelihood function. This functional aspect of $P(L \mid C)$ is not really used here, but it is good to remember the terminology.
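These rates are straightforward to estimate from paired observations of true class and assigned label. Here is a minimal sketch in plain Python; the data are made up for illustration:

```python
# Estimate the four conditional rates P(L | C) from (true class, label) pairs.
# The pairs below are made-up data for illustration only.
pairs = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 0), (0, 1), (0, 0), (1, 1)]

n_pos = sum(1 for c, _ in pairs if c == 1)  # instances with C = 1
n_neg = sum(1 for c, _ in pairs if c == 0)  # instances with C = 0

tpr = sum(1 for c, l in pairs if c == 1 and l == 1) / n_pos  # sensitivity
tnr = sum(1 for c, l in pairs if c == 0 and l == 0) / n_neg  # specificity
fpr = 1 - tnr  # Type-I error rate
fnr = 1 - tpr  # Type-II error rate

print(tpr, tnr, fpr, fnr)  # → 0.75 0.75 0.25 0.25
```

Note that the prevalence of the sample never enters these four estimates, which is another way of seeing that the rates are intrinsic to the classifier.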
The model evidence
Finally, we need the normalization factor in Bayes’ theorem, the model evidence. For a binary classifier this has two components, corresponding to the two possible labels:

- The positive labeling rate: $P(L_1) = P(L_1 \mid C_1)\, \pi + P(L_1 \mid C_0)\, \bar{\pi}$, and
- The negative labeling rate: $P(L_0) = P(L_0 \mid C_1)\, \pi + P(L_0 \mid C_0)\, \bar{\pi}$.
The positive labeling rate is more commonly known as the queue rate, especially when it is calculated as the fraction of instances for which the classifier score is above a prespecified threshold $t$.

Again to simplify the notation, I’ll write $\lambda$ for $P(L_1)$ and $\bar{\lambda} = 1 - \lambda$ for $P(L_0)$ in the following.
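As a quick illustration, the queue rate can be computed from scores and a threshold, and checked against its decomposition into likelihoods and prior; the scores and classes below are made up:

```python
# Queue rate: fraction of instances whose score exceeds a threshold t.
# Scores and true classes are made-up data for illustration.
scores  = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
classes = [1,   1,   0,   1,   0,   1,   0,   0]
t = 0.5

labels = [1 if s > t else 0 for s in scores]
queue_rate = sum(labels) / len(labels)  # estimate of P(L1)

# Consistency check: P(L1) = P(L1|C1)*pi + P(L1|C0)*(1 - pi)
pi  = sum(classes) / len(classes)
tpr = sum(l for l, c in zip(labels, classes) if c == 1) / sum(classes)
fpr = sum(l for l, c in zip(labels, classes) if c == 0) / (len(classes) - sum(classes))
assert abs(queue_rate - (tpr * pi + fpr * (1 - pi))) < 1e-12

print(queue_rate)  # → 0.5
```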
The posterior
Armed with the prior probability, the likelihood, and the model evidence, we can compute the posterior probability from Bayes’ theorem. There are four combinations of truth and label:

- The positive predictive value, also known as precision: $P(C_1 \mid L_1) = P(L_1 \mid C_1)\, \pi / \lambda$,
- The negative predictive value: $P(C_0 \mid L_0) = P(L_0 \mid C_0)\, \bar{\pi} / \bar{\lambda}$,
- The false discovery rate: $P(C_0 \mid L_1) = P(L_1 \mid C_0)\, \bar{\pi} / \lambda$, and
- The false omission rate: $P(C_1 \mid L_0) = P(L_0 \mid C_1)\, \pi / \bar{\lambda}$.
The precision, for example, quantifies the predictive value of a positive label by answering the question: if we see a positive label, what is the (posterior) probability that the true class is positive? (Note that the predictive values are not intrinsic properties of the classifier, since they depend on the prevalence. Sometimes they are “standardized” by evaluating them at $\pi = 1/2$.)
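The posterior measures follow directly from the prevalence, sensitivity, and specificity. A small sketch with made-up numbers shows how a low prevalence can drag down the precision even for a good classifier:

```python
# Posterior performance measures from Bayes' theorem, given the prevalence
# and the two intrinsic rates. The numbers are made up for illustration.
pi  = 0.1   # prevalence P(C1)
tpr = 0.9   # sensitivity P(L1|C1)
tnr = 0.8   # specificity P(L0|C0)

lam = tpr * pi + (1 - tnr) * (1 - pi)  # positive labeling rate P(L1)

ppv = tpr * pi / lam              # precision P(C1|L1)
npv = tnr * (1 - pi) / (1 - lam)  # negative predictive value P(C0|L0)
fdr = 1 - ppv                     # false discovery rate P(C0|L1)
fomr = 1 - npv                    # false omission rate P(C1|L0)

print(round(ppv, 3))  # → 0.333: only 1 in 3 positive labels is a true positive
```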
Figure 1 shows how the posterior probabilities $P(C_1 \mid L_1)$, $P(C_0 \mid L_0)$, $P(C_0 \mid L_1)$, and $P(C_1 \mid L_0)$ are related to the likelihoods $P(L_1 \mid C_1)$, $P(L_0 \mid C_0)$, $P(L_1 \mid C_0)$, and $P(L_0 \mid C_1)$ via Bayes’ theorem.
Summary
This section introduced twelve quantities: the prevalence $\pi$ and its complement $\bar{\pi}$; the four rates $P(L_1 \mid C_1)$, $P(L_0 \mid C_0)$, $P(L_1 \mid C_0)$, and $P(L_0 \mid C_1)$; the two labeling rates $\lambda$ and $\bar{\lambda}$; and the four posteriors $P(C_1 \mid L_1)$, $P(C_0 \mid L_0)$, $P(C_0 \mid L_1)$, and $P(C_1 \mid L_0)$.

Only three of these are independent, say $\pi$, $P(L_1 \mid C_1)$, and $P(L_0 \mid C_0)$. To see this, note that by virtue of their definition the first six quantities satisfy three conditions:

$$\pi + \bar{\pi} = 1, \qquad P(L_1 \mid C_1) + P(L_0 \mid C_1) = 1, \qquad P(L_0 \mid C_0) + P(L_1 \mid C_0) = 1,$$

and the last six can be expressed in terms of the first six: for $\lambda$ and $\bar{\lambda}$ we have

$$\lambda = P(L_1 \mid C_1)\, \pi + P(L_1 \mid C_0)\, \bar{\pi}, \qquad \bar{\lambda} = 1 - \lambda,$$

and Bayes’ theorem takes care of the remaining four (Figure 1).
The final number of independent quantities matches the number of degrees of freedom of a two-by-two contingency table such as the confusion matrix, which is used to estimate all twelve quantities (see below). It also tells us that we only need two numbers to fully characterize a binary classifier (since the prevalence $\pi$ is not a classifier property). As trivial examples, consider the majority and minority classifiers. Assume that class 0 is more prevalent than class 1. Then the majority classifier assigns the label 0 to every instance and has $P(L_1 \mid C_1) = 0$ and $P(L_0 \mid C_0) = 1$. The minority classifier assigns the label 1 to every instance and has $P(L_1 \mid C_1) = 1$ and $P(L_0 \mid C_0) = 0$. These classifiers are of course not very useful. As we will show below, for a classifier to be useful, it must satisfy $P(L_1 \mid C_1) > P(L_1 \mid C_0)$.
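To make the three-degrees-of-freedom claim concrete, here is a sketch that reconstructs all twelve quantities from the prevalence, the sensitivity, and the specificity alone; the dictionary keys are hypothetical shorthands for the quantities defined above:

```python
# All twelve quantities of this section from the three independent ones:
# prevalence pi, sensitivity tpr = P(L1|C1), specificity tnr = P(L0|C0).
def metrics(pi, tpr, tnr):
    fpr, fnr = 1 - tnr, 1 - tpr
    lam = tpr * pi + fpr * (1 - pi)  # positive labeling rate P(L1)
    return {
        "prevalence": pi, "prevalence_c": 1 - pi,
        "tpr": tpr, "tnr": tnr, "fpr": fpr, "fnr": fnr,
        "plr": lam, "nlr": 1 - lam,  # labeling rates
        "ppv": tpr * pi / lam, "npv": tnr * (1 - pi) / (1 - lam),
        "fdr": fpr * (1 - pi) / lam, "for": fnr * pi / (1 - lam),
    }

m = metrics(pi=0.2, tpr=0.8, tnr=0.9)
print(round(m["ppv"], 3))  # → 0.667
```

(The trivial majority and minority classifiers would make one of the labeling rates zero, so their predictive values for the never-assigned label are undefined; the sketch assumes both labeling rates are nonzero.)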
Joint probabilities
A couple of joint probabilities lead to performance metrics with very different properties.
Consider first the joint probability of class label and true class, which can be decomposed in terms of likelihood and prior:

$$P(L, C) = P(L \mid C)\, P(C).$$

Projecting this joint probability on the axis “$L = C$” yields the accuracy, or the probability to correctly identify any instance, negative or positive:

$$\text{accuracy} = P(L_1, C_1) + P(L_0, C_0) = P(L_1 \mid C_1)\, \pi + P(L_0 \mid C_0)\, \bar{\pi}.$$

The last expression on the right shows that the accuracy depends on the prevalence and is therefore not an intrinsic property of the classifier. An equivalent measure is the misclassification rate, defined as one minus the accuracy. A benchmark that is sometimes used is the null error rate, defined as the misclassification rate of a classifier that always predicts the majority class. It is equal to $\min(\pi, \bar{\pi})$. The complement of the null error rate is the null accuracy, which is equal to $\max(\pi, \bar{\pi})$.
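A quick sketch of these definitions, with made-up rates:

```python
# Accuracy and the null benchmarks as functions of prevalence.
# The prevalence and rates are made up for illustration.
pi, tpr, tnr = 0.3, 0.8, 0.9

accuracy = tpr * pi + tnr * (1 - pi)  # P(L = C)
misclassification_rate = 1 - accuracy
null_error_rate = min(pi, 1 - pi)     # always predict the majority class
null_accuracy = 1 - null_error_rate

print(round(accuracy, 2), null_accuracy)  # → 0.87 0.7
```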
The second joint probability to consider is that of the scores assigned by the classifier to two independently drawn instances, one from the positive class and one from the negative class. A good measure of performance is the probability that the positive instance is scored higher than the negative instance. This measure equals the Area Under the Receiver Operating Characteristic curve (AUROC), where the ROC is a curve of the true positive rate as a function of the false positive rate (or sensitivity versus one-minus-specificity). By construction, the AUROC is independent of the prevalence, and does not require a choice of threshold on the classifier score $s$. In other words, it does not depend on the chosen operating point of the classifier.
There is actually a relationship between the AUROC and the accuracy, via an expectation. Imagine an ensemble of binary classifiers like the one considered here, but with thresholds distributed like the scores of the instances in the population we are trying to classify. The expected (i.e. average) accuracy of the classifiers in this ensemble is given by:

$$E[\text{accuracy}] = \pi\, (1 - \pi)\, (2\, \mathrm{AUROC} - 1) + \frac{1}{2},$$

and is thus linearly related to the AUROC. In a sense, the AUROC aggregates the classifier accuracy information in a way that’s independent of prevalence (see this answer by Peter Flach on Quora for a more detailed explanation).
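This relation can be checked numerically: draw score samples for the two classes (the distributions below are arbitrary), estimate the AUROC by pairwise comparison, and average the accuracy over an ensemble of thresholds drawn from the pooled scores:

```python
import random

random.seed(0)
# Made-up score distributions: positives tend to score higher than negatives.
pos = [random.betavariate(4, 2) for _ in range(200)]  # class-1 scores
neg = [random.betavariate(2, 4) for _ in range(200)]  # class-0 scores
pi = len(pos) / (len(pos) + len(neg))

# AUROC = probability that a random positive outscores a random negative
auroc = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

# Accuracy of the classifier that labels scores above threshold t as positive
def acc(t):
    tpr = sum(s > t for s in pos) / len(pos)
    tnr = sum(s <= t for s in neg) / len(neg)
    return pi * tpr + (1 - pi) * tnr

# Average accuracy over thresholds distributed like the pooled scores
avg_acc = sum(acc(t) for t in pos + neg) / (len(pos) + len(neg))
predicted = pi * (1 - pi) * (2 * auroc - 1) + 0.5

print(abs(avg_acc - predicted) < 1e-6)  # → True
```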
Likelihood ratios
In some fields it is customary to work with ratios of probabilities rather than with the probabilities themselves. Thus one defines:

- The positive likelihood ratio: $LR^{+} = P(L_1 \mid C_1)\,/\,P(L_0 \mid C_1)$, which represents the odds of a positive label if the true class is positive (in medical statistics, for example, this would be the odds of a positive test if the patient has the disease).
- The negative likelihood ratio: $LR^{-} = P(L_1 \mid C_0)\,/\,P(L_0 \mid C_0)$, which represents the odds of a positive label if the true class is negative (in medical statistics, the odds of a positive test if the patient does not have the disease).

Finally, it is instructive to take the ratio of these two likelihood ratios. This yields

- The diagnostic odds ratio: $\mathrm{DOR} = \dfrac{LR^{+}}{LR^{-}} = \dfrac{P(L_1 \mid C_1)\, P(L_0 \mid C_0)}{P(L_0 \mid C_1)\, P(L_1 \mid C_0)}$.

If $\mathrm{DOR} > 1$, the classifier is deemed useful, since this means that, in our medical example, the odds of a positive test are higher for a patient with the disease than for a healthy patient. If $\mathrm{DOR} = 1$ the classifier is useless, and if $\mathrm{DOR} < 1$ the classifier is worse than useless: it is misleading. In this case one would be better off swapping the labels on the classifier output.

Note that the ratios $LR^{+}$, $LR^{-}$, and $\mathrm{DOR}$ are all independent of the prevalence.
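A short sketch of these ratios, following the definitions above and using made-up rates:

```python
# Likelihood ratios and diagnostic odds ratio from the intrinsic rates,
# following the definitions in the text. The rates are made up.
tpr, tnr = 0.9, 0.8
fpr, fnr = 1 - tnr, 1 - tpr

lr_plus = tpr / fnr   # odds of a positive label given a positive true class
lr_minus = fpr / tnr  # odds of a positive label given a negative true class
dor = lr_plus / lr_minus

print(dor > 1)  # → True: the classifier is useful
```

None of the inputs involve the prevalence, which makes the prevalence-independence of all three ratios explicit.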
When is a classifier useful?
The usefulness condition $P(L_1 \mid C_1) > P(L_1 \mid C_0)$ is mathematically equivalent to any of the following conditions:

1. $P(L_1 \mid C_1) > 1 - P(L_0 \mid C_0)$: The sensitivity must be larger than one minus the specificity. Equivalently, the probability of detecting a positive-class instance must be larger than the probability of mislabeling a negative-class instance. This condition explains why, for a useful classifier, the ROC always lies above the main diagonal in a plot of $P(L_1 \mid C_1)$ versus $P(L_1 \mid C_0)$.

2. $P(L_1 \mid C_0) + P(L_0 \mid C_1) < 1$: A reformulation of condition 1 in terms of Type-I and Type-II error rates.

3. $P(C_1 \mid L_1) > \pi$: The precision must be larger than the prevalence. In other words, the probability for an instance to belong to the positive class must be larger within the subset of instances with a positive label than within the entire population. If this is not the case, the classifier adds no useful information. Similarly: $P(C_0 \mid L_0) > \bar{\pi}$.

4. $P(L_1 \mid C_1) > \lambda$: This follows from condition 3 and Bayes’ theorem for $P(C_1 \mid L_1)$. The probability of encountering a positively labeled instance must be larger within the subset of positive-class instances than within the entire population. Similarly: $P(L_0 \mid C_0) > \bar{\lambda}$, $P(L_1 \mid C_0) < \lambda$, and $P(L_0 \mid C_1) < \bar{\lambda}$.

5. $P(C_0 \mid L_1) < \bar{\pi}$: The probability of encountering a negative-class instance must be smaller within the subset of positively labeled instances than within the entire population. Similarly: $P(C_1 \mid L_0) < \pi$.

The classifier usefulness condition also puts a bound on the accuracy:

$$\text{accuracy} = P(L_1 \mid C_1)\, \pi + P(L_0 \mid C_0)\, \bar{\pi} > \min(\pi, \bar{\pi}),$$

which follows by substituting $P(L_0 \mid C_0) > 1 - P(L_1 \mid C_1)$ into the accuracy formula.
It is straightforward to verify that the majority and minority classifiers defined earlier both fail all of the above usefulness conditions. As a counterexample, imagine starting with the majority classifier and flipping the label on a single positive instance. Thus, this classifier sets $L = 1$ on one instance and $L = 0$ on all other instances. It has 100% precision on positive labels ($P(C_1 \mid L_1) = 1 > \pi$). For negative labels the precision is $P(C_0 \mid L_0) = \bar{\pi}\, N / (N - 1)$, with $N$ the total number of instances. Hence this classifier passes the usefulness condition, although barely in the case of $P(C_0 \mid L_0)$ and large $N$.
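This counterexample is easy to check numerically; the values of $N$ and the prevalence below are arbitrary:

```python
# Numerical check of the counterexample: majority classifier with one
# positive instance's label flipped. N and pi are made up for illustration.
N = 1000
pi = 0.2
n_pos, n_neg = int(pi * N), int((1 - pi) * N)

# One positive instance gets label 1; all other instances get label 0.
tpr = 1 / n_pos  # barely above zero, but nonzero
tnr = 1.0        # no negative instance is ever labeled positive

ppv = 1.0            # the single positive label is correct
npv = n_neg / (N - 1)  # = (1 - pi) * N / (N - 1)

assert tpr > 1 - tnr  # usefulness condition: sensitivity > 1 - specificity
assert ppv > pi       # precision exceeds prevalence
assert npv > 1 - pi   # barely, for large N

print(round(npv, 6))  # → 0.800801
```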
Performance metric estimation
The performance metrics can be estimated from a two-by-two contingency table known as the confusion matrix. It is obtained by applying the classifier to a test data set different from the training data set. Here is standard notation for this matrix:

            C = 1                  C = 0
    L = 1   TP (true positives)    FP (false positives)
    L = 0   FN (false negatives)   TN (true negatives)

The rows correspond to class labels, the columns to true classes. We then have the following approximations:

$$P(L_1 \mid C_1) \approx \frac{TP}{TP + FN}, \qquad P(L_0 \mid C_0) \approx \frac{TN}{TN + FP}, \qquad \pi \approx \frac{TP + FN}{N},$$

$$P(C_1 \mid L_1) \approx \frac{TP}{TP + FP}, \qquad P(C_0 \mid L_0) \approx \frac{TN}{TN + FN}, \qquad \text{accuracy} \approx \frac{TP + TN}{N},$$

where $N = TP + FP + FN + TN$ is the total number of test instances.
There is no simple expression for the AUROC, since the latter requires integration under the curve of true versus false positive rates.
Note that the relations between probabilities derived in the previous sections also hold between their estimates.
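Putting the estimation recipe into code, with made-up counts:

```python
# Estimating the performance metrics from confusion matrix counts.
# The counts below are made up for illustration.
TP, FP, FN, TN = 30, 10, 20, 140
N = TP + FP + FN + TN

sensitivity = TP / (TP + FN)  # P(L1|C1)
specificity = TN / (TN + FP)  # P(L0|C0)
prevalence  = (TP + FN) / N   # pi
precision   = TP / (TP + FP)  # P(C1|L1)
npv         = TN / (TN + FN)  # P(C0|L0)
accuracy    = (TP + TN) / N
queue_rate  = (TP + FP) / N   # P(L1)

# The relations between probabilities also hold between the estimates:
assert abs(queue_rate - (sensitivity * prevalence
                         + (1 - specificity) * (1 - prevalence))) < 1e-12

print(round(sensitivity, 2), round(specificity, 2), accuracy)  # → 0.6 0.93 0.85
```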
Effect of prevalence on classifier training and testing
In this post I have tried to be careful in pointing out which performance metrics depend on the prevalence, and which don’t. The reason is that prevalence is a population or sample property, whereas here we are interested in classifier properties. Of course, realworld classifiers must be trained and tested on finite samples, and this has implications for the actual effect of prevalence on classifier performance:

- When training a classifier, changing the prevalence in the training data may affect properties such as the sensitivity and specificity, depending on the characteristics of the training algorithm.
- When testing a trained classifier, changing the prevalence in the test data will not affect properties such as the sensitivity and specificity (within statistical fluctuations).
To illustrate this point I went back to a previous post on predicting flight delays with a random forest. The prevalence of the given flight delay data set is about 20%, but can be increased by randomly dropping class 0 instances (non-delayed flights). The table below shows the effect of changing the prevalence in this way in the training data:
Training set prevalence   Sensitivity   Specificity   Accuracy   AUROC
0.40                      0.16          0.96          0.72       0.72
0.42                      0.29          0.91          0.72       0.72
0.44                      0.46          0.82          0.71       0.72
0.46                      0.58          0.72          0.68       0.71
0.48                      0.63          0.67          0.66       0.72
0.50                      0.73          0.57          0.62       0.71
To be clear, for each row it is the same classifier (same hyperparameters: number of trees, maximum tree depth, etc.) that is being fit to a training data set with varying prevalence. The sensitivity, specificity, accuracy, and AUROC are then measured on a testing data set with a constant prevalence of 30%. Note how the sensitivity and specificity, as measured on an independent test data set, vary with training set prevalence. The AUROC, on the other hand, remains stable.
Next, let’s look at the effect of changing the prevalence in the testing data:
Testing set prevalence   Sensitivity   Specificity   Accuracy   AUROC
0.20                     0.50          0.78          0.73       0.72
0.30                     0.48          0.80          0.70       0.72
0.40                     0.45          0.82          0.67       0.72
0.50                     0.46          0.80          0.63       0.71
0.60                     0.50          0.79          0.61       0.71
Here the prevalence in the training data was set at 0.45. The sensitivity and specificity appear much more stable; residual variations may be due to the effect on the training data set of the procedure used to vary the testing set prevalence.
Training data sets are often unbalanced, with prevalences even lower than in the example discussed here. Modules such as scikit-learn’s `RandomForestClassifier` provide a `class_weight` option, which can be set to `balanced` or `balanced_subsample` in order to assign a larger weight to the minority class. This reweighting is applied both to the Gini criterion used for finding splits at decision tree nodes and to the class votes in the terminal nodes. The sensitivity and specificity of a classifier, as well as its accuracy, will depend on the setting of the `class_weight` option. For scoring purposes when optimizing a classifier, the AUROC tends to be a much more robust property.
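To make the reweighting idea concrete, here is a minimal sketch of a class-weighted Gini impurity at a single tree node; it illustrates the principle behind `class_weight="balanced"` but is not scikit-learn’s actual implementation:

```python
# Sketch of class-weighted Gini impurity at a decision tree node.
# Illustrates the idea behind class_weight="balanced"; this is NOT
# scikit-learn's actual implementation.
def gini(counts, weights):
    # counts[c]: number of instances of class c at the node
    # weights[c]: weight assigned to class c
    total = sum(counts[c] * weights[c] for c in counts)
    return 1 - sum((counts[c] * weights[c] / total) ** 2 for c in counts)

counts = {0: 90, 1: 10}  # unbalanced node: 10% positives

unweighted = gini(counts, {0: 1.0, 1: 1.0})

# "balanced": weight inversely proportional to class frequency
n, k = sum(counts.values()), len(counts)
balanced = {c: n / (k * counts[c]) for c in counts}
weighted = gini(counts, balanced)

print(round(unweighted, 2), round(weighted, 2))  # → 0.18 0.5
```

With balanced weights the node looks maximally impure (Gini 0.5), so splits that separate the rare positive class become much more attractive to the training algorithm.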