# Taking the Confusion out of the Confusion Matrix

July 26, 2016

[Revised 21 September 2017, 24 December 2019]

## Introduction

I always thought that the confusion matrix was rather aptly named, a reference not so much to the mixed performance of a classifier as to my own bewilderment at the number of measures of that performance. Recently however, I encountered a brief mention of the possibility of a Bayesian interpretation of performance measures, and this inspired me to explore the idea a little further. It’s not that Bayes’ theorem is needed to understand or apply the performance measures, but it acts as an organizing and connecting principle, which I find helpful. New concepts are easier to remember if they fit inside a good story.

Another advantage of taking the Bayesian route is that this forces us to view performance measures as probabilities, which are estimated from the confusion matrix. Elementary presentations tend to define performance metrics in terms of ratios of confusion matrix elements, thereby ignoring the effect of statistical fluctuations.

Bayes’ theorem is not the only way to generate performance metrics. One can also start from joint probabilities or likelihood ratios. The next three subsections describe these approaches one by one. This is followed by a subsection discussing a minimal condition for a classifier to be useful, a subsection on estimating performance measures, and one on the effect of changing the prevalence in the training and testing data sets of a classifier.

## Bayes’ theorem

Let’s start from Bayes’ theorem in its general form:

$$p(\lambda\mid X) = \frac{p(X\mid\lambda)\,\pi(\lambda)}{\int p(X\mid\lambda')\,\pi(\lambda')\,d\lambda'}$$

Here $X$ represents the observed data and $\lambda$ a parameter of interest. On the left-hand side is the posterior probability density of $\lambda$ given $X$. On the right-hand side, $p(X\mid\lambda)$ is the likelihood, or conditional probability density of $X$ given $\lambda$, and $\pi(\lambda)$ is the prior probability density of $\lambda$. The denominator is called model evidence, or marginal likelihood. One way to think about Bayes’ theorem is that it uses the data $X$ to update the prior information $\pi(\lambda)$ about $\lambda$, and returns the posterior $p(\lambda\mid X)$.

For a binary classifier $\lambda$ is the true class to which instance $X$ belongs and can take only two values, say 0 and 1. Bayes’ theorem then simplifies to:

$$p(\lambda\mid X) = \frac{p(X\mid\lambda)\,\pi(\lambda)}{p(X\mid\lambda\!=\!0)\,\pi(\lambda\!=\!0) + p(X\mid\lambda\!=\!1)\,\pi(\lambda\!=\!1)}$$

To describe the performance of a classifier, $X$ could be summarized either by the class label $\ell$ assigned to it by the classifier, or by the score $q$ (a number between 0 and 1) assigned by the classifier to the hypothesis that $X$ belongs to class 1. Thus for example, $p(\lambda\!=\!1 \mid \ell\!=\!0)$ represents the posterior probability that the true class is 1 given a class label of 0, and $p(\lambda\!=\!1\mid q\!\ge\! q_{T})$ is the posterior probability that the true class is 1 given that the score $q$ is above a pre-specified threshold $q_{T}$.

Next we’ll take a look at each component of Bayes’ theorem in turn: the prior, the likelihood, the model evidence, and finally the posterior. Each of these maps to a classifier performance measure or a population characteristic.

### The prior

To simplify the notation let’s write $\lambda_{0}$ to signify $\lambda=0$, $\lambda_{1}$ for $\lambda=1$, and similarly with other variables. The first ingredient in the computation of the posterior is the prior $\pi(\lambda)$. To fix ideas, let’s assume that $\lambda_{1}$ is the class of interest, generically labeled “positive”; it indicates “effect”, “signal”, “disease”, “fraud”, or “spam”, depending on the context. Then $\lambda_{0}$ indicates the lack of these things and is generically labeled “negative”. To fully specify the prior we just need $\pi(\lambda_{1})$, since $\pi(\lambda_{0}) = 1 - \pi(\lambda_{1})$. The prior probability $\pi(\lambda_{1})$ of drawing a positive instance from the population is called the prevalence. It is a property of the population, not the classifier.

To simplify the notation even further let’s use $\pi_{1}$ for $\pi(\lambda=1)$ and $\pi_{0}$ for $\pi(\lambda=0)$.

### The likelihood

The next ingredient is the conditional probability $p(X\mid\lambda)$. For now let’s work with classifier labels and replace $X$ by $\ell$. This leads to the following four definitions:

• The true positive rate, also known as sensitivity or recall: $S_{e} \equiv p(\ell_{1}\mid \lambda_{1})$,

• The true negative rate, also known as specificity: $S_{p}\equiv p(\ell_{0}\mid \lambda_{0})$,

• The false-positive or Type-I-error rate: $\alpha\equiv p(\ell_{1}\mid\lambda_{0}) = 1 - p(\ell_{0}\mid\lambda_{0})$, and

• The false-negative or Type-II-error rate: $\beta\equiv p(\ell_{0}\mid\lambda_{1}) = 1 - p(\ell_{1}\mid\lambda_{1})$.

For example, $p(\ell_{1}\mid \lambda_{0})$ is the conditional probability that the classifier assigns a label of 1 to an instance actually belonging to class 0. Note that the above four rates are independent of the prevalence and are therefore intrinsic properties of the classifier.

When viewed as a function of true class $\lambda$, for fixed label $\ell$, the conditional probability $p(\ell\mid\lambda)$ is known as the likelihood function. This functional aspect of $p(\ell\mid\lambda)$ is not really used here, but it is good to remember the terminology.

### The model evidence

Finally, we need the normalization factor in Bayes’ theorem, the model evidence. For a binary classifier this has two components, corresponding to the two possible labels:

• The positive labeling rate: $p(\ell_{1}) = p(\ell_{1}\mid\lambda_{0})\,\pi_{0} + p(\ell_{1}\mid\lambda_{1})\,\pi_{1}$, and

• The negative labeling rate: $p(\ell_{0}) = p(\ell_{0}\mid\lambda_{0})\,\pi_{0} + p(\ell_{0}\mid\lambda_{1})\,\pi_{1}$.

The positive labeling rate is more commonly known as the queue rate, especially when it is calculated as the fraction of instances for which the classifier score $q$ is above a pre-specified threshold $q_{T}$.

Again to simplify the notation I’ll write $p_{0}$ for $p(\ell_{0})$ and $p_{1}$ for $p(\ell_{1})$ in the following.

### The posterior

Armed with the prior probability, the likelihood, and the model evidence, we can compute the posterior probability from Bayes’ theorem. There are four combinations of truth and label:

• The positive predictive value, also known as precision: ${\rm ppv}\equiv p(\lambda_{1}\mid\ell_{1}) = \frac{p(\ell_{1}\mid\lambda_{1})\, \pi_{1}}{p_{1}}$,

• The negative predictive value: ${\rm npv}\equiv p(\lambda_{0}\mid\ell_{0}) = \frac{p(\ell_{0}\mid\lambda_{0})\, \pi_{0}}{p_{0}}$,

• The false discovery rate: ${\rm fdr}\equiv p(\lambda_{0}\mid\ell_{1}) = 1 - {\rm ppv}$, and

• The false omission rate: ${\rm for}\equiv p(\lambda_{1}\mid\ell_{0}) = 1 - {\rm npv}$.

The precision, for example, quantifies the predictive value of a positive label by answering the question: If we see a positive label, what is the (posterior) probability that the true class is positive? (Note that the predictive values are not intrinsic properties of the classifier since they depend on the prevalence. Sometimes they are “standardized” by evaluating them at $\pi_{0} = \pi_{1} = 1/2$.)

Figure 1 shows how the posterior probabilities ${\rm ppv}$, ${\rm npv}$, ${\rm fdr}$ and ${\rm for}$ are related to the likelihoods $S_{e}$, $S_{p}$, $\alpha$ and $\beta$ via Bayes’ theorem.

Figure 1: For each combination of class and label, Bayes' theorem connects the corresponding performance metrics of a classifier.

### Summary

This section introduced twelve quantities:

$$\pi_{0},\ \pi_{1},\ S_{e},\ S_{p},\ \alpha,\ \beta,\ p_{0},\ p_{1},\ {\rm ppv},\ {\rm npv},\ {\rm fdr},\ {\rm for}.$$

Only three of these are independent, say $\pi_{1}$, $\alpha$ and $\beta$. To see this, note that by virtue of their definition the first six quantities satisfy three conditions:

$$\pi_{0} + \pi_{1} = 1,\qquad \alpha + S_{p} = 1,\qquad \beta + S_{e} = 1,$$

and the last six can be expressed in terms of the first six: for $p_{0}$ and $p_{1}$ we have:

$$p_{1} = \alpha\,\pi_{0} + S_{e}\,\pi_{1},\qquad p_{0} = S_{p}\,\pi_{0} + \beta\,\pi_{1},$$

and Bayes’ theorem takes care of the remaining four (Figure 1).

The final number of independent quantities matches the number of degrees of freedom of a two-by-two contingency table such as the confusion matrix, which is used to estimate all twelve quantities (see below). It also tells us that we only need two numbers to fully characterize a binary classifier (since $\pi_{1}$ is not a classifier property). As trivial examples, consider the majority and minority classifiers. Assume that class 0 is more prevalent than class 1. Then the majority classifier assigns the label 0 to every instance and has $\alpha=0$ and $\beta=1$. The minority classifier assigns the label 1 to every instance and has $\alpha=1$ and $\beta=0$. These classifiers are of course not very useful. As we will show below, for a classifier to be useful, it must satisfy $\alpha + \beta \lt 1$.
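The counting argument above is easy to make concrete. The following sketch (a minimal illustration, with made-up input values) starts from the three independent quantities $\pi_{1}$, $\alpha$ and $\beta$ and derives the other nine from the relations of this section:

```python
# Derive all twelve quantities of this section from the three
# independent ones: the prevalence pi1, the false-positive rate
# alpha, and the false-negative rate beta.
def classifier_metrics(pi1, alpha, beta):
    pi0 = 1 - pi1                  # complement of the prevalence
    se = 1 - beta                  # sensitivity (true positive rate)
    sp = 1 - alpha                 # specificity (true negative rate)
    p1 = alpha * pi0 + se * pi1    # positive labeling rate (model evidence)
    p0 = sp * pi0 + beta * pi1     # negative labeling rate
    ppv = se * pi1 / p1            # precision, via Bayes' theorem
    npv = sp * pi0 / p0            # negative predictive value
    return {"pi0": pi0, "pi1": pi1, "se": se, "sp": sp,
            "alpha": alpha, "beta": beta, "p0": p0, "p1": p1,
            "ppv": ppv, "npv": npv, "fdr": 1 - ppv, "for": 1 - npv}

# Illustrative numbers: a rare positive class with a decent classifier.
m = classifier_metrics(pi1=0.1, alpha=0.05, beta=0.2)
```

Note how even with $\alpha$ as low as 0.05, the rarity of the positive class drags the precision well below the sensitivity ($\rm ppv \approx 0.64$ here), a point the next sections return to.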

## Joint probabilities

A couple of joint probabilities lead to performance metrics with very different properties.

Consider first the joint probability of class label and true class, which can be decomposed in terms of likelihood and prior:

$$p(\ell, \lambda) = p(\ell\mid\lambda)\,\pi(\lambda).$$

Projecting this joint probability on the axis “$\ell=\lambda$” yields the accuracy, or the probability to correctly identify any instance, negative or positive:

$$A \equiv p(\ell\!=\!\lambda) = p(\ell_{0}\mid\lambda_{0})\,\pi_{0} + p(\ell_{1}\mid\lambda_{1})\,\pi_{1} = S_{p}\,\pi_{0} + S_{e}\,\pi_{1}.$$

The last expression on the right shows that the accuracy depends on the prevalence and is therefore not an intrinsic property of the classifier. An equivalent measure is the misclassification rate, defined as one minus the accuracy. A benchmark that is sometimes used is the null error rate, defined as the misclassification rate of a classifier that always predicts the majority class. It is equal to $\min\{\pi_{1}, 1-\pi_{1}\}$. The complement of the null error rate is the null accuracy, which is equal to $\max\{\pi_{1}, 1-\pi_{1}\}$.
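A quick numerical check (illustrative numbers only) shows why accuracy can mislead at low prevalence: a classifier with $\alpha + \beta \lt 1$ can still score below the null accuracy.

```python
# Accuracy A = Sp*pi0 + Se*pi1 depends on the prevalence pi1.
# Hypothetical classifier with Se = 0.6, Sp = 0.5 (so alpha + beta < 1).
se, sp = 0.6, 0.5
pi1 = 0.1                           # rare positive class
accuracy = sp * (1 - pi1) + se * pi1
null_accuracy = max(pi1, 1 - pi1)   # always predict the majority class
# accuracy ~ 0.51, null accuracy 0.9: worse than the trivial baseline
```

At $\pi_{1} = 0.5$ the same classifier would beat the null accuracy; only the prevalence changed.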

The second joint probability to consider is that of the scores assigned by the classifier to two independently drawn instances, one from the positive class and one from the negative class. A good measure of performance is the probability that the positive instance is scored higher than the negative instance. This measure equals the Area Under the Receiver Operating Characteristic (AUROC), a curve of the true positive rate as a function of the false positive rate (or sensitivity versus one-minus-specificity). By construction, the AUROC is independent of prevalence, and does not require a choice of threshold $q_{T}$ on the classifier score $q$. In other words, it does not depend on the chosen operating point of the classifier.

There is actually a relationship between the AUROC and the accuracy, via an expectation. Imagine an ensemble of binary classifiers like the one considered here, but with thresholds $q_{T}$ distributed like the $q$ scores of the instances in the population we are trying to classify. The expected (i.e. average) accuracy of the classifiers in this ensemble is given by:

$$\langle A\rangle = \frac{\pi_{0}^{2} + \pi_{1}^{2}}{2} + 2\,\pi_{0}\,\pi_{1}\,{\rm AUROC},$$

and is thus linearly related to the AUROC. In a sense, the AUROC aggregates the classifier accuracy information in a way that’s independent of prevalence (see this answer by Peter Flach on Quora for a more detailed explanation).
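The probabilistic reading of the AUROC suggests a direct way to estimate it, without constructing the ROC curve at all: compare the scores of every (positive, negative) pair. A minimal sketch, counting ties as half:

```python
# Estimate the AUROC as the probability that a randomly drawn
# positive instance scores higher than a randomly drawn negative one.
def auroc_pairwise(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Toy scores: 3 of the 4 (positive, negative) pairs are correctly ordered.
print(auroc_pairwise([0.9, 0.6], [0.7, 0.4]))  # 0.75
```

This brute-force version is $O(n_{+} n_{-})$; for large samples one would sort the scores and use ranks (the Mann-Whitney U statistic), but the probability being estimated is the same.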

## Likelihood ratios

In some fields it is customary to work with ratios of probabilities rather than with the probabilities themselves. Thus one defines:

• The positive likelihood ratio: ${\rm lr+} \equiv \frac{p(\ell_{1}\mid\lambda_{1})}{p(\ell_{0}\mid\lambda_{1})} = \frac{S_{e}}{1-S_{e}}$, which represents the odds of a positive label if the true class is positive (in medical statistics for example, this would be the odds of a positive test if the patient has the disease).

• The negative likelihood ratio: ${\rm lr-} \equiv \frac{p(\ell_{1}\mid\lambda_{0})}{p(\ell_{0}\mid\lambda_{0})} = \frac{1-S_{p}}{S_{p}}$, which represents the odds of a positive label if the true class is negative (in medical statistics, the odds of a positive test if the patient does not have the disease).

Finally, it is instructive to take the ratio of these two likelihood ratios. This yields

• The diagnostic odds ratio: ${\rm dor} \equiv \frac{\rm lr+}{\rm lr-} = \frac{S_{e}}{1-S_{e}}\frac{S_{p}}{1-S_{p}}$.

If ${\rm dor} \gt 1$, the classifier is deemed useful, since this means that, in our medical example, the odds of a positive test are higher for a patient with the disease than for a healthy patient. If ${\rm dor}=1$ the classifier is useless, but if ${\rm dor}\lt 1$ the classifier is worse than useless, it is misleading. In this case one would be better off swapping the labels on the classifier output.

Note that the ratios ${\rm lr+}$, ${\rm lr-}$, and ${\rm dor}$ are all independent of prevalence.
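These ratios, using the definitions above (odds of a positive label within each class), translate into a few lines of code. Illustrative numbers, not from any real study:

```python
# Likelihood ratios as defined above: odds of a positive label
# within the positive class (lr+) and within the negative class (lr-).
def likelihood_ratios(se, sp):
    lr_plus = se / (1 - se)        # Se / beta
    lr_minus = (1 - sp) / sp       # alpha / Sp
    return lr_plus, lr_minus, lr_plus / lr_minus  # third value is dor

# A classifier with Se = 0.9 and Sp = 0.8: dor = 9 / 0.25 = 36 > 1.
lr_plus, lr_minus, dor = likelihood_ratios(se=0.9, sp=0.8)
```

Since only $S_{e}$ and $S_{p}$ enter, the prevalence-independence of all three ratios is immediate.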

## When is a classifier useful?

The usefulness condition $\boxed{{\rm dor} \gt 1}$ is mathematically equivalent to any of the following conditions:

1. $\boxed{S_{e}\gt 1 - S_{p}}$ The sensitivity must be larger than one minus the specificity. Equivalently, the probability of detecting a positive-class instance must be larger than the probability of mislabeling a negative-class instance. This condition explains why, for a useful classifier, the ROC always lies above the main diagonal in a plot of $S_{e}$ versus $1 - S_{p}$.

2. $\boxed{\beta \lt 1 - \alpha}$ A reformulation of condition 1 in terms of Type-I and II error rates.

3. $\boxed{{\rm ppv} \gt \pi_{1}}$ The precision must be larger than the prevalence. In other words, the probability for an instance to belong to the positive class must be larger within the subset of instances with a positive label than within the entire population. If this is not the case, the classifier adds no useful information. Similarly: $\boxed{{\rm npv} \gt \pi_{0}}$.

4. $\boxed{S_{e} \gt p_{1}}$ This follows from condition 3 and Bayes’ theorem for ${\rm ppv}$. The probability of encountering a positively labeled instance must be larger within the subset of positive-class instances than within the entire population. Similarly: $\boxed{S_{p} \gt p_{0}}$, $\boxed{\alpha \lt p_{1}}$, and $\boxed{\beta \lt p_{0}}$.

5. $\boxed{{\rm fdr} \lt \pi_{0}}$ The probability of encountering a negative-class instance must be smaller within the subset of positively labeled instances than within the entire population. Similarly: $\boxed{{\rm for} \lt \pi_{1}}$.

The classifier usefulness condition also puts a bound on the accuracy:

$$A = S_{p}\,\pi_{0} + S_{e}\,\pi_{1} \gt p_{0}\,\pi_{0} + p_{1}\,\pi_{1},$$

which follows from condition 4. The right-hand side is the accuracy of a classifier that assigns labels at random with the same labeling rates $p_{0}$ and $p_{1}$: a useful classifier must beat random labeling.

It is straightforward to verify that the majority and minority classifiers defined earlier both fail all of the above usefulness conditions. For a contrasting example, imagine starting with the majority classifier and flipping the label on a single positive instance. Thus, this classifier sets $\ell=1$ on one $\lambda=1$ instance and $\ell=0$ on all other instances. It has 100% precision on positive labels ($\mbox{ppv}=1$). For negative labels the precision is $\mbox{npv} = \pi_{0}/(1-1/N)$, with $N$ the total number of instances. Hence this classifier passes the usefulness condition, although barely so for $\mbox{npv}$ when $N$ is large.
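The one-flip example can be checked numerically. A short sketch with illustrative counts (800 negatives, 200 positives, so $\pi_{1} = 0.2$):

```python
# One positive instance flipped from the majority classifier:
# tp = 1, fn = N1 - 1, fp = 0, tn = N0 (illustrative counts).
N0, N1 = 800, 200
tp, fn, fp, tn = 1, N1 - 1, 0, N0

ppv = tp / (tp + fp)            # precision = 1.0
npv = tn / (tn + fn)            # 800/999, just above pi0
pi1 = (tp + fn) / (N0 + N1)     # 0.2
pi0 = 1 - pi1                   # 0.8

assert ppv > pi1                # condition 3 holds easily ...
assert npv > pi0                # ... and barely: 0.8008 > 0.8
```

With larger $N$ at fixed $\pi_{0}$, the margin on the ${\rm npv}$ condition shrinks toward zero, as the closed-form expression $\pi_{0}/(1-1/N)$ indicates.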

## Performance metric estimation

The performance metrics can be estimated from a two-by-two contingency table known as the confusion matrix. It is obtained by applying the classifier to a test data set different from the training data set. Here is standard notation for this matrix:

Figure 2: Estimated confusion matrix. The notation tp stands for "number of true positives", fn for "number of false negatives", and so on.

The rows correspond to class labels, the columns to true classes. We then have the following approximations:

$$\hat{S}_{e} = \frac{tp}{tp+fn},\qquad \hat{S}_{p} = \frac{tn}{tn+fp},\qquad \hat{\alpha} = \frac{fp}{tn+fp},\qquad \hat{\beta} = \frac{fn}{tp+fn},$$

$$\hat{\pi}_{1} = \frac{tp+fn}{n},\qquad \hat{p}_{1} = \frac{tp+fp}{n},\qquad \widehat{\rm ppv} = \frac{tp}{tp+fp},\qquad \widehat{\rm npv} = \frac{tn}{tn+fn},\qquad \hat{A} = \frac{tp+tn}{n},$$

where $n = tp+fn+fp+tn$ is the total number of instances in the test set. There is no simple expression for the AUROC, since the latter requires integration under the curve of true versus false positive rates.

Note that the relations between probabilities derived in the previous sections also hold between their estimates.
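These estimates translate directly into code. A minimal helper (the counts and function name are illustrative) that takes the four counts of Figure 2 and returns the main point estimates:

```python
def confusion_matrix_estimates(tp, fn, fp, tn):
    """Point estimates of the performance metrics from the four counts."""
    n = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),   # Se: true positive rate
        "specificity": tn / (tn + fp),   # Sp: true negative rate
        "precision":   tp / (tp + fp),   # ppv
        "npv":         tn / (tn + fn),
        "prevalence":  (tp + fn) / n,    # pi1: a sample property
        "queue_rate":  (tp + fp) / n,    # p1: positive labeling rate
        "accuracy":    (tp + tn) / n,
    }

est = confusion_matrix_estimates(tp=40, fn=10, fp=20, tn=130)
```

Since these are ratios of finite counts, they come with binomial fluctuations, which is one reason to treat them as estimates of probabilities rather than as the probabilities themselves.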

## Effect of prevalence on classifier training and testing

In this post I have tried to be careful in pointing out which performance metrics depend on the prevalence, and which don’t. The reason is that prevalence is a population or sample property, whereas here we are interested in classifier properties. Of course, real-world classifiers must be trained and tested on finite samples, and this has implications for the actual effect of prevalence on classifier performance:

• When training a classifier, changing the prevalence in the training data may affect properties such as the sensitivity and specificity, depending on the characteristics of the training algorithm.

• When testing a trained classifier, changing the prevalence in the test data will not affect properties such as the sensitivity and specificity (within statistical fluctuations).

To illustrate this point I went back to a previous post on predicting flight delays with a random forest. The prevalence of the given flight delay data set is about 20%, but can be increased by randomly dropping class 0 instances (non-delayed flights). The table below shows the effect of changing the prevalence in this way in the training data:

| Training set prevalence | Sensitivity | Specificity | Accuracy | AUROC |
|---|---|---|---|---|
| 0.40 | 0.16 | 0.96 | 0.72 | 0.72 |
| 0.42 | 0.29 | 0.91 | 0.72 | 0.72 |
| 0.44 | 0.46 | 0.82 | 0.71 | 0.72 |
| 0.46 | 0.58 | 0.72 | 0.68 | 0.71 |
| 0.48 | 0.63 | 0.67 | 0.66 | 0.72 |
| 0.50 | 0.73 | 0.57 | 0.62 | 0.71 |

To be clear, for each row it is the same classifier (same hyper-parameters: number of trees, maximum tree depth, etc.) that is being fit to a training data set with varying prevalence. The sensitivity, specificity, accuracy, and AUROC are then measured on a testing data set with a constant prevalence of 30%. Note how the sensitivity and specificity, as measured on an independent test data set, vary with training set prevalence. The AUROC, on the other hand, remains stable.

Next, let’s look at the effect of changing the prevalence in the testing data:

| Testing set prevalence | Sensitivity | Specificity | Accuracy | AUROC |
|---|---|---|---|---|
| 0.20 | 0.50 | 0.78 | 0.73 | 0.72 |
| 0.30 | 0.48 | 0.80 | 0.70 | 0.72 |
| 0.40 | 0.45 | 0.82 | 0.67 | 0.72 |
| 0.50 | 0.46 | 0.80 | 0.63 | 0.71 |
| 0.60 | 0.50 | 0.79 | 0.61 | 0.71 |

Here the prevalence in the training data was set at 0.45. The sensitivity and specificity appear much more stable; residual variations may be due to the effect on the training data set of the procedure used to vary the testing set prevalence.

Training data sets are often unbalanced, with prevalences even lower than in the example discussed here. Modules such as scikit-learn’s `RandomForestClassifier` provide a `class_weight` option, which can be set to `'balanced'` or `'balanced_subsample'` in order to assign a larger weight to the minority class. This re-weighting is applied both to the Gini criterion used for finding splits at decision tree nodes and to class votes in the terminal nodes. The sensitivity and specificity of a classifier, as well as its accuracy, will depend on the setting of the `class_weight` option. For scoring purposes when optimizing a classifier, the AUROC tends to be a much more robust property.
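To make this concrete, here is a sketch using synthetic data and illustrative hyper-parameters (not the flight-delay data of the tables above) that compares a random forest with and without class re-weighting on an unbalanced problem:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic unbalanced problem: roughly 10% positives.
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for cw in (None, "balanced"):
    clf = RandomForestClassifier(n_estimators=100, class_weight=cw,
                                 random_state=0).fit(X_tr, y_tr)
    se = recall_score(y_te, clf.predict(X_te))              # sensitivity
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"class_weight={cw}: sensitivity={se:.2f}, AUROC={auc:.2f}")
```

The exact numbers depend on the generated data, but the qualitative pattern matches the discussion above: the sensitivity shifts with the weighting scheme while the AUROC, being threshold- and prevalence-independent, barely moves.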