ROC vs PRC
In classification problems, AUC (Area Under the Curve) is one of the most important evaluation metrics for measuring a model’s performance. However, do you know which “curve” AUC refers to? The default choice is typically AUC-ROC (Receiver Operating Characteristic). However, there is another common choice: AUC-PRC (Precision-Recall Curve). In this article, we will learn their differences and their application scenarios.
What is a confusion matrix?
It is easier to introduce some notation using a confusion matrix. A confusion matrix is a 2-by-2 table that shows all the combinations of actual labels and predicted results: {True Positive (TP), False Negative (FN), False Positive (FP), True Negative (TN)}. From these, we derive some more notation:
- Predicted Positive: PP = TP + FP
- Predicted Negative: PN = FN + TN
- Actual Positive: AP = TP + FN
- Actual Negative: AN = FP + TN
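As a minimal sketch of the notation above, the counts can be computed directly from two lists of 0/1 labels. The function name `confusion_counts` and the toy data are hypothetical, chosen only for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "TP": tp, "FN": fn, "FP": fp, "TN": tn,
        "PP": tp + fp,  # predicted positive
        "PN": fn + tn,  # predicted negative
        "AP": tp + fn,  # actual positive
        "AN": fp + tn,  # actual negative
    }

# Toy data (an assumption for illustration): 3 actual positives, 5 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```

Note that PP + PN and AP + AN both equal the total number of samples, which is a quick sanity check on any confusion matrix.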
What is ROC?
ROC was first developed by electrical engineers and radar engineers during World War II. It is now used in machine learning to measure the performance of a classification model across various probability threshold settings. To understand the ROC curve, we need to know its two axes:
- x axis: False positive rate under different probability thresholds. The formula is FP/AN.
- y axis: True positive rate under different probability thresholds. The formula is TP/AP.
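The two axes can be sketched in a few lines of Python. This is only an illustration: the function name `roc_point`, the rule "predict positive when score >= threshold", and the toy scores are all assumptions, not part of any library.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for t, s in pairs if t == 1 and s >= threshold)
    fp = sum(1 for t, s in pairs if t == 0 and s >= threshold)
    ap = sum(1 for t in y_true if t == 1)  # actual positives (constant)
    an = len(y_true) - ap                  # actual negatives (constant)
    return fp / an, tp / ap

# Toy data (an assumption for illustration): 3 positives, 3 negatives.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

# Lowering the threshold can only add predicted positives,
# so neither FPR nor TPR ever decreases along this loop.
for thr in (0.85, 0.65, 0.5, 0.1):
    fpr, tpr = roc_point(y_true, scores, thr)
    print(f"threshold={thr:.2f}  FPR={fpr:.3f}  TPR={tpr:.3f}")
```

Each threshold gives one point on the ROC curve; sweeping the threshold from high to low traces the curve from (0, 0) towards (1, 1).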
Observing the ROC curve, we can see 2 traits:
- y is monotonically non-decreasing in x. (The key idea of the proof is to show that both y and x are monotonic in the probability threshold: as the probability threshold goes up (down), there will be fewer (more) FP and also fewer (more) TP. Since AN and AP are constant for given data, both y and x will go down (up) according to their formulas. It is then straightforward to show that y is monotonic in x.)
- For highly imbalanced data, either x or y is NOT sensitive to changes in a classification model’s performance. (Highly imbalanced data has either a very large AN or a very large AP. Given that, a change in FP or TP barely moves the corresponding rate, so either x or y changes too little to reflect the change in the model’s performance.)
Based on these 2 traits above, we can derive the following conclusions on the ROC curve:
- It is OK to linearly interpolate between points of the ROC curve
- The ROC curve bends upward smoothly
- For highly imbalanced data, the change in the ROC curve is NOT apparent across different algorithms. Thus, AUC-ROC is NOT appropriate for measuring the performance of a classification model on highly imbalanced data.
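Since linear interpolation between ROC points is acceptable, AUC-ROC can be approximated with the trapezoidal rule. The sketch below assumes one ROC point per distinct score and uses hypothetical helper names (`roc_curve_points`, `trapezoid_auc`) and toy data.

```python
def roc_curve_points(y_true, scores):
    """ROC points (FPR, TPR), one per distinct score used as a threshold."""
    ap = sum(y_true)           # actual positives (labels are 0/1)
    an = len(y_true) - ap      # actual negatives
    points = [(0.0, 0.0)]      # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        points.append((fp / an, tp / ap))
    return points

def trapezoid_auc(points):
    """Area under the piecewise-linear curve through points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy data (an assumption for illustration): 3 positives, 3 negatives.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_curve_points(y_true, scores)
auc_roc = trapezoid_auc(points)
print(points)
print(auc_roc)
```

For this toy data the area is 8/9 (about 0.889), which matches the fraction of (positive, negative) pairs in which the positive sample receives the higher score: 8 of the 9 pairs.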
What is PRC?
PRC is more intuitive to understand than ROC, in my opinion. It shows the trade-off between precision and recall under different probability thresholds. Its axes are:
- x axis: Recall (i.e. true positive rate) under different probability thresholds. The formula is TP/AP.
- y axis: Precision under different probability thresholds. The formula is TP/PP.
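A minimal sketch of these two axes, again with a hypothetical function name (`pr_point`), an assumed "score >= threshold" decision rule, and toy data. Precision is undefined when nothing is predicted positive (PP = 0), so the sketch returns `None` in that case.

```python
def pr_point(y_true, scores, threshold):
    """Return (recall, precision) for score >= threshold, or None if PP == 0."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    pp = sum(1 for s in scores if s >= threshold)  # predicted positives
    ap = sum(1 for t in y_true if t == 1)          # actual positives
    if pp == 0:
        return None  # precision TP/PP is undefined
    return tp / ap, tp / pp

# Toy data (an assumption for illustration): 3 positives, 3 negatives.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for thr in (0.85, 0.65, 0.5, 0.1):
    recall, precision = pr_point(y_true, scores, thr)
    print(f"threshold={thr:.2f}  recall={recall:.3f}  precision={precision:.3f}")
```

On this toy data, lowering the threshold from 0.85 to 0.65 to 0.5 moves precision from 1.0 down to 2/3 and then back up to 0.75, while recall never decreases.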
Observing the PRC, we can see 2 traits:
- y is NOT monotonic in x. (The key idea of the proof is to show that y is not monotonic in the probability threshold, while x, being the true positive rate, is: e.g. as the probability threshold goes up, there will be fewer TP and fewer PP, but it is not clear which decreases proportionally more, so y could go up or down depending on the value of the probability threshold.)
- For highly imbalanced data, y is still sensitive to changes in a classification model’s performance. (The denominator of y is PP = TP + FP rather than AN, so a change in FP is not diluted by a very large number of actual negatives.)
Based on these 2 traits above, we can derive the following conclusions on the PRC:
- It is WRONG to linearly interpolate between points of the PRC
- The PRC bends downward with zigzags
- For highly imbalanced data, the change in PRC is still apparent for different algorithms. Thus, AUC-PRC is appropriate to measure the performance of a classification model for highly imbalanced data.
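To make the imbalance argument concrete, here is a small worked example with hypothetical counts (100 actual positives, 100,000 actual negatives, and two made-up models that differ only in their number of false positives). The helper name `rates` is also an assumption for illustration.

```python
def rates(tp, fp, ap, an):
    """ROC and PRC coordinates computed from raw counts."""
    return {
        "fpr": fp / an,               # ROC x axis
        "tpr": tp / ap,               # ROC y axis, also PRC x axis (recall)
        "precision": tp / (tp + fp),  # PRC y axis
    }

AP, AN = 100, 100_000  # hypothetical, highly imbalanced data

# Both models find 80 of the 100 positives; model B raises 4x as many false alarms.
model_a = rates(tp=80, fp=100, ap=AP, an=AN)
model_b = rates(tp=80, fp=400, ap=AP, an=AN)

fpr_gap = model_b["fpr"] - model_a["fpr"]
precision_gap = model_a["precision"] - model_b["precision"]
print(f"FPR:       A={model_a['fpr']:.4f}  B={model_b['fpr']:.4f}  gap={fpr_gap:.4f}")
print(f"Precision: A={model_a['precision']:.4f}  B={model_b['precision']:.4f}  gap={precision_gap:.4f}")
```

With these counts the false positive rate moves only from 0.001 to 0.004, a shift that is nearly invisible on a ROC plot, while precision drops from about 0.44 to about 0.17.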
The relationship between ROC and PRC
This section is based on the analysis of Davis and Goadrich (2006). Please refer to their paper if you are interested in the details.
- For a given dataset, ROC and PRC have a one-to-one correspondence
- Algorithm B’s ROC curve dominates algorithm A’s (i.e. lies above A’s ROC curve) ⇔ Algorithm B’s PRC dominates algorithm A’s
- Algorithm A optimizes AUC-ROC ⇏ Algorithm A optimizes AUC-PRC. The picture below shows a counterexample: Curves I and II are the same in terms of AUC-ROC but different in terms of AUC-PRC.
When should we use AUC-ROC, and when AUC-PRC?
The analysis above helps derive the conclusion below:
- For mildly imbalanced (or balanced) data, use AUC-ROC
- For highly imbalanced data, use AUC-PRC
However, there is one drawback of AUC-PRC when commercialising algorithms (to customers): there is no general rule for telling whether an algorithm is good enough based on AUC-PRC. Because AUC-PRC does not have a fixed reference level to compare the model against, an algorithm with AUC-PRC above 0.3 may be good enough in one problem but not in another. Since AUC-ROC has a clear reference level (i.e. purely random assignment, which gives an AUC-ROC of 0.5), it is easier to commercialise algorithms with it. For example, an algorithm with AUC-ROC above 0.85 is usually considered good.
To alleviate this problem, there is an alternative approach for highly imbalanced data. (Of course, there are many other approaches, such as downsampling, upsampling and weighted loss functions.) We can still use AUC-ROC, but combine it with (the optimal) F1 score and (the optimal) Cohen’s Kappa score in the model evaluation:
- F1 score: evaluates the model’s performance from the perspective of precision and recall
- Cohen’s Kappa score: evaluates the model’s performance against random assignment according to the prevalence rate of the data
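Both scores can be computed from the confusion matrix at a single threshold. The sketch below uses the standard definitions, F1 = 2TP / (2TP + FP + FN) and kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function names and counts are hypothetical, and the sketch assumes the denominators are non-zero. Finding the "optimal" scores would mean repeating this over many thresholds and keeping the best value, which is not shown here.

```python
def f1_score_from_counts(tp, fn, fp):
    """Harmonic mean of precision and recall, written in terms of counts."""
    return 2 * tp / (2 * tp + fp + fn)

def cohen_kappa_from_counts(tp, fn, fp, tn):
    """Agreement between predictions and labels, corrected for chance agreement."""
    n = tp + fn + fp + tn
    ap, an = tp + fn, fp + tn   # actual positive / negative
    pp, pn = tp + fp, fn + tn   # predicted positive / negative
    p_observed = (tp + tn) / n
    p_chance = (ap * pp + an * pn) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical counts at one threshold on imbalanced data (100 positives, 10,000 negatives).
tp, fn, fp, tn = 80, 20, 100, 9900
f1 = f1_score_from_counts(tp, fn, fp)
kappa = cohen_kappa_from_counts(tp, fn, fp, tn)
accuracy = (tp + tn) / (tp + fn + fp + tn)
print(f"accuracy={accuracy:.3f}  F1={f1:.3f}  kappa={kappa:.3f}")
```

With these counts accuracy is close to 0.99, yet both F1 and kappa come out far lower, because neither score rewards the model for the large number of easy true negatives.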
Reference
Davis, J., & Goadrich, M. (2006, June). The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (pp. 233–240).