Beyond AUC with Average Precision

PURPOSE

The area under the receiver operating characteristic curve (AUC) is commonly used to evaluate and select artificial intelligence (AI) models for radiology. AUC is also usually estimated on artificially balanced or enriched datasets, which yield narrower confidence intervals for a given sample size. In this work, we show that this style of evaluation has reached saturation and propose an alternative evaluation scheme.

METHOD AND MATERIALS

The receiver operating characteristic (ROC) curve plots the false positive rate (1 – specificity) on the x-axis against sensitivity on the y-axis as the model's decision threshold varies. Similarly, the precision-recall curve (PRC) plots recall (sensitivity) on the x-axis against precision (positive predictive value) on the y-axis. AUC is defined as the area under the ROC curve, while average precision (AP) is defined as the area under the PRC. To illustrate the proposed evaluation scheme, we created two different high-performance models to detect fractures in head CT scans. We also created two datasets: one by uniformly sampling scans and one by artificially enriching the sample with fracture scans. AUC and AP were computed for each model-dataset pair. We propose that AP computed on a uniformly sampled dataset is more useful for model selection than the other options.
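The two definitions above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in the study: the function names `roc_auc` and `average_precision` and the toy labels and scores are assumptions introduced here. AUC is computed through its pairwise-ranking interpretation (the fraction of positive-negative pairs in which the positive scores higher), and AP uses the common step-wise estimate of the area under the PRC (precision at each threshold, weighted by the increase in recall).

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise ranking (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve."""
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive case")
    # Walk thresholds from the highest score down; tied scores share a threshold.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    ap = 0.0
    i = 0
    while i < len(ranked):
        tp_before = tp
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        recall_gain = (tp - tp_before) / n_pos
        ap += recall_gain * tp / (tp + fp)  # precision at this threshold
        i = j
    return ap


# Toy data (hypothetical, not from the study): 1 = fracture, 0 = no fracture.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(round(roc_auc(labels, scores), 3))            # 0.889
print(round(average_precision(labels, scores), 3))  # 0.917
```

The pairwise AUC loop is quadratic in the number of cases, which is acceptable for a sketch; a rank-based formulation would be preferable for large datasets.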

RESULTS

AUCs for all four (model, dataset) pairs were above 92%. On both datasets, the difference in AUC between the two models was less than 3%. APs on the enriched dataset were high for both models (95% and 92%, respectively). However, APs on the uniformly sampled dataset were lower than expected (80% and 69%, respectively). The difference between the models was largest (11 percentage points) when performance was measured using AP on the uniformly sampled dataset.

CONCLUSION

AUC, although a commonly used performance metric, saturates early and is therefore not suitable for model selection among high-performance models (i.e., AUC > 0.9). Similarly, model selection on artificially enriched datasets is not good practice, because both AUC and AP saturate early on such datasets. Average precision measured on a uniformly sampled dataset exposes the deficiencies in a model's performance and is therefore a better metric for model selection.
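One way to see why AP reacts to dataset composition while AUC does not: sensitivity and specificity are each computed within a single class, but precision mixes the two classes and so depends on prevalence. By Bayes' rule, with Se for sensitivity, Sp for specificity, and \pi for the fraction of positive scans, precision at a fixed threshold is given below. The operating point (Se = 0.90, Sp = 0.95) and the two prevalences are illustrative assumptions, not values from this study.

```latex
\begin{align*}
\mathrm{PPV} &= \frac{\mathrm{Se}\,\pi}{\mathrm{Se}\,\pi + (1-\mathrm{Sp})(1-\pi)} \\[4pt]
% Illustrative values only: Se = 0.90, Sp = 0.95
\pi = 0.50:\quad \mathrm{PPV} &= \frac{0.450}{0.450 + 0.025} \approx 0.947 \\
\pi = 0.05:\quad \mathrm{PPV} &= \frac{0.045}{0.045 + 0.0475} \approx 0.486
\end{align*}
```

The same operating point therefore looks far better on a balanced dataset than at a low prevalence, which is consistent with AP appearing saturated on enriched data.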

CLINICAL RELEVANCE/APPLICATION

Average precision and uniformly sampled datasets should be used to evaluate artificial intelligence models in radiology instead of AUC and enriched datasets.

AUC and Enriched Datasets are Not Good Enough Anymore: Presenting an Alternative Metric to Evaluate Radiology AI Models