Description
🚀 Feature
I would like to contribute to torchmetrics, by implementing the Brier score and its associated decomposition.
Motivation
The Brier score is widely used to measure the calibration of machine learning methods; see:
https://arxiv.org/abs/2302.04019
https://arxiv.org/abs/2002.06470
It is also a proper scoring rule, unlike the Expected Calibration Error (ECE) and the Thresholded Adaptive Calibration Error (TACE). This means that the ECE and the TACE have trivial minima in which the classifier is perfectly calibrated while having zero test accuracy (https://arxiv.org/abs/1906.02530). Being a proper scoring rule, the Brier score does not have this pathological behaviour.
The Brier score coincides with the mean squared error for common use cases. However, its decomposition into resolution, reliability and uncertainty (see https://en.wikipedia.org/wiki/Brier_score) is a unique and useful feature. Roughly speaking, resolution captures a notion of accuracy and reliability a notion of calibration. Both therefore have to be optimized for the Brier score to be low.
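For reference, the decomposition for binary events can be written as follows. This is a sketch of the standard form; the symbols are introduced here for illustration and are not taken from this issue: $N$ forecasts take $K$ distinct values $f_k$, each issued $n_k$ times, $\bar{o}_k$ is the observed event frequency among the cases where $f_k$ was issued, and $\bar{o}$ is the overall event frequency.

$$
\mathrm{BS}
= \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k \,(f_k - \bar{o}_k)^2}_{\text{reliability}}
\;-\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k \,(\bar{o}_k - \bar{o})^2}_{\text{resolution}}
\;+\; \underbrace{\bar{o}\,(1 - \bar{o})}_{\text{uncertainty}}
$$

The reliability term enters with a plus sign and the resolution term with a minus sign, so a low Brier score requires a small reliability term and a large resolution term. The uncertainty term depends only on the labels.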
Finally, no standard implementation in common packages exists to the best of my knowledge.
Pitch
I plan to follow the original paper describing the decomposition of the Brier score into resolution, reliability and uncertainty
https://journals.ametsoc.org/view/journals/apme/12/4/1520-0450_1973_012_0595_anvpot_2_0_co_2.xml
and specifically the implementation found in
https://github.com/google-research/google-research/blob/master/uq_benchmark_2019/metrics_lib.py
and the paper
https://arxiv.org/abs/1906.02530
The decomposition into uncertainty, resolution and reliability was originally formulated for predictions that take a finite set of values. In contrast, most deep neural network classifiers output a vector of per-class probabilities that take continuous values, so the output vectors have to be binned. In this implementation the bins are defined by the most probable class of each input, which gives C bins, where C is the number of classes. For example, the two prediction vectors [0, 0.9, 0.1] and [0.2, 0.6, 0.2] fall in the same bin, that of class 2 (counting classes from 1). The derivation of resolution, reliability and uncertainty is then relatively straightforward.
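The binning described above could be sketched roughly as follows. This is a minimal NumPy illustration written for this discussion, not the torchmetrics API and not a copy of the google-research implementation. The function names `brier_decomposition` and `brier_score` are hypothetical, and the sketch assumes integer class labels in `[0, C)`. The identity Brier = reliability - resolution + uncertainty is exact only when all predictions inside a bin are identical; with continuous outputs it holds up to a within-bin residual.

```python
import numpy as np


def brier_decomposition(probs, labels):
    """Sketch: Brier decomposition with one bin per predicted (argmax) class.

    probs:  (N, C) array of predicted class probabilities.
    labels: (N,) array of integer class labels in [0, C).
    Returns (uncertainty, resolution, reliability).
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n, c = probs.shape
    onehot = np.eye(c)[labels]               # (N, C) one-hot targets
    bins = probs.argmax(axis=1)              # bin index = most probable class

    base_rate = onehot.mean(axis=0)          # overall label distribution
    counts = np.bincount(bins, minlength=c)  # number of samples per bin
    weights = counts / n                     # fraction of samples per bin

    # Empirical label distribution inside each bin (rows of empty bins stay 0).
    bin_sums = np.zeros((c, c))
    np.add.at(bin_sums, bins, onehot)
    bin_rate = bin_sums / np.maximum(counts, 1)[:, None]

    # Uncertainty: spread of the overall label distribution.
    uncertainty = float(np.sum(base_rate * (1.0 - base_rate)))
    # Resolution: how far each bin's label distribution is from the overall one.
    resolution = float(
        np.sum(weights * np.sum((bin_rate - base_rate) ** 2, axis=1))
    )
    # Reliability: how far each prediction is from its bin's label distribution.
    reliability = float(np.mean(np.sum((probs - bin_rate[bins]) ** 2, axis=1)))
    return uncertainty, resolution, reliability


def brier_score(probs, labels):
    """Multi-class Brier score: mean squared distance to the one-hot target."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(labels, dtype=int)]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))


# The two vectors from the text share a bin: index 1, i.e. class 2 counting from 1.
example_bins = np.argmax([[0.0, 0.9, 0.1], [0.2, 0.6, 0.2]], axis=1)
```

Using the argmax as the bin index keeps the number of bins fixed at C and avoids choosing bin edges, at the cost of merging predictions with very different confidence levels into the same bin.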