Spaces: jordyvl/ece
Commit: might be defunct now
Browse files
README.md
CHANGED
@@ -20,7 +20,14 @@ pinned: false
 <!---
 *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
 -->
-`ECE` is a standard metric to evaluate top-1 prediction miscalibration.
+Expected Calibration Error (`ECE`) is a standard metric to evaluate top-1 prediction miscalibration.
+It measures the $L^p$ norm difference between a model's posterior and the true likelihood of being correct.
+$$\mathrm{ECE}_p(f)^p = \mathbb{E}_{(X,Y)}\left[\left|\,\mathbb{P}\left(Y = \hat{y} \mid \hat{p}\right) - \hat{p}\,\right|^p\right],$$ where $\hat{y} = \arg\max_{y'} [f(X)]_{y'}$ is the class prediction with associated posterior probability $\hat{p} = \max_{y'} [f(X)]_{y'}$.
+
+It is generally implemented as a binned estimator that discretizes predicted probabilities into a range of possible values (bins) for which the conditional expectation can be estimated.
+
+As a metric of calibration *error*, lower values indicate a better-calibrated model.
+For valid model comparisons, make sure to use the same keyword arguments.
 
 
 ## How to Use
@@ -30,6 +37,8 @@ pinned: false
 -->
 
 
+
+
 ### Inputs
 <!---
 *List all input arguments in the format below*
@@ -52,11 +61,12 @@ pinned: false
 *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
 -->
 
+
 ## Limitations and Bias
 <!---
 *Note any known limitations or biases that the metric has, with links and references if possible.*
 -->
-See [3],[4] and [5]
+See [3], [4] and [5].
 
 ## Citation
 [1] Naeini, M.P., Cooper, G. and Hauskrecht, M., 2015, February. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
@@ -64,6 +74,7 @@ See [3],[4] and [5]
 [3] Nixon, J., Dusenberry, M.W., Zhang, L., Jerfel, G. and Tran, D., 2019, June. Measuring calibration in deep learning. In CVPR Workshops (Vol. 2, No. 7).
 [4] Kumar, A., Liang, P.S. and Ma, T., 2019. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32.
 [5] Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J. and Schön, T., 2019, April. Evaluating model calibration in classification. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 3459-3467). PMLR.
+[6] Allen-Zhu, Z., Li, Y. and Liang, Y., 2019. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in Neural Information Processing Systems, 32.
 
 ## Further References
 *Add any useful further references.*
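To make the definition above concrete, here is a minimal, self-contained sketch of a binned top-1 ECE estimator in the spirit of the README description. It is an illustration, not the Space's implementation: `binned_ece` is a hypothetical helper, and it compares bin accuracy against the *mean confidence* per bin, whereas `ece.py` below uses the upper bin edge as the calibrated-accuracy proxy.

```python
# Minimal sketch of a binned top-1 ECE estimator (p=1, equal-range bins).
import numpy as np

def binned_ece(probs, labels, n_bins=10):
    """probs: (N, K) normalized scores; labels: (N,) integer classes."""
    p_hat = probs.max(-1)                    # top-1 confidence
    correct = probs.argmax(-1) == labels     # top-1 correctness indicator
    edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p_hat > lo) & (p_hat <= hi)  # (lo, hi] bins
        if mask.any():
            gap = abs(correct[mask].mean() - p_hat[mask].mean())
            ece += mask.mean() * gap         # weight by bin frequency
    return ece

probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([0, 1, 1])
print(binned_ece(probs, labels))  # ~0.333: weighted |accuracy - confidence|
```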
app.py
CHANGED
@@ -3,4 +3,5 @@ from evaluate.utils import launch_gradio_widget
 
 
 module = evaluate.load("jordyvl/ece")
-launch_gradio_widget(module)
+launch_gradio_widget(module)
+
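For context, a sketch of how the module that `app.py` wraps in a Gradio widget can be exercised programmatically. That the hosted metric accepts per-class probability vectors and integer labels in exactly this form is an assumption based on the `ece.py` feature schema below, not documented behaviour.

```python
# Hypothetical programmatic usage of the Space's metric.
import evaluate

ece_metric = evaluate.load("jordyvl/ece")
result = ece_metric.compute(
    predictions=[[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]],  # rows sum to 1
    references=[0, 1, 2],  # integer class labels
)
print(result)  # e.g. {'ECE': ...}
```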
ece.py
CHANGED
@@ -13,6 +13,8 @@
 # limitations under the License.
 """TODO: Add a description here."""
 
+# https://huggingface.co/spaces/jordyvl/ece
+
 import evaluate
 import datasets
 import numpy as np
@@ -29,7 +31,8 @@ year={2022}
 
 # TODO: Add description of the module here
 _DESCRIPTION = """\
-This new module is designed to solve this great ML task and is crafted with a lot of care.
+This new module is designed to evaluate the calibration of a probabilistic classifier.
+More concretely, we provide a binned empirical estimator of top-1 calibration error. [1]
 """
 
 
@@ -41,6 +44,9 @@ Args:
         should be a string with tokens separated by spaces.
     references: list of reference for each prediction. Each
         reference should be a string with tokens separated by spaces.
+
+
+
 Returns:
     accuracy: description of the first score,
     another_score: description of the second score,
@@ -55,14 +61,50 @@ Examples:
 """
 
 # TODO: Define external resources urls if needed
-BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
-
-
-
+BAD_WORDS_URL = ""
+
+
+# Discretization and binning
+def create_bins(n_bins=10, scheme="equal-range", bin_range=None, P=None):
+    assert scheme in [
+        "equal-range",
+        "equal-mass",
+    ], f"This binning scheme {scheme} is not implemented yet"
+
+    if bin_range is None:
+        if P is None:
+            bin_range = [0, 1]  # no way to know range
+        else:
+            if scheme == "equal-range":
+                bin_range = [min(P), max(P)]
+
+    if scheme == "equal-range":
+        bins = np.linspace(bin_range[0], bin_range[1], n_bins + 1)  # equal range
+        # bins = np.tile(np.linspace(bin_range[0], bin_range[1], n_bins + 1), (n_classes,1))
+
+    elif scheme == "equal-mass":
+        assert P.size >= n_bins, "Fewer points than bins"
+
+        # assume global equal mass binning; not discriminated per class
+        P = P.flatten()
+
+        # split sorted probabilities into groups of approx equal size
+        groups = np.array_split(np.sort(P), n_bins)
+        bin_upper_edges = list()
+
+        # rightmost entry per equal size group
+        for cur_group in range(n_bins - 1):
+            bin_upper_edges += [max(groups[cur_group])]
+        bin_upper_edges += [np.inf]  # always +1 for right edges
+        bins = np.array(bin_upper_edges)
+
+    return bins
+
+
+def discretize_into_bins(P, bins):
+    oneDbins = np.digitize(P, bins) - 1  # since bins contains extra rightmost & leftmost bins
+
+    # Fix to scipy.binned_dd_statistic:
     # Tie-breaking to the left for rightmost bin
     # Using `digitize`, values that fall on an edge are put in the right bin.
     # For the rightmost bin, we want values equal to the right
@@ -72,7 +114,7 @@ def bin_idx_dd(P, bins):
     # Find the rounding precision
     dedges_min = np.diff(bins).min()
     if dedges_min == 0:
-        raise ValueError(
+        raise ValueError("The smallest edge difference is numerically 0.")
 
     decimal = int(-np.log10(dedges_min)) + 6
 
@@ -87,48 +129,49 @@
 
 
 def manual_binned_statistic(P, y_correct, bins, statistic="mean"):
-
-    binnumbers = bin_idx_dd(np.expand_dims(P, 0), bins)[0]
+    bin_assignments = discretize_into_bins(np.expand_dims(P, 0), bins)[0]
     result = np.empty([len(bins)], float)
-    result.fill(np.nan)
+    result.fill(np.nan)  # cannot assume each bin will have observations
 
-    flatcount = np.bincount(
+    flatcount = np.bincount(bin_assignments, None)
     a = flatcount.nonzero()
 
-    if statistic ==
-        flatsum = np.bincount(
+    if statistic == "mean":
+        flatsum = np.bincount(bin_assignments, y_correct)
         result[a] = flatsum[a] / flatcount[a]
-    return result, bins,
+    return result, bins, bin_assignments + 1  # fix for what happens in discretize_into_bins
+
 
-def CE_estimate(y_correct, P, bins=None, n_bins=10, p=1):
+def bin_calibrated_accuracy(bins, proxy="upper-edge"):
+    assert proxy in ["center", "upper-edge"], f"Unsupported proxy {proxy}"
+
+    if proxy == "upper-edge":
+        return bins[1:]
+
+    if proxy == "center":
+        return bins[:-1] + np.diff(bins) / 2
+
+
+def CE_estimate(y_correct, P, bins=None, p=1, proxy="upper-edge"):
     """
     y_correct: binary (N x 1)
     P: normalized (N x 1) either max or per class
 
-    Summary: weighted average over the accuracy/confidence difference of
+    Summary: weighted average over the accuracy/confidence difference of discrete bins of prediction probability
     """
 
-
-    if bins is None:
-        n_bins = n_bins
-        bin_range = [0, 1]
-        bins = np.linspace(bin_range[0], bin_range[1], n_bins + 1)
-        # expected; equal range binning
-    else:
-        n_bins = len(bins) - 1
-        bin_range = [min(bins), max(bins)]
-
-    # average bin probability #55 for bin 50-60; mean per bin
-    calibrated_acc = bins[1:]  # right/upper bin edges
-    # calibrated_acc = bin_centers(bins)
+    n_bins = len(bins) - 1
+    bin_range = [min(bins), max(bins)]
 
+    # average bin probability #55 for bin 50-60, mean per bin; or right/upper bin edges
+    calibrated_acc = bin_calibrated_accuracy(bins, proxy=proxy)
 
     empirical_acc, bin_edges, bin_assignment = manual_binned_statistic(P, y_correct, bins)
     bin_numbers, weights_ece = np.unique(bin_assignment, return_counts=True)
     anindices = bin_numbers - 1  # reduce bin counts; left edge; indexes right BY DEFAULT
 
     # Expected calibration error
-    if p < np.inf:  #
+    if p < np.inf:  # L^p-CE
         CE = np.average(
             abs(empirical_acc[anindices] - calibrated_acc[anindices]) ** p,
             weights=weights_ece,  # weighted average 1/binfreq
@@ -138,11 +181,14 @@ def CE_estimate(y_correct, P, bins=None, n_bins=10, p=1):
 
     return CE
 
-
-
-
-
-
+
+def top_1_CE(Y, P, **kwargs):
+    y_correct = (Y == np.argmax(P, -1)).astype(int)  # create condition y = ŷ ∈ [K]
+    p_max = np.max(P, -1)  # create p̂ as top-1 softmax probability ∈ [0, 1]
+    bins = create_bins(
+        n_bins=kwargs["n_bins"], bin_range=kwargs["bin_range"], scheme=kwargs["scheme"], P=p_max
+    )
+    return CE_estimate(y_correct, p_max, bins=bins, proxy=kwargs["proxy"])
 
 
 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
@@ -157,9 +203,18 @@ class ECE(evaluate.EvaluationModule):
     4. apply L^p norm distance and weights
    """
 
-    #have to add to initialization here?
-    #create bins using the params
-    #create proxy
+    # have to add to initialization here?
+    # create bins using the params
+    # create proxy
+
+    def __init__(self, n_bins=10, bin_range=None, scheme="equal-range", proxy="upper-edge", p=1):
+        super().__init__()
+
+        self.bin_range = bin_range
+        self.n_bins = n_bins
+        self.scheme = scheme
+        self.proxy = proxy
+        self.p = p
 
     def _info(self):
         # TODO: Specifies the evaluate.EvaluationModuleInfo object
@@ -170,15 +225,17 @@ class ECE(evaluate.EvaluationModule):
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,
             # This defines the format of each prediction and reference
-            features=datasets.Features({
-                'predictions': datasets.Value('int64'),
-                'references': datasets.Value('int64'),
-            }),
+            features=datasets.Features(
+                {
+                    "predictions": datasets.Sequence(datasets.Value("float32")),
+                    "references": datasets.Value("int64"),
+                }
+            ),
             # Homepage of the module for documentation
-            homepage="http://module.homepage",
+            homepage="https://huggingface.co/spaces/jordyvl/ece",
             # Additional links to the codebase or references
             codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
-            reference_urls=["http://path.to.reference.url/new_module"]
+            reference_urls=["http://path.to.reference.url/new_module"],
         )
 
     def _download_and_prepare(self, dl_manager):
@@ -188,20 +245,24 @@ class ECE(evaluate.EvaluationModule):
 
     def _compute(self, predictions, references):
         """Returns the scores"""
-
+
+        ECE = top_1_CE(
+            references, predictions,
+            n_bins=self.n_bins, bin_range=self.bin_range, scheme=self.scheme, proxy=self.proxy,
+        )
         return {
             "ECE": ECE,
         }
 
 
 def test_ECE():
-    N = 10
-    K = 5
-
-    def random_mc_instance(concentration=1):
-        reference = np.argmax(
-
-        #
+    N = 10  # N evaluation instances {(x_i,y_i)}_{i=1}^N
+    K = 5  # K class problem
+
+    def random_mc_instance(concentration=1, onehot=False):
+        reference = np.argmax(
+            np.random.dirichlet(([concentration for _ in range(K)])), -1
+        )  # class targets
+        prediction = np.random.dirichlet(([concentration for _ in range(K)]))  # probabilities
+        if onehot:
+            reference = np.eye(K)[np.argmax(reference, -1)]
         return reference, prediction
 
     references, predictions = list(zip(*[random_mc_instance() for i in range(N)]))
@@ -210,5 +271,10 @@ def test_ECE():
     res = ECE()._compute(predictions, references)
     print(f"ECE: {res['ECE']}")
 
-
-
+
+if __name__ == "__main__":
+    test_ECE()
+
+
+# if scheme == "equal-mass":
+#     raise AssertionError("Need to calculate based on P")  # so cannot instantiate yet
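The rounding logic in `discretize_into_bins` exists because of an `np.digitize` edge case; a standalone sketch of the behaviour being guarded against (not the Space's code):

```python
# Right-edge tie-breaking issue that discretize_into_bins works around.
import numpy as np

bins = np.linspace(0, 1, 5 + 1)   # edges 0.0, 0.2, ..., 1.0
vals = np.array([0.1, 0.2, 1.0])

# With the default right=False, a value equal to an edge goes to the
# right bin, so a confidence of exactly 1.0 falls out of range (index 6).
print(np.digitize(vals, bins))    # [1 2 6]

# Tie-breaking to the left for the rightmost bin keeps 1.0 in bin 5.
idx = np.digitize(vals, bins)
idx[vals == bins[-1]] -= 1
print(idx)                        # [1 2 5]
```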
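To see what `CE_estimate` computes, here is the same frequency-weighted average restated with plain numpy on made-up bin statistics (the numbers are illustrative only):

```python
# Worked toy example of the weighted average inside CE_estimate.
import numpy as np

# Suppose bins = [0.0, 0.5, 1.0] with the upper-edge proxy, and both bins
# are populated: 4 points in (0, 0.5], 6 points in (0.5, 1.0].
calibrated_acc = np.array([0.5, 1.0])   # upper bin edges
empirical_acc = np.array([0.25, 0.75])  # observed accuracy per bin
weights_ece = np.array([4, 6])          # observations per bin (N = 10)

# L^1 calibration error: frequency-weighted |accuracy - proxy|.
CE = np.average(abs(empirical_acc - calibrated_acc) ** 1, weights=weights_ece)
print(CE)  # (4 * 0.25 + 6 * 0.25) / 10 = 0.25
```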
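Finally, a sketch contrasting the two binning schemes `create_bins` supports, assuming `ece.py` is importable as a local module and the equal-mass spelling fix above is applied (the sample confidences are arbitrary):

```python
# Equal-range vs. equal-mass binning in create_bins (assumes ece.py is
# on the path as a local module).
import numpy as np
from ece import create_bins

p_max = np.array([0.55, 0.6, 0.62, 0.81, 0.9, 0.95, 0.97, 0.99])

# Equal-range: evenly spaced edges, independent of the data.
print(create_bins(n_bins=4, scheme="equal-range", bin_range=[0, 1]))
# -> [0.   0.25 0.5  0.75 1.  ]

# Equal-mass: data-dependent upper edges, ~N/n_bins points per bin;
# the final np.inf edge keeps any mass at 1.0 inside the last bin.
print(create_bins(n_bins=4, scheme="equal-mass", P=p_max))
# -> [0.6  0.81 0.95  inf]
```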
|