Spaces: jordyvl/ece
Commit: might be defunct now
Browse files
README.md
CHANGED
@@ -20,7 +20,14 @@ pinned: false
 <!---
 *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
 -->
-`ECE` is a standard metric to evaluate top-1 prediction miscalibration.
+Expected Calibration Error (`ECE`) is a standard metric to evaluate top-1 prediction miscalibration.
+It measures the $L^p$ norm difference between a model's posterior and the true likelihood of being correct.
+$$\mathrm{ECE}_p(f)^p = \mathbb{E}_{(X,Y)}\left[\left|\,\mathbb{P}\left(Y = \hat{y} \mid \hat{p}\right) - \hat{p}\,\right|^p\right],$$ where $\hat{y} = \arg\max_{y'} [f(X)]_{y'}$ is the class prediction with associated posterior probability $\hat{p} = \max_{y'} [f(X)]_{y'}$.
+
+It is generally implemented as a binned estimator that discretizes predicted probabilities into a range of possible values (bins) for which the conditional expectation can be estimated.
+
+As a metric of calibration *error*, lower values indicate a better-calibrated model.
+For valid model comparisons, make sure to use the same keyword arguments.
 
 
 ## How to Use
@@ -30,6 +37,8 @@ pinned: false
 -->
 
 
+
+
 ### Inputs
 <!---
 *List all input arguments in the format below*
@@ -52,11 +61,12 @@ pinned: false
 *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
 -->
 
+
 ## Limitations and Bias
 <!---
 *Note any known limitations or biases that the metric has, with links and references if possible.*
 -->
-See [3],[4] and [5]
+See [3], [4] and [5].
 
 ## Citation
 [1] Naeini, M.P., Cooper, G. and Hauskrecht, M., 2015, February. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
@@ -64,6 +74,7 @@ See [3],[4] and [5]
 [3] Nixon, J., Dusenberry, M.W., Zhang, L., Jerfel, G. and Tran, D., 2019, June. Measuring calibration in deep learning. In CVPR Workshops (Vol. 2, No. 7).
 [4] Kumar, A., Liang, P.S. and Ma, T., 2019. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32.
 [5] Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J. and Schön, T., 2019, April. Evaluating model calibration in classification. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 3459-3467). PMLR.
+[6] Allen-Zhu, Z., Li, Y. and Liang, Y., 2019. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in Neural Information Processing Systems, 32.
 
 ## Further References
 *Add any useful further references.*
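To make the definition above concrete, here is a minimal, self-contained sketch of a binned top-1 ECE estimator in the spirit of the README description. It is an illustration, not the Space's implementation: `binned_ece` is a hypothetical helper, and it compares bin accuracy against the *mean confidence* per bin, whereas `ece.py` below uses the upper bin edge as the calibrated-accuracy proxy.

```python
# Minimal sketch of a binned top-1 ECE estimator (p=1, equal-range bins).
import numpy as np

def binned_ece(probs, labels, n_bins=10):
    """probs: (N, K) normalized scores; labels: (N,) integer classes."""
    p_hat = probs.max(-1)                    # top-1 confidence
    correct = probs.argmax(-1) == labels     # top-1 correctness indicator
    edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p_hat > lo) & (p_hat <= hi)  # (lo, hi] bins
        if mask.any():
            gap = abs(correct[mask].mean() - p_hat[mask].mean())
            ece += mask.mean() * gap         # weight by bin frequency
    return ece

probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([0, 1, 1])
print(binned_ece(probs, labels))  # ~0.333: weighted |accuracy - confidence|
```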
app.py
CHANGED
@@ -3,4 +3,5 @@ from evaluate.utils import launch_gradio_widget
 
 
 module = evaluate.load("jordyvl/ece")
-launch_gradio_widget(module)
+launch_gradio_widget(module)
+
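For context, a sketch of how the module that `app.py` wraps in a Gradio widget can be exercised programmatically. That the hosted metric accepts per-class probability vectors and integer labels in exactly this form is an assumption based on the `ece.py` feature schema below, not documented behaviour.

```python
# Hypothetical programmatic usage of the Space's metric.
import evaluate

ece_metric = evaluate.load("jordyvl/ece")
result = ece_metric.compute(
    predictions=[[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]],  # rows sum to 1
    references=[0, 1, 2],  # integer class labels
)
print(result)  # e.g. {'ECE': ...}
```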
ece.py
CHANGED
@@ -13,6 +13,8 @@
 # limitations under the License.
 """TODO: Add a description here."""
 
+# https://huggingface.co/spaces/jordyvl/ece
+
 import evaluate
 import datasets
 import numpy as np
@@ -29,7 +31,8 @@ year={2022}
 
 # TODO: Add description of the module here
 _DESCRIPTION = """\
-This new module is designed to solve this great ML task and is crafted with a lot of care.
+This new module is designed to evaluate the calibration of a probabilistic classifier.
+More concretely, we provide a binned empirical estimator of top-1 calibration error. [1]
 """
 
 
@@ -41,6 +44,9 @@ Args:
         should be a string with tokens separated by spaces.
     references: list of reference for each prediction. Each
         reference should be a string with tokens separated by spaces.
+
+
+
 Returns:
     accuracy: description of the first score,
     another_score: description of the second score,
@@ -55,14 +61,50 @@ Examples:
 """
 
 # TODO: Define external resources urls if needed
-BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
-
-
-
+BAD_WORDS_URL = ""
+
+
+# Discretization and binning
+def create_bins(n_bins=10, scheme="equal-range", bin_range=None, P=None):
+    assert scheme in [
+        "equal-range",
+        "equal-mass",
+    ], f"This binning scheme {scheme} is not implemented yet"
+
+    if bin_range is None:
+        if P is None:
+            bin_range = [0, 1]  # no way to know range
+        else:
+            if scheme == "equal-range":
+                bin_range = [min(P), max(P)]
+
+    if scheme == "equal-range":
+        bins = np.linspace(bin_range[0], bin_range[1], n_bins + 1)  # equal range
+        # bins = np.tile(np.linspace(bin_range[0], bin_range[1], n_bins + 1), (n_classes,1))
+
+    elif scheme == "equal-mass":
+        assert P.size >= n_bins, "Fewer points than bins"
+
+        # assume global equal mass binning; not discriminated per class
+        P = P.flatten()
+
+        # split sorted probabilities into groups of approx equal size
+        groups = np.array_split(np.sort(P), n_bins)
+        bin_upper_edges = list()
+
+        # rightmost entry per equal size group
+        for cur_group in range(n_bins - 1):
+            bin_upper_edges += [max(groups[cur_group])]
+        bin_upper_edges += [np.inf]  # always +1 for right edges
+        bins = np.array(bin_upper_edges)
+
+    return bins
+
+
+def discretize_into_bins(P, bins):
+    oneDbins = np.digitize(P, bins) - 1  # since bins contains extra rightmost & leftmost bins
+
+    # Fix to scipy.binned_dd_statistic:
     # Tie-breaking to the left for rightmost bin
     # Using `digitize`, values that fall on an edge are put in the right bin.
     # For the rightmost bin, we want values equal to the right
@@ -72,7 +114,7 @@ def bin_idx_dd(P, bins):
     # Find the rounding precision
     dedges_min = np.diff(bins).min()
     if dedges_min == 0:
-        raise ValueError(
+        raise ValueError("The smallest edge difference is numerically 0.")
 
     decimal = int(-np.log10(dedges_min)) + 6
 
@@ -87,48 +129,49 @@
 
 
 def manual_binned_statistic(P, y_correct, bins, statistic="mean"):
-
-    binnumbers = bin_idx_dd(np.expand_dims(P, 0), bins)[0]
+    bin_assignments = discretize_into_bins(np.expand_dims(P, 0), bins)[0]
     result = np.empty([len(bins)], float)
-    result.fill(np.nan)
+    result.fill(np.nan)  # cannot assume each bin will have observations
 
-    flatcount = np.bincount(
+    flatcount = np.bincount(bin_assignments, None)
     a = flatcount.nonzero()
 
-    if statistic ==
-        flatsum = np.bincount(
+    if statistic == "mean":
+        flatsum = np.bincount(bin_assignments, y_correct)
         result[a] = flatsum[a] / flatcount[a]
-    return result, bins,
+    return result, bins, bin_assignments + 1  # fix for what happens in discretize_into_bins
+
 
-def CE_estimate(y_correct, P, bins=None, n_bins=10, p=1):
+def bin_calibrated_accuracy(bins, proxy="upper-edge"):
+    assert proxy in ["center", "upper-edge"], f"Unsupported proxy {proxy}"
+
+    if proxy == "upper-edge":
+        return bins[1:]
+
+    if proxy == "center":
+        return bins[:-1] + np.diff(bins) / 2
+
+
+def CE_estimate(y_correct, P, bins=None, p=1, proxy="upper-edge"):
     """
     y_correct: binary (N x 1)
     P: normalized (N x 1) either max or per class
 
-    Summary: weighted average over the accuracy/confidence difference of
+    Summary: weighted average over the accuracy/confidence difference of discrete bins of prediction probability
     """
 
-
-    if bins is None:
-        n_bins = n_bins
-        bin_range = [0, 1]
-        bins = np.linspace(bin_range[0], bin_range[1], n_bins + 1)
-        # expected; equal range binning
-    else:
-        n_bins = len(bins) - 1
-        bin_range = [min(bins), max(bins)]
-
-    # average bin probability #55 for bin 50-60; mean per bin
-    calibrated_acc = bins[1:]  # right/upper bin edges
-    # calibrated_acc = bin_centers(bins)
+    n_bins = len(bins) - 1
+    bin_range = [min(bins), max(bins)]
 
+    # average bin probability #55 for bin 50-60, mean per bin; or right/upper bin edges
+    calibrated_acc = bin_calibrated_accuracy(bins, proxy=proxy)
 
     empirical_acc, bin_edges, bin_assignment = manual_binned_statistic(P, y_correct, bins)
     bin_numbers, weights_ece = np.unique(bin_assignment, return_counts=True)
     anindices = bin_numbers - 1  # reduce bin counts; left edge; indexes right BY DEFAULT
 
     # Expected calibration error
-    if p < np.inf:  #
+    if p < np.inf:  # L^p-CE
         CE = np.average(
             abs(empirical_acc[anindices] - calibrated_acc[anindices]) ** p,
             weights=weights_ece,  # weighted average 1/binfreq
@@ -138,11 +181,14 @@ def CE_estimate(y_correct, P, bins=None, n_bins=10, p=1):
 
     return CE
 
-
-
-
-
-
+
+def top_1_CE(Y, P, **kwargs):
+    y_correct = (Y == np.argmax(P, -1)).astype(int)  # create condition y = ŷ ∈ [K]
+    p_max = np.max(P, -1)  # create p̂ as top-1 softmax probability ∈ [0, 1]
+    bins = create_bins(
+        n_bins=kwargs["n_bins"], bin_range=kwargs["bin_range"], scheme=kwargs["scheme"], P=p_max
+    )
+    return CE_estimate(y_correct, p_max, bins=bins, proxy=kwargs["proxy"])
 
 
 @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
@@ -157,9 +203,18 @@ class ECE(evaluate.EvaluationModule):
     4. apply L^p norm distance and weights
    """
 
-    #have to add to initialization here?
-    #create bins using the params
-    #create proxy
+    # have to add to initialization here?
+    # create bins using the params
+    # create proxy
+
+    def __init__(self, n_bins=10, bin_range=None, scheme="equal-range", proxy="upper-edge", p=1):
+        super().__init__()
+
+        self.bin_range = bin_range
+        self.n_bins = n_bins
+        self.scheme = scheme
+        self.proxy = proxy
+        self.p = p
 
     def _info(self):
         # TODO: Specifies the evaluate.EvaluationModuleInfo object
@@ -170,15 +225,17 @@ class ECE(evaluate.EvaluationModule):
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,
             # This defines the format of each prediction and reference
-            features=datasets.Features({
-                'predictions': datasets.Value('int64'),
-                'references': datasets.Value('int64'),
-            }),
+            features=datasets.Features(
+                {
+                    "predictions": datasets.Sequence(datasets.Value("float32")),
+                    "references": datasets.Value("int64"),
+                }
+            ),
             # Homepage of the module for documentation
-            homepage="http://module.homepage",
+            homepage="https://huggingface.co/spaces/jordyvl/ece",
             # Additional links to the codebase or references
             codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
-            reference_urls=["http://path.to.reference.url/new_module"]
+            reference_urls=["http://path.to.reference.url/new_module"],
         )
 
     def _download_and_prepare(self, dl_manager):
@@ -188,20 +245,24 @@ class ECE(evaluate.EvaluationModule):
 
     def _compute(self, predictions, references):
         """Returns the scores"""
-
+
+        ECE = top_1_CE(
+            references, predictions,
+            n_bins=self.n_bins, bin_range=self.bin_range, scheme=self.scheme, proxy=self.proxy,
+        )
         return {
             "ECE": ECE,
         }
 
 
 def test_ECE():
-    N = 10
-    K = 5
-
-    def random_mc_instance(concentration=1):
-        reference = np.argmax(
-
-        #
+    N = 10  # N evaluation instances {(x_i,y_i)}_{i=1}^N
+    K = 5  # K class problem
+
+    def random_mc_instance(concentration=1, onehot=False):
+        reference = np.argmax(
+            np.random.dirichlet(([concentration for _ in range(K)])), -1
+        )  # class targets
+        prediction = np.random.dirichlet(([concentration for _ in range(K)]))  # probabilities
+        if onehot:
+            reference = np.eye(K)[np.argmax(reference, -1)]
         return reference, prediction
 
     references, predictions = list(zip(*[random_mc_instance() for i in range(N)]))
@@ -210,5 +271,10 @@ def test_ECE():
     res = ECE()._compute(predictions, references)
     print(f"ECE: {res['ECE']}")
 
-
-
+
+if __name__ == "__main__":
+    test_ECE()
+
+
+# if scheme == "equal-mass":
+#     raise AssertionError("Need to calculate based on P")  # so cannot instantiate yet
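The rounding logic in `discretize_into_bins` exists because of an `np.digitize` edge case; a standalone sketch of the behaviour being guarded against (not the Space's code):

```python
# Right-edge tie-breaking issue that discretize_into_bins works around.
import numpy as np

bins = np.linspace(0, 1, 5 + 1)   # edges 0.0, 0.2, ..., 1.0
vals = np.array([0.1, 0.2, 1.0])

# With the default right=False, a value equal to an edge goes to the
# right bin, so a confidence of exactly 1.0 falls out of range (index 6).
print(np.digitize(vals, bins))    # [1 2 6]

# Tie-breaking to the left for the rightmost bin keeps 1.0 in bin 5.
idx = np.digitize(vals, bins)
idx[vals == bins[-1]] -= 1
print(idx)                        # [1 2 5]
```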
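To see what `CE_estimate` computes, here is the same frequency-weighted average restated with plain numpy on made-up bin statistics (the numbers are illustrative only):

```python
# Worked toy example of the weighted average inside CE_estimate.
import numpy as np

# Suppose bins = [0.0, 0.5, 1.0] with the upper-edge proxy, and both bins
# are populated: 4 points in (0, 0.5], 6 points in (0.5, 1.0].
calibrated_acc = np.array([0.5, 1.0])   # upper bin edges
empirical_acc = np.array([0.25, 0.75])  # observed accuracy per bin
weights_ece = np.array([4, 6])          # observations per bin (N = 10)

# L^1 calibration error: frequency-weighted |accuracy - proxy|.
CE = np.average(abs(empirical_acc - calibrated_acc) ** 1, weights=weights_ece)
print(CE)  # (4 * 0.25 + 6 * 0.25) / 10 = 0.25
```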
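Finally, a sketch contrasting the two binning schemes `create_bins` supports, assuming `ece.py` is importable as a local module and the equal-mass spelling fix above is applied (the sample confidences are arbitrary):

```python
# Equal-range vs. equal-mass binning in create_bins (assumes ece.py is
# on the path as a local module).
import numpy as np
from ece import create_bins

p_max = np.array([0.55, 0.6, 0.62, 0.81, 0.9, 0.95, 0.97, 0.99])

# Equal-range: evenly spaced edges, independent of the data.
print(create_bins(n_bins=4, scheme="equal-range", bin_range=[0, 1]))
# -> [0.   0.25 0.5  0.75 1.  ]

# Equal-mass: data-dependent upper edges, ~N/n_bins points per bin;
# the final np.inf edge keeps any mass at 1.0 inside the last bin.
print(create_bins(n_bins=4, scheme="equal-mass", P=p_max))
# -> [0.6  0.81 0.95  inf]
```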
|