lvwerra (HF staff) committed on
Commit
0275df4
1 Parent(s): 7896580

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +125 -3
  2. app.py +6 -0
  3. f1.py +124 -0
  4. requirements.txt +4 -0
README.md CHANGED
@@ -1,12 +1,134 @@
  ---
  title: F1
- emoji: 🚀
  colorFrom: blue
- colorTo: green
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
  ---
  title: F1
+ emoji: 🤗
  colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---

+ # Metric Card for F1
+
+ ## Metric Description
+
+ The F1 score is the harmonic mean of the precision and recall. It can be computed with the equation:
+ F1 = 2 * (precision * recall) / (precision + recall)
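For instance, with one true positive, one false positive and one false negative, precision and recall are both 0.5, so the F1 score is also 0.5. The following is a minimal sketch (it assumes `scikit-learn` is installed, which this metric already requires) that checks the formula against `sklearn.metrics.f1_score`:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

references = [0, 1, 0, 1, 0]
predictions = [0, 0, 1, 1, 0]

# One true positive, one false positive and one false negative for the positive class (label 1).
p = precision_score(references, predictions)  # 0.5
r = recall_score(references, predictions)     # 0.5

manual_f1 = 2 * (p * r) / (p + r)             # harmonic mean of precision and recall

print(manual_f1)                              # 0.5
print(f1_score(references, predictions))      # 0.5, same value
```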
+
+ ## How to Use
+
+ At minimum, this metric requires predictions and references as input:
+
+ ```python
+ >>> f1_metric = evaluate.load("f1")
+ >>> results = f1_metric.compute(predictions=[0, 1], references=[0, 1])
+ >>> print(results)
+ {'f1': 1.0}
+ ```
+
+ ### Inputs
+ - **predictions** (`list` of `int`): Predicted labels.
+ - **references** (`list` of `int`): Ground truth labels.
+ - **labels** (`list` of `int`): The set of labels to include when `average` is not set to `'binary'`, and the order of the labels if `average` is `None`. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class (see the sketch after this list). Labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in `predictions` and `references` are used in sorted order. Defaults to None.
+ - **pos_label** (`int`): The class to be considered the positive class, in the case where `average` is set to `binary`. Defaults to 1.
+ - **average** (`string`): This parameter is required for multiclass/multilabel targets. If set to `None`, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Defaults to `'binary'`.
+     - 'binary': Only report results for the class specified by `pos_label`. This is applicable only if the classes found in `predictions` and `references` are binary.
+     - 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
+     - 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
+     - 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `'macro'` to account for label imbalance. This option can result in an F-score that is not between precision and recall.
+     - 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).
+ - **sample_weight** (`list` of `float`): Sample weights. Defaults to None.
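The interplay between `labels` and `average` is easy to miss, so here is a minimal sketch of it (the label values below are made up purely for illustration): restricting `labels` to `[1, 2]` drops the majority class `0` from the macro average.

```python
>>> import evaluate
>>> f1_metric = evaluate.load("f1")
>>> references = [0, 0, 0, 0, 1, 2, 2]    # class 0 is a majority "negative" class
>>> predictions = [0, 0, 0, 1, 1, 2, 0]
>>> # Macro average over all three classes, including the majority class 0.
>>> results = f1_metric.compute(predictions=predictions, references=references, average="macro")
>>> print(round(results['f1'], 2))
0.69
>>> # Restricting `labels` excludes class 0 from the average.
>>> results = f1_metric.compute(predictions=predictions, references=references, labels=[1, 2], average="macro")
>>> print(round(results['f1'], 2))
0.67
```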
48
+
49
+
50
+ ### Output Values
51
+ - **f1**(`float` or `array` of `float`): F1 score or list of f1 scores, depending on the value passed to `average`. Minimum possible value is 0. Maximum possible value is 1. Higher f1 scores are better.
52
+
53
+ Output Example(s):
54
+ ```python
55
+ {'f1': 0.26666666666666666}
56
+ ```
57
+ ```python
58
+ {'f1': array([0.8, 0.0, 0.0])}
59
+ ```
60
+
61
+ This metric outputs a dictionary, with either a single f1 score, of type `float`, or an array of f1 scores, with entries of type `float`.
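Because the type of `results['f1']` changes with `average`, downstream code that handles both cases may want to normalize the value first. A small sketch of one way to do that (assuming NumPy is available, which the metric already pulls in through scikit-learn):

```python
>>> import numpy as np
>>> import evaluate
>>> f1_metric = evaluate.load("f1")
>>> results = f1_metric.compute(predictions=[0, 2, 1, 0, 0, 1], references=[0, 1, 2, 0, 1, 2], average=None)
>>> scores = np.atleast_1d(results['f1'])   # works whether a single float or a per-class array came back
>>> print([round(float(s), 2) for s in scores])
[0.8, 0.0, 0.0]
```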
+
+ #### Values from Popular Papers
+
+
+ ### Examples
+
+ Example 1 - A simple binary example
+ ```python
+ >>> f1_metric = evaluate.load("f1")
+ >>> results = f1_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0])
+ >>> print(results)
+ {'f1': 0.5}
+ ```
+
+ Example 2 - The same simple binary example as in Example 1, but with `pos_label` set to `0`.
+ ```python
+ >>> f1_metric = evaluate.load("f1")
+ >>> results = f1_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0)
+ >>> print(round(results['f1'], 2))
+ 0.67
+ ```
+
+ Example 3 - The same simple binary example as in Example 1, but with `sample_weight` included.
+ ```python
+ >>> f1_metric = evaluate.load("f1")
+ >>> results = f1_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3])
+ >>> print(round(results['f1'], 2))
+ 0.35
+ ```
+
+ Example 4 - A multiclass example, with different values for the `average` input.
+ ```python
+ >>> predictions = [0, 2, 1, 0, 0, 1]
+ >>> references = [0, 1, 2, 0, 1, 2]
+ >>> results = f1_metric.compute(predictions=predictions, references=references, average="macro")
+ >>> print(round(results['f1'], 2))
+ 0.27
+ >>> results = f1_metric.compute(predictions=predictions, references=references, average="micro")
+ >>> print(round(results['f1'], 2))
+ 0.33
+ >>> results = f1_metric.compute(predictions=predictions, references=references, average="weighted")
+ >>> print(round(results['f1'], 2))
+ 0.27
+ >>> results = f1_metric.compute(predictions=predictions, references=references, average=None)
+ >>> print(results)
+ {'f1': array([0.8, 0. , 0. ])}
+ ```
+
+ ## Limitations and Bias
+
+
+ ## Citation(s)
+ ```bibtex
+ @article{scikit-learn,
+   title={Scikit-learn: Machine Learning in {P}ython},
+   author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
+          and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
+          and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
+          Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
+   journal={Journal of Machine Learning Research},
+   volume={12},
+   pages={2825--2830},
+   year={2011}
+ }
+ ```
+
+ ## Further References
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("f1")
+ launch_gradio_widget(module)
f1.py ADDED
@@ -0,0 +1,124 @@
+ # Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """F1 metric."""
+
+ import datasets
+ from sklearn.metrics import f1_score
+
+ import evaluate
+
+
+ _DESCRIPTION = """
+ The F1 score is the harmonic mean of the precision and recall. It can be computed with the equation:
+ F1 = 2 * (precision * recall) / (precision + recall)
+ """
+
+
+ _KWARGS_DESCRIPTION = """
+ Args:
+     predictions (`list` of `int`): Predicted labels.
+     references (`list` of `int`): Ground truth labels.
+     labels (`list` of `int`): The set of labels to include when `average` is not set to `'binary'`, and the order of the labels if `average` is `None`. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class. Labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in `predictions` and `references` are used in sorted order. Defaults to None.
+     pos_label (`int`): The class to be considered the positive class, in the case where `average` is set to `binary`. Defaults to 1.
+     average (`string`): This parameter is required for multiclass/multilabel targets. If set to `None`, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Defaults to `'binary'`.
+
+         - 'binary': Only report results for the class specified by `pos_label`. This is applicable only if the classes found in `predictions` and `references` are binary.
+         - 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
+         - 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
+         - 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `'macro'` to account for label imbalance. This option can result in an F-score that is not between precision and recall.
+         - 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).
+     sample_weight (`list` of `float`): Sample weights. Defaults to None.
+
+ Returns:
+     f1 (`float` or `array` of `float`): F1 score or list of f1 scores, depending on the value passed to `average`. Minimum possible value is 0. Maximum possible value is 1. Higher f1 scores are better.
+
+ Examples:
+
+     Example 1 - A simple binary example
+         >>> f1_metric = evaluate.load("f1")
+         >>> results = f1_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0])
+         >>> print(results)
+         {'f1': 0.5}
+
+     Example 2 - The same simple binary example as in Example 1, but with `pos_label` set to `0`.
+         >>> f1_metric = evaluate.load("f1")
+         >>> results = f1_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0)
+         >>> print(round(results['f1'], 2))
+         0.67
+
+     Example 3 - The same simple binary example as in Example 1, but with `sample_weight` included.
+         >>> f1_metric = evaluate.load("f1")
+         >>> results = f1_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3])
+         >>> print(round(results['f1'], 2))
+         0.35
+
+     Example 4 - A multiclass example, with different values for the `average` input.
+         >>> predictions = [0, 2, 1, 0, 0, 1]
+         >>> references = [0, 1, 2, 0, 1, 2]
+         >>> results = f1_metric.compute(predictions=predictions, references=references, average="macro")
+         >>> print(round(results['f1'], 2))
+         0.27
+         >>> results = f1_metric.compute(predictions=predictions, references=references, average="micro")
+         >>> print(round(results['f1'], 2))
+         0.33
+         >>> results = f1_metric.compute(predictions=predictions, references=references, average="weighted")
+         >>> print(round(results['f1'], 2))
+         0.27
+         >>> results = f1_metric.compute(predictions=predictions, references=references, average=None)
+         >>> print(results)
+         {'f1': array([0.8, 0. , 0. ])}
+ """
+
+
+ _CITATION = """
+ @article{scikit-learn,
+   title={Scikit-learn: Machine Learning in {P}ython},
+   author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
+          and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
+          and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
+          Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
+   journal={Journal of Machine Learning Research},
+   volume={12},
+   pages={2825--2830},
+   year={2011}
+ }
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class F1(evaluate.EvaluationModule):
+     def _info(self):
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Sequence(datasets.Value("int32")),
+                     "references": datasets.Sequence(datasets.Value("int32")),
+                 }
+                 if self.config_name == "multilabel"
+                 else {
+                     "predictions": datasets.Value("int32"),
+                     "references": datasets.Value("int32"),
+                 }
+             ),
+             reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html"],
+         )
+
+     def _compute(self, predictions, references, labels=None, pos_label=1, average="binary", sample_weight=None):
+         score = f1_score(
+             references, predictions, labels=labels, pos_label=pos_label, average=average, sample_weight=sample_weight
+         )
+         return {"f1": float(score) if score.size == 1 else score}
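The `config_name == "multilabel"` branch above switches the expected features from single integers to sequences of integers per example. A minimal, hedged sketch of how that configuration might be used (the label vectors are invented for illustration; `average` has to be set explicitly because the `'binary'` default does not apply to multilabel input):

```python
>>> import evaluate
>>> f1_multi = evaluate.load("f1", "multilabel")    # second argument selects the config name
>>> references = [[1, 0, 1], [0, 1, 0]]             # one binary indicator vector per example
>>> predictions = [[1, 0, 0], [0, 1, 1]]
>>> results = f1_multi.compute(predictions=predictions, references=references, average="macro")
>>> print(round(results['f1'], 2))
0.67
```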
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ sklearn