lvwerra (HF staff) committed
Commit a51af17
1 Parent(s): af542ea

Update Space (evaluate main: 2c6d460a)

Files changed (4):
  1. README.md +111 -6
  2. app.py +7 -0
  3. poseval.py +113 -0
  4. requirements.txt +3 -0
README.md CHANGED
@@ -1,12 +1,117 @@
  ---
- title: Poseval
- emoji: 📈
- colorFrom: yellow
- colorTo: purple
  sdk: gradio
- sdk_version: 3.1.1
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: poseval
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
+ sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
+ description: >-
+   The poseval metric can be used to evaluate POS taggers. Since seqeval does not work well with POS data
+   that is not in IOB format, poseval is an alternative. It treats each token in the dataset as an independent
+   observation and computes precision, recall and F1-score irrespective of sentences. It uses scikit-learn's
+   classification report to compute the scores.
  ---

+ # Metric Card for poseval
+
+ ## Metric description
+
+ The poseval metric can be used to evaluate POS taggers. Since seqeval does not work well with POS data that is not in IOB format (see e.g. [here](https://stackoverflow.com/questions/71327693/how-to-disable-seqeval-label-formatting-for-pos-tagging)), poseval is an alternative. It treats each token in the dataset as an independent observation and computes precision, recall and F1-score irrespective of sentences. It uses scikit-learn's [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) to compute the scores.
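+
+ Under the hood this is simply scikit-learn's classification report computed over the flattened label lists, mirroring the call made in poseval.py below. A minimal sketch of that equivalence (toy inputs assumed; `zero_division=0` avoids warnings for tags missing from one side):
+
+ ```python
+ from sklearn.metrics import classification_report
+
+ predictions = [["INTJ", "ADP", "PROPN"], ["NOUN", "VERB"]]
+ references = [["INTJ", "ADP", "PROPN"], ["PROPN", "VERB"]]
+
+ # Flatten the per-sentence lists into one long token sequence each.
+ y_pred = [label for sent in predictions for label in sent]
+ y_true = [label for sent in references for label in sent]
+
+ report = classification_report(y_true=y_true, y_pred=y_pred, output_dict=True, zero_division=0)
+ print(report["accuracy"])  # 0.8 (4 of 5 tokens match)
+ ```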
+
+
+ ## How to use
+
+ Poseval computes per-tag and aggregate labelling scores (precision, recall, F1 and accuracy) for predictions against references.
+
+ It takes two mandatory arguments:
+
+ `predictions`: a list of lists of predicted labels, i.e. estimated targets as returned by a tagger.
+
+ `references`: a list of lists of reference labels, i.e. the ground truth/target values.
+
+ It can also take one optional argument:
+
+ `zero_division`: the value to substitute as a metric value when encountering zero division. Should be one of [`0`, `1`, `"warn"`]. `"warn"` acts as `0`, but a warning is also raised.
+
+ ```python
+ >>> import evaluate
+ >>> predictions = [['INTJ', 'ADP', 'PROPN', 'NOUN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'VERB', 'SYM']]
+ >>> references = [['INTJ', 'ADP', 'PROPN', 'PROPN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'PROPN', 'SYM']]
+ >>> poseval = evaluate.load("poseval")
+ >>> results = poseval.compute(predictions=predictions, references=references)
+ >>> print(list(results.keys()))
+ ['ADP', 'INTJ', 'NOUN', 'PROPN', 'PUNCT', 'SYM', 'VERB', 'accuracy', 'macro avg', 'weighted avg']
+ >>> print(results["accuracy"])
+ 0.8
+ >>> print(results["PROPN"]["recall"])
+ 0.5
+ ```
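+
+ A tag that never occurs in the predictions (or in the references) makes one of its scores a 0/0 division, which is where `zero_division` applies. A small assumed example, continuing the session above:
+
+ ```python
+ >>> results = poseval.compute(predictions=[['NOUN', 'NOUN']], references=[['NOUN', 'VERB']], zero_division=0)
+ >>> print(results["VERB"]["precision"])  # VERB is never predicted, so 0/0 is replaced by 0
+ 0.0
+ ```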
+
+ ## Output values
+
+ This metric returns a classification report as a dictionary with a summary of scores, both overall and per tag:
+
+ Overall (as `accuracy`, `macro avg` and `weighted avg` entries):
+
+ `accuracy`: the overall [accuracy](https://huggingface.co/metrics/accuracy), on a scale between 0.0 and 1.0.
+
+ `precision`: the average [precision](https://huggingface.co/metrics/precision), on a scale between 0.0 and 1.0.
+
+ `recall`: the average [recall](https://huggingface.co/metrics/recall), on a scale between 0.0 and 1.0.
+
+ `f1-score`: the average [F1 score](https://huggingface.co/metrics/f1), which is the harmonic mean of the precision and recall. It also has a scale of 0.0 to 1.0.
+
+ Per tag (e.g. `NOUN`, `VERB`, `PROPN`, ...):
+
+ `precision`: the [precision](https://huggingface.co/metrics/precision) for that tag, on a scale between 0.0 and 1.0.
+
+ `recall`: the [recall](https://huggingface.co/metrics/recall) for that tag, on a scale between 0.0 and 1.0.
+
+ `f1-score`: the [F1 score](https://huggingface.co/metrics/f1) for that tag, on a scale between 0.0 and 1.0.
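+
+ Each of these entries is a nested dictionary whose keys follow scikit-learn's `classification_report` (`"precision"`, `"recall"`, `"f1-score"`, `"support"`), so individual scores can be read out directly, e.g. from the report computed in the How to use example:
+
+ ```python
+ >>> macro_f1 = results["macro avg"]["f1-score"]        # unweighted mean over tags
+ >>> weighted_p = results["weighted avg"]["precision"]  # mean weighted by tag support
+ >>> propn_support = results["PROPN"]["support"]        # number of reference tokens tagged PROPN
+ ```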
+
+
+ ## Examples
+
+ ```python
+ >>> import evaluate
+ >>> predictions = [['INTJ', 'ADP', 'PROPN', 'NOUN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'VERB', 'SYM']]
+ >>> references = [['INTJ', 'ADP', 'PROPN', 'PROPN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'PROPN', 'SYM']]
+ >>> poseval = evaluate.load("poseval")
+ >>> results = poseval.compute(predictions=predictions, references=references)
+ >>> print(list(results.keys()))
+ ['ADP', 'INTJ', 'NOUN', 'PROPN', 'PUNCT', 'SYM', 'VERB', 'accuracy', 'macro avg', 'weighted avg']
+ >>> print(results["accuracy"])
+ 0.8
+ >>> print(results["PROPN"]["recall"])
+ 0.5
+ ```
+
+ ## Limitations and bias
+
+ In contrast to [seqeval](https://github.com/chakki-works/seqeval), the poseval metric treats each token independently and computes the classification report over all concatenated sequences, so sentence boundaries and tag spans play no role in the scores.
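+
+ For instance, two adjacent identical tags are scored as two independent observations rather than as one span. A toy illustration (assumed inputs, `zero_division=0` to silence warnings for tags unseen on one side):
+
+ ```python
+ >>> results = poseval.compute(predictions=[['PROPN', 'PROPN', 'VERB']], references=[['PROPN', 'NOUN', 'VERB']], zero_division=0)
+ >>> print(results["PROPN"]["precision"])  # 1 correct of 2 predicted PROPN tokens
+ 0.5
+ ```
+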
+ ## Citation
+
+ ```bibtex
+ @article{scikit-learn,
+   title={Scikit-learn: Machine Learning in {P}ython},
+   author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
+     and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
+     and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
+     Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
+   journal={Journal of Machine Learning Research},
+   volume={12},
+   pages={2825--2830},
+   year={2011}
+ }
+ ```
+
+ ## Further References
+ - [README for seqeval at GitHub](https://github.com/chakki-works/seqeval)
+ - [Classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)
+ - [Issues with seqeval](https://stackoverflow.com/questions/71327693/how-to-disable-seqeval-label-formatting-for-pos-tagging)
app.py ADDED
@@ -0,0 +1,7 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("poseval")
+
+ launch_gradio_widget(module)
poseval.py ADDED
@@ -0,0 +1,113 @@
+ # Copyright 2022 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ poseval metric. """
+
+ from typing import Union
+
+ import datasets
+ from sklearn.metrics import classification_report
+
+ import evaluate
+
+
+ _CITATION = """\
+ @article{scikit-learn,
+   title={Scikit-learn: Machine Learning in {P}ython},
+   author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
+     and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
+     and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
+     Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
+   journal={Journal of Machine Learning Research},
+   volume={12},
+   pages={2825--2830},
+   year={2011}
+ }
+ """
+
+ _DESCRIPTION = """\
+ The poseval metric can be used to evaluate POS taggers. Since seqeval does not work well with POS data \
+ that is not in IOB format \
+ (see e.g. [here](https://stackoverflow.com/questions/71327693/how-to-disable-seqeval-label-formatting-for-pos-tagging)), \
+ the poseval metric is an alternative. It treats each token in the dataset as an independent \
+ observation and computes precision, recall and F1-score irrespective of sentences. It uses scikit-learn's \
+ [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) \
+ to compute the scores.
+
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Computes the poseval metric.
+
+ Args:
+     predictions: List of List of predicted labels (Estimated targets as returned by a tagger)
+     references: List of List of reference labels (Ground truth (correct) target values)
+     zero_division: Which value to substitute as a metric value when encountering zero division. Should be one of
+         0, 1, "warn". "warn" acts as 0, but a warning is also raised.
+
+ Returns:
+     'scores': dict. Summary of the scores for overall and per type
+         Overall (weighted and macro avg):
+             'accuracy': accuracy,
+             'precision': precision,
+             'recall': recall,
+             'f1-score': F1 score, also known as balanced F-score or F-measure,
+         Per type:
+             'precision': precision,
+             'recall': recall,
+             'f1-score': F1 score, also known as balanced F-score or F-measure
+ Examples:
+
+     >>> predictions = [['INTJ', 'ADP', 'PROPN', 'NOUN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'VERB', 'SYM']]
+     >>> references = [['INTJ', 'ADP', 'PROPN', 'PROPN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'PROPN', 'SYM']]
+     >>> poseval = evaluate.load("poseval")
+     >>> results = poseval.compute(predictions=predictions, references=references)
+     >>> print(list(results.keys()))
+     ['ADP', 'INTJ', 'NOUN', 'PROPN', 'PUNCT', 'SYM', 'VERB', 'accuracy', 'macro avg', 'weighted avg']
+     >>> print(results["accuracy"])
+     0.8
+     >>> print(results["PROPN"]["recall"])
+     0.5
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Poseval(evaluate.Metric):
+     def _info(self):
+         return evaluate.MetricInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             homepage="https://scikit-learn.org",
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Sequence(datasets.Value("string", id="label"), id="sequence"),
+                     "references": datasets.Sequence(datasets.Value("string", id="label"), id="sequence"),
+                 }
+             ),
+             codebase_urls=["https://github.com/scikit-learn/scikit-learn"],
+         )
+
+     def _compute(
+         self,
+         predictions,
+         references,
+         zero_division: Union[str, int] = "warn",
+     ):
+         # Flatten the nested per-sentence label lists and score all tokens at once
+         # with scikit-learn's classification report.
+         report = classification_report(
+             y_true=[label for ref in references for label in ref],
+             y_pred=[label for pred in predictions for label in pred],
+             output_dict=True,
+             zero_division=zero_division,
+         )
+
+         return report
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ git+https://github.com/huggingface/evaluate@a45df1eb9996eec64ec3282ebe554061cb366388
+ datasets~=2.0
+ scikit-learn