File size: 4,944 Bytes
af542ea
a51af17
 
 
 
af542ea
29658e3
af542ea
 
a51af17
 
 
 
 
 
 
 
af542ea
 
a51af17
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
---
title: poseval
emoji: 🤗 
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  The poseval metric can be used to evaluate POS taggers. Since seqeval does not work well with POS data 
  that is not in IOB format the poseval is an alternative. It treats each token in the dataset as independant 
  observation and computes the precision, recall and F1-score irrespective of sentences. It uses scikit-learns's
  classification report to compute the scores.
---

# Metric Card for peqeval

## Metric description

The poseval metric can be used to evaluate POS taggers. Since seqeval does not work well with POS data (see e.g. [here](https://stackoverflow.com/questions/71327693/how-to-disable-seqeval-label-formatting-for-pos-tagging)) that is not in IOB format the poseval is an alternative. It treats each token in the dataset as independant observation and computes the precision, recall and F1-score irrespective of sentences. It uses scikit-learns's [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) to compute the scores.


## How to use 

Poseval produces labelling scores along with its sufficient statistics from a source against references.

It takes two mandatory arguments:

`predictions`: a list of lists of predicted labels, i.e. estimated targets as returned by a tagger.

`references`: a list of lists of reference labels, i.e. the ground truth/target values.

It can also take several optional arguments:

`zero_division`: Which value to substitute as a metric value when encountering zero division. Should be one of [`0`,`1`,`"warn"`]. `"warn"` acts as `0`, but the warning is raised.


```python
>>> predictions = [['INTJ', 'ADP', 'PROPN', 'NOUN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'VERB', 'SYM']]
>>> references = [['INTJ', 'ADP', 'PROPN', 'PROPN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'PROPN', 'SYM']]
>>> poseval = evaluate.load("poseval")
>>> results = poseval.compute(predictions=predictions, references=references)
>>> print(list(results.keys()))
['ADP', 'INTJ', 'NOUN', 'PROPN', 'PUNCT', 'SYM', 'VERB', 'accuracy', 'macro avg', 'weighted avg']
>>> print(results["accuracy"])
0.8
>>> print(results["PROPN"]["recall"])
0.5
```

## Output values

This metric returns a a classification report as a dictionary with a summary of scores for overall and per type:

Overall (weighted and macro avg):

`accuracy`: the average [accuracy](https://huggingface.co/metrics/accuracy), on a scale between 0.0 and 1.0.
    
`precision`: the average [precision](https://huggingface.co/metrics/precision), on a scale between 0.0 and 1.0.
    
`recall`: the average [recall](https://huggingface.co/metrics/recall), on a scale between 0.0 and 1.0.

`f1`: the average [F1 score](https://huggingface.co/metrics/f1), which is the harmonic mean of the precision and recall. It also has a scale of 0.0 to 1.0.

Per type (e.g. `MISC`, `PER`, `LOC`,...):

`precision`: the average [precision](https://huggingface.co/metrics/precision), on a scale between 0.0 and 1.0.

`recall`: the average [recall](https://huggingface.co/metrics/recall), on a scale between 0.0 and 1.0.

`f1`: the average [F1 score](https://huggingface.co/metrics/f1), on a scale between 0.0 and 1.0.


## Examples 

```python
>>> predictions = [['INTJ', 'ADP', 'PROPN', 'NOUN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'VERB', 'SYM']]
>>> references = [['INTJ', 'ADP', 'PROPN', 'PROPN', 'PUNCT', 'INTJ', 'ADP', 'PROPN', 'PROPN', 'SYM']]
>>> poseval = evaluate.load("poseval")
>>> results = poseval.compute(predictions=predictions, references=references)
>>> print(list(results.keys()))
['ADP', 'INTJ', 'NOUN', 'PROPN', 'PUNCT', 'SYM', 'VERB', 'accuracy', 'macro avg', 'weighted avg']
>>> print(results["accuracy"])
0.8
>>> print(results["PROPN"]["recall"])
0.5
```

## Limitations and bias

In contrast to [seqeval](https://github.com/chakki-works/seqeval), the poseval metric treats each token independently and computes the classification report over all concatenated sequences..


## Citation

```bibtex
@article{scikit-learn,
 title={Scikit-learn: Machine Learning in {P}ython},
 author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
         and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
         and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
         Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
 journal={Journal of Machine Learning Research},
 volume={12},
 pages={2825--2830},
 year={2011}
}
```
    
## Further References 
- [README for seqeval at GitHub](https://github.com/chakki-works/seqeval)
- [Classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) 
- [Issues with seqeval](https://stackoverflow.com/questions/71327693/how-to-disable-seqeval-label-formatting-for-pos-tagging)