---
title: Negbleurt
emoji: 🌖
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 3.38.0
app_file: app.py
pinned: false
license: mit
---
# Metric Card for NegBLEURT


## Metric Description

NegBLEURT is the negation-aware version of the BLEURT metric. It can be used to evaluate generated text against a reference.
BLEURT is a learned evaluation metric for natural language generation. It is built using multiple phases of transfer learning: starting from a pretrained BERT model (Devlin et al. 2018), it goes through an additional pre-training phase on synthetic data, and is finally trained on WMT human annotations and the CANNOT negation-awareness dataset.
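For intuition, the sketch below shows how a BERT-based, regression-style metric of this kind is typically applied with the `transformers` library: a cross-encoder scores each (reference, prediction) pair. This is an illustrative sketch only; the checkpoint name is a placeholder, not necessarily the released NegBLEURT model (see the GitHub repository linked below for the official weights), and the recommended way to use the metric is via `evaluate.load`, as shown in the next section.

```python
# Illustrative sketch only: a BERT-style regression model scoring
# (reference, prediction) pairs, as learned metrics like (Neg)BLEURT do.
# CHECKPOINT is a placeholder assumption -- substitute the released model
# from the NegBLEURT repository.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "path/to/negbleurt-checkpoint"  # placeholder, not the official name

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

references = ["Ray Charles is legendary."]
predictions = ["Ray Charles isn't legendary."]

# Tokenize each (reference, prediction) pair jointly and read off one score per pair.
inputs = tokenizer(references, predictions, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits.squeeze(-1)
print(scores.tolist())
```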

## How to Use

At minimum, this metric requires predictions and references as inputs.

```python
>>> import evaluate
>>> negBLEURT = evaluate.load('tum-nlp/negbleurt')
>>> predictions = ["Ray Charles is a legend.", "Ray Charles isn’t legendary."]
>>> references = ["Ray Charles is legendary.", "Ray Charles is legendary."]
>>> results = negBLEURT.compute(predictions=predictions, references=references)
>>> print(results)
{'negBLEURT': [0.8409, 0.2601]}
```


### Inputs
- **predictions** (list of `str`): predictions to score. Each prediction should be a string.
- **references** (list of `str`): references, one for each prediction. Each reference should be a string.
- **batch_size** (`int`, optional): batch size for model inference. Defaults to 16; see the example below.
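For example, to trade memory for speed you can pass a different batch size to `compute` (the value 8 below is illustrative; 16 is the default):

```python
>>> results = negBLEURT.compute(
...     predictions=predictions,
...     references=references,
...     batch_size=8,  # illustrative value; default is 16
... )
```
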
### Output Values
- **negBLEURT** (list of `float`): NegBLEURT scores, one per prediction. Values usually range between 0 and 1, where 1 indicates a perfect prediction and 0 indicates a poor fit.

Output example:
```python
{'negBLEURT': [0.8409, 0.2601]}
```
The metric outputs a dictionary containing the negBLEURT scores, one per prediction.
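A short sketch of how the returned dictionary can be used, for example to pair each score with its prediction (variable names follow the example above):

```python
>>> for pred, score in zip(predictions, results['negBLEURT']):
...     print(f"{score:.4f}  {pred}")
0.8409  Ray Charles is a legend.
0.2601  Ray Charles isn’t legendary.
```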


## Limitations and Bias
This metric is based on BERT (Devlin et al. 2018) and as such inherits its biases and weaknesses. However, it was trained in a negation-aware setting and thus overcomes BERT's issues with negation.

Currently, NegBLEURT is only available in English.
## Citation
Please cite our [INLG 2023 paper](https://arxiv.org/abs/2307.13989) if you use our metric.

**BibTeX:**
```bibtex
@misc{anschütz2023correct,
      title={This is not correct! Negation-aware Evaluation of Language Generation Systems}, 
      author={Miriam Anschütz and Diego Miguel Lozano and Georg Groh},
      year={2023},
      eprint={2307.13989},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
## Further References
- The original [NegBLEURT GitHub repo](https://github.com/MiriUll/negation_aware_evaluation)