google_bleu / README.md
lvwerra's picture
lvwerra HF staff
Update Space (evaluate main: 05209ece)
f2b5e98
|
raw
history blame
9.27 kB
---
title: Google BLEU
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
The BLEU score has some undesirable properties when used for single
sentences, as it was designed to be a corpus measure. We therefore
use a slightly different score for our RL experiments which we call
the 'GLEU score'. For the GLEU score, we record all sub-sequences of
1, 2, 3 or 4 tokens in output and target sequence (n-grams). We then
compute a recall, which is the ratio of the number of matching n-grams
to the number of total n-grams in the target (ground truth) sequence,
and a precision, which is the ratio of the number of matching n-grams
to the number of total n-grams in the generated output sequence. Then
GLEU score is simply the minimum of recall and precision. This GLEU
score's range is always between 0 (no matches) and 1 (all match) and
it is symmetrical when switching output and target. According to
our experiments, GLEU score correlates quite well with the BLEU
metric on a corpus level but does not have its drawbacks for our per
sentence reward objective.
---
# Metric Card for Google BLEU
## Metric Description
The BLEU score has some undesirable properties when used for single sentences, as it was designed to be a corpus measure. The Google BLEU score is designed to limit these undesirable properties when used for single sentences.
To calculate this score, all sub-sequences of 1, 2, 3 or 4 tokens in output and target sequence (n-grams) are recorded. The precision and recall, described below, are then computed.
- **precision:** the ratio of the number of matching n-grams to the number of total n-grams in the generated output sequence
- **recall:** the ratio of the number of matching n-grams to the number of total n-grams in the target (ground truth) sequence
The minimum value of precision and recall is then returned as the score.
## Intended Uses
This metric is generally used to evaluate machine translation models. It is especially used when scores of individual (prediction, reference) sentence pairs are needed, as opposed to when averaging over the (prediction, reference) scores for a whole corpus. That being said, it can also be used when averaging over the scores for a whole corpus.
Because it performs better on individual sentence pairs as compared to BLEU, Google BLEU has also been used in RL experiments.
## How to Use
This metric takes a list of predicted sentences, as well as a list of references.
```python
sentence1 = "the cat sat on the mat"
sentence2 = "the cat ate the mat"
google_bleu = evaluate.load("google_bleu")
result = google_bleu.compute(predictions=[sentence1], references=[[sentence2]])
print(result)
>>> {'google_bleu': 0.3333333333333333}
```
### Inputs
- **predictions** (list of str): list of translations to score.
- **references** (list of list of str): list of lists of references for each translation.
- **tokenizer** : approach used for tokenizing `predictions` and `references`.
The default tokenizer is `tokenizer_13a`, a minimal tokenization approach that is equivalent to `mteval-v13a`, used by WMT. This can be replaced by any function that takes a string as input and returns a list of tokens as output.
- **min_len** (int): The minimum order of n-gram this function should extract. Defaults to 1.
- **max_len** (int): The maximum order of n-gram this function should extract. Defaults to 4.
### Output Values
This metric returns the following in a dict:
- **google_bleu** (float): google_bleu score
The output format is as follows:
```python
{'google_bleu': google_bleu score}
```
This metric can take on values from 0 to 1, inclusive. Higher scores are better, with 0 indicating no matches, and 1 indicating a perfect match.
Note that this score is symmetrical when switching output and target. This means that, given two sentences, `sentence1` and `sentence2`, whatever score is output when `sentence1` is the predicted sentence and `sencence2` is the reference sentence will be the same as when the sentences are swapped and `sentence2` is the predicted sentence while `sentence1` is the reference sentence. In code, this looks like:
```python
predictions = "the cat sat on the mat"
references = "the cat ate the mat"
google_bleu = evaluate.load("google_bleu")
result_a = google_bleu.compute(predictions=[predictions], references=[[references]])
result_b = google_bleu.compute(predictions=[predictions], references=[[references]])
print(result_a == result_b)
>>> True
```
#### Values from Popular Papers
### Examples
Example with one reference per sample:
```python
>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
>>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat'], ['he was interested in world history because he read the book']]
>>> google_bleu = evaluate.load("google_bleu")
>>> results = google_bleu.compute(predictions=predictions, references=references)
>>> print(round(results["google_bleu"], 2))
0.44
```
Example with multiple references for the first sample:
```python
>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
>>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', 'It is a guide to action that ensures that the rubber duck will never heed the cat commands', 'It is the practical guide for the rubber duck army never to heed the directions of the cat'], ['he was interested in world history because he read the book']]
>>> google_bleu = evaluate.load("google_bleu")
>>> results = google_bleu.compute(predictions=predictions, references=references)
>>> print(round(results["google_bleu"], 2))
0.61
```
Example with multiple references for the first sample, and with `min_len` adjusted to `2`, instead of the default `1`, which means that the function extracts n-grams of length `2`:
```python
>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
>>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', 'It is a guide to action that ensures that the rubber duck will never heed the cat commands', 'It is the practical guide for the rubber duck army never to heed the directions of the cat'], ['he was interested in world history because he read the book']]
>>> google_bleu = evaluate.load("google_bleu")
>>> results = google_bleu.compute(predictions=predictions, references=references, min_len=2)
>>> print(round(results["google_bleu"], 2))
0.53
```
Example with multiple references for the first sample, with `min_len` adjusted to `2`, instead of the default `1`, and `max_len` adjusted to `6` instead of the default `4`:
```python
>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
>>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', 'It is a guide to action that ensures that the rubber duck will never heed the cat commands', 'It is the practical guide for the rubber duck army never to heed the directions of the cat'], ['he was interested in world history because he read the book']]
>>> google_bleu = evaluate.load("google_bleu")
>>> results = google_bleu.compute(predictions=predictions,references=references, min_len=2, max_len=6)
>>> print(round(results["google_bleu"], 2))
0.4
```
## Limitations and Bias
The GoogleBLEU metric does not come with a predefined tokenization function; previous versions simply used `split()` to split the input strings into tokens. Using a tokenizer such as the default one, `tokenizer_13a`, makes results more standardized and reproducible. The BLEU and sacreBLEU metrics also use this default tokenizer.
## Citation
```bibtex
@misc{wu2016googles,
title={Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation},
author={Yonghui Wu and Mike Schuster and Zhifeng Chen and Quoc V. Le and Mohammad Norouzi and Wolfgang Macherey and Maxim Krikun and Yuan Cao and Qin Gao and Klaus Macherey and Jeff Klingner and Apurva Shah and Melvin Johnson and Xiaobing Liu and Łukasz Kaiser and Stephan Gouws and Yoshikiyo Kato and Taku Kudo and Hideto Kazawa and Keith Stevens and George Kurian and Nishant Patil and Wei Wang and Cliff Young and Jason Smith and Jason Riesa and Alex Rudnick and Oriol Vinyals and Greg Corrado and Macduff Hughes and Jeffrey Dean},
year={2016},
eprint={1609.08144},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Further References
- This Hugging Face implementation uses the [nltk.translate.gleu_score implementation](https://www.nltk.org/_modules/nltk/translate/gleu_score.html)