Update Space (evaluate main: 05209ece)
README.md CHANGED
@@ -10,6 +10,22 @@ pinned: false
 tags:
 - evaluate
 - metric
+description: >-
+  The BLEU score has some undesirable properties when used for single
+  sentences, as it was designed to be a corpus measure. We therefore
+  use a slightly different score for our RL experiments which we call
+  the 'GLEU score'. For the GLEU score, we record all sub-sequences of
+  1, 2, 3 or 4 tokens in output and target sequence (n-grams). We then
+  compute a recall, which is the ratio of the number of matching n-grams
+  to the number of total n-grams in the target (ground truth) sequence,
+  and a precision, which is the ratio of the number of matching n-grams
+  to the number of total n-grams in the generated output sequence. Then
+  GLEU score is simply the minimum of recall and precision. This GLEU
+  score's range is always between 0 (no matches) and 1 (all match) and
+  it is symmetrical when switching output and target. According to
+  our experiments, GLEU score correlates quite well with the BLEU
+  metric on a corpus level but does not have its drawbacks for our per
+  sentence reward objective.
 ---

 # Metric Card for Google BLEU
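
The description added in this diff spells out the GLEU recipe: collect all 1- to 4-token sub-sequences (n-grams) from both the output and the target, compute n-gram precision and recall against the matching n-grams, and take the minimum of the two. A minimal Python sketch of that recipe follows; the function names and the clipped-count matching via `Counter` are illustrative assumptions, not the reference implementation.

```python
from collections import Counter

def _ngram_counts(tokens, max_n=4):
    """Count all sub-sequences of 1..max_n tokens, as in the description above."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def sentence_gleu(output_tokens, target_tokens, max_n=4):
    """GLEU = min(precision, recall) over matching 1..max_n n-grams."""
    out_counts = _ngram_counts(output_tokens, max_n)
    tgt_counts = _ngram_counts(target_tokens, max_n)
    # Matching n-grams: element-wise minimum (clipped overlap) of the two count tables.
    matches = sum((out_counts & tgt_counts).values())
    # Guard against empty sequences to avoid division by zero.
    precision = matches / max(sum(out_counts.values()), 1)
    recall = matches / max(sum(tgt_counts.values()), 1)
    return min(precision, recall)

# Identical sequences score 1.0; fully disjoint sequences score 0.0.
print(sentence_gleu("the cat sat on the mat".split(),
                    "the cat sat on the mat".split()))  # 1.0
```

If this Space wraps the standard `google_bleu` metric from the evaluate library, the packaged score can be computed with `evaluate.load("google_bleu").compute(predictions=..., references=...)` rather than the sketch above.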