---
license: apache-2.0
---

# MetricX-23

*This is not an officially supported Google product.*

**GitHub repository: [https://github.com/google-research/metricx](https://github.com/google-research/metricx)**

This repository contains the MetricX-23 models, a family of models for
automatic evaluation of translations that were proposed in the WMT'23 Metrics
Shared Task submission
[MetricX-23: The Google Submission to the WMT 2023 Metrics Shared Task](https://aclanthology.org/2023.wmt-1.63/).
The models were trained in [T5X](https://github.com/google-research/t5x) and
then converted for use in PyTorch.

## Available Models

There are 6 models available on HuggingFace that vary in the number of
parameters and whether the model is reference-based or reference-free (also
known as quality estimation, or QE):

* [MetricX-23-XXL](https://huggingface.co/google/metricx-23-xxl-v2p0)
* [MetricX-23-XL](https://huggingface.co/google/metricx-23-xl-v2p0)
* [MetricX-23-Large](https://huggingface.co/google/metricx-23-large-v2p0)
* [MetricX-23-QE-XXL](https://huggingface.co/google/metricx-23-qe-xxl-v2p0)
* [MetricX-23-QE-XL](https://huggingface.co/google/metricx-23-qe-xl-v2p0)
* [MetricX-23-QE-Large](https://huggingface.co/google/metricx-23-qe-large-v2p0)

We recommend using the XXL model versions for the best agreement with human
judgments of translation quality, the Large versions for the best speed, and
the XL versions for an intermediate use case.

## Changes to the WMT'23 Submission

The models available here are most similar to the primary submission to the
WMT'23 Metrics Shared Task. They are initialized with
[mT5](https://aclanthology.org/2021.naacl-main.41/) and then fine-tuned on a
combination of direct assessment (DA) and MQM data. However, we made some
changes that make these models different from the WMT'23 submissions.

First, the models are trained to regress the actual MQM score rather than a
normalized score between 0 and 1. **That means the output from the MetricX-23
models is a score in the range [0, 25] where lower is better (i.e., it predicts
an error score).**

Second, these models were trained with a larger variety of synthetic data that
makes them more robust to translation edge cases like over- and
undertranslation, described in more detail in the following section.

### Synthetic Data

In order for our MetricX models to learn to identify certain types of bad
translations that are not sufficiently (or at all) represented in the regular
training data, we created synthetic examples and mixed them in during training.
The synthetic training data was generated from the DA datasets ranging from
WMT15 to WMT21 (~43 language pairs). In most cases, the synthetic examples have
the candidate translation manipulated so as to turn it into a bad translation
with a specific issue commonly unrecognized by learned metrics.

The table below provides an overview of the various failure modes that we
considered, including brief descriptions of how we prepared the synthetic data
to address them.

| Failure mode | Synthetic example description |
| ----------- | ----------- |
| Undertranslation | Candidate translation with an arbitrary sentence removed (if multi-sentence); alternatively, candidate with a certain proportion of words removed from the end. |
| Overtranslation | Candidate translation duplicated (with space in between). |
| Fluent but unrelated translation | Arbitrary reference of a similar length from the dataset. |
| Gibberish | Text of a similar length as the reference, generated by sampling words from the reference translation vocabulary (built from all references in the data). |
| Missing punctuation | Reference translation with the end punctuation removed (11 punctuation symbols considered). |
| Latin instead of Chinese/Japanese or Hindi/Bengali punctuation | Candidate translation with the language-specific punctuation symbol at the end replaced with the Latin equivalent (e.g., "." instead of "。" or "।"); alternatively, the punctuation symbol is replaced with the Latin equivalent in the reference, keeping the correct one in the candidate. |
| Reference-matching translation | Reference translation copied as the candidate translation (unlike the rest of the synthetic data, these examples are meant to train the metric to predict a perfect score for candidates matching the reference). |

Examples from the first 4 categories were assigned a label corresponding to the
worst score on the given rating scale (e.g., 25 when mixed with MQM training
data), whereas the reference-matching translation examples were assigned the
best score (e.g., 0 when used with MQM data). The missing/incorrect punctuation
examples were labeled with a score slightly worse than perfect.

Note that some of the synthetic datasets are only meaningful in the
reference-based scenario, and we thus excluded them when training a QE variant
of MetricX. These are the Latin-vs-special punctuation and the
reference-matching translation examples.

Most of the synthetic training sets were created using stratified sampling
across target languages, taking 500 examples per target language. One exception
is the missing punctuation set, which used a stratified sample across different
punctuation symbols instead.

When training MetricX, a small proportion of the synthetic examples was mixed
with the regular training examples. During the first-stage fine-tuning on DA
data, each synthetic training set constituted between 0.1% and 1% of all
training examples, whereas in the second-stage fine-tuning on MQM data we used
an even smaller proportion, around 0.05%.

As for evaluating the effect of the synthetic training data on the model's
performance, the DEMETR challenge set, which we originally used to evaluate the
models submitted to the WMT'23 Metrics Shared Task, was no longer adequate. We
therefore created a new DEMETR-style test set based on the WMT22 DA data, with
examples constructed analogously to the synthetic training examples described
above. This test set helped us determine the right proportions of synthetic
data for fine-tuning in order to make MetricX robust to the failure modes under
consideration, without sacrificing the system- and segment-level correlations
with human ratings.

## Usage

The code for using the MetricX models can be found at
[https://github.com/google-research/metricx](https://github.com/google-research/metricx).
The repository contains example prediction scripts, described below.

The `metricx23/predict.py` script shows how to run inference with the models.

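To obtain these scripts, one option is to clone the repository; the sketch
below only fetches the code, and any dependency installation should follow the
instructions in the repository itself:

```bash
# Fetch the MetricX code, including metricx23/predict.py and the evaluation
# scripts. Dependency installation is covered by the repository's own
# instructions and is not shown here.
git clone https://github.com/google-research/metricx.git
cd metricx
```
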
### Reference-Based

Example usage for a reference-based model:

```bash
python -m metricx23.predict \
  --tokenizer google/mt5-xl \
  --model_name_or_path google/metricx-23-xl-v2p0 \
  --max_input_length 1024 \
  --batch_size 1 \
  --input_file input.jsonl \
  --output_file output.jsonl
```

`input.jsonl` is expected to have one serialized JSON object per line with
`"reference"` and `"hypothesis"` fields. The output JSONL file will be parallel
to `input.jsonl` but will additionally contain a `"prediction"` field with the
predicted score.

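As a minimal illustration, the input file could be created as follows (the
sentences are made up, and the `jq` call is just one way to pull out the
scores afterwards):

```bash
# Create a tiny reference-based input file: one JSON object per line with
# "reference" and "hypothesis" fields.
cat > input.jsonl << 'EOF'
{"reference": "Dies ist ein Test.", "hypothesis": "Das ist ein Test."}
{"reference": "Guten Morgen!", "hypothesis": "Guten Tag!"}
EOF

# After running the predict command above, extract the predicted error scores
# (lower is better). Requires the jq tool.
jq '.prediction' output.jsonl
```
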
Note that the model was trained with a maximum input length of 1024 tokens, so
significantly increasing that value may lead to unpredictable behavior.

### Reference-Free

Example usage for a reference-free model:

```bash
python -m metricx23.predict \
  --tokenizer google/mt5-xl \
  --model_name_or_path google/metricx-23-qe-xl-v2p0 \
  --max_input_length 1024 \
  --batch_size 1 \
  --input_file input.jsonl \
  --output_file output.jsonl \
  --qe
```

`input.jsonl` is expected to have one serialized JSON object per line with
`"source"` and `"hypothesis"` fields. The output JSONL file will be parallel
to `input.jsonl` but will additionally contain a `"prediction"` field with the
predicted score.

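A minimal reference-free input file looks the same, except that it carries the
source sentence instead of the reference (the sentences below are only
illustrative):

```bash
# Create a tiny QE input file: one JSON object per line with "source" and
# "hypothesis" fields.
cat > input.jsonl << 'EOF'
{"source": "This is a test.", "hypothesis": "Das ist ein Test."}
{"source": "Good morning!", "hypothesis": "Guten Tag!"}
EOF
```
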
## Meta-Evaluation

The `metricx23/evaluate.py` script contains code to calculate various
correlations between the MetricX-23 scores and MQM ratings of translation
quality using the
[MT Metrics Eval](https://github.com/google-research/mt-metrics-eval) library.

Example usage:

```bash
python -m metricx23.evaluate \
  --dataset wmt22 \
  --lp en-de \
  --input_file input.jsonl \
  --output_file output.json
```

`input.jsonl` is expected to have one serialized JSON object per line. Each
JSON object is expected to contain 4 fields (see the example line after this
list):

* `"system_id"`: The name of the system that generated the translation.
* `"segment_id"`: The 0-based index of the corresponding segment in the MT
  Metrics Eval data.
* `"label"`: The ground-truth translation quality score (higher is better).
* `"prediction"`: The translation quality score predicted by the model (lower
  is better; the script negates the scores so that higher is better).

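For example, a single line of this file might look like the following; the
system name, segment index, and scores are invented and only illustrate the
expected fields:

```bash
# One illustrative line of the meta-evaluation input file. Real values come
# from the MT Metrics Eval data and from running the MetricX prediction script.
cat > input.jsonl << 'EOF'
{"system_id": "example-system", "segment_id": 0, "label": -5.0, "prediction": 3.2}
EOF
```
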
The script calculates the 4 agreement and correlation statistics that were used
in the WMT'23 Shared Task. Below are the results for the MetricX-23 models on
the WMT'22 Metrics Shared Task data:

English-German:

| Model | System-Level Accuracy | System-Level Pearson | Segment-Level Pearson | Segment-Level Pairwise Accuracy |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| MetricX-23-XXL | 0.795 | 0.835 | 0.546 | 0.619 |
| MetricX-23-XL | 0.756 | 0.813 | 0.540 | 0.605 |
| MetricX-23-Large | 0.769 | 0.759 | 0.507 | 0.595 |
| MetricX-23-QE-XXL | 0.769 | 0.830 | 0.490 | 0.606 |
| MetricX-23-QE-XL | 0.718 | 0.684 | 0.421 | 0.594 |
| MetricX-23-QE-Large | 0.744 | 0.671 | 0.387 | 0.579 |

English-Russian:

| Model | System-Level Accuracy | System-Level Pearson | Segment-Level Pearson | Segment-Level Pairwise Accuracy |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| MetricX-23-XXL | 0.905 | 0.943 | 0.477 | 0.609 |
| MetricX-23-XL | 0.876 | 0.906 | 0.498 | 0.589 |
| MetricX-23-Large | 0.876 | 0.841 | 0.474 | 0.569 |
| MetricX-23-QE-XXL | 0.895 | 0.940 | 0.470 | 0.602 |
| MetricX-23-QE-XL | 0.848 | 0.861 | 0.415 | 0.570 |
| MetricX-23-QE-Large | 0.819 | 0.778 | 0.411 | 0.551 |

Chinese-English:

| Model | System-Level Accuracy | System-Level Pearson | Segment-Level Pearson | Segment-Level Pairwise Accuracy |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| MetricX-23-XXL | 0.868 | 0.919 | 0.605 | 0.551 |
| MetricX-23-XL | 0.868 | 0.924 | 0.584 | 0.543 |
| MetricX-23-Large | 0.857 | 0.919 | 0.555 | 0.539 |
| MetricX-23-QE-XXL | 0.857 | 0.928 | 0.573 | 0.544 |
| MetricX-23-QE-XL | 0.802 | 0.879 | 0.546 | 0.529 |
| MetricX-23-QE-Large | 0.758 | 0.904 | 0.522 | 0.529 |

The `metricx23/evaluate_wmt23.py` script re-calculates the average correlation
score that was used to rank submissions in the
[WMT'23 Shared Task](https://www2.statmt.org/wmt23/pdf/2023.wmt-1.51.pdf).

Example usage:

```bash
python -m metricx23.evaluate_wmt23 \
  --en_de predictions_ende.jsonl \
  --he_en predictions_heen.jsonl \
  --zh_en predictions_zhen.jsonl \
  --output_file output.json
```

Each of the 3 input files is expected to be in the same format as described
above. Each file should correspond to running inference on one of the language
pairs from the WMT'23 dataset.

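One way to produce the three prediction files is to run the prediction script
once per language pair, as in the sketch below. The per-language-pair input
file names are assumptions, and each file is assumed to already carry the
`"system_id"`, `"segment_id"`, and `"label"` fields alongside the translation
fields; whether the prediction script passes such extra fields through to its
output should be verified against the repository.

```bash
# Sketch: run reference-based inference once for each WMT'23 language pair
# consumed by evaluate_wmt23.py. Input file names are assumptions.
for lp in ende heen zhen; do
  python -m metricx23.predict \
    --tokenizer google/mt5-xl \
    --model_name_or_path google/metricx-23-xl-v2p0 \
    --max_input_length 1024 \
    --batch_size 1 \
    --input_file "input_${lp}.jsonl" \
    --output_file "predictions_${lp}.jsonl"
done
```
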
The results for each of the models are the following:

| Model | Average Correlation |
| ----------- | ----------- |
| MetricX-23-XXL | 0.812 |
| MetricX-23-XL | 0.813 |
| MetricX-23-Large | 0.794 |
| MetricX-23-QE-XXL | 0.797 |
| MetricX-23-QE-XL | 0.767 |
| MetricX-23-QE-Large | 0.762 |

## Citation

If you use MetricX-23 in your research, please cite the following publication:

```bibtex
@inproceedings{juraska-etal-2023-metricx,
    title = {{MetricX-23: The Google Submission to the WMT 2023 Metrics Shared Task}},
    author = "Juraska, Juraj and
      Finkelstein, Mara and
      Deutsch, Daniel and
      Siddhant, Aditya and
      Mirzazadeh, Mehdi and
      Freitag, Markus",
    editor = "Koehn, Philipp and
      Haddow, Barry and
      Kocmi, Tom and
      Monz, Christof",
    booktitle = "Proceedings of the Eighth Conference on Machine Translation",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.wmt-1.63",
    doi = "10.18653/v1/2023.wmt-1.63",
    pages = "756--767",
}
```