---
license: apache-2.0
---
# MetricX-23

*This is not an officially supported Google product.*

**GitHub repository: [https://github.com/google-research/metricx](https://github.com/google-research/metricx)**

This repository contains the MetricX-23 models, a family of models for
automatic evaluation of translations that were proposed in the WMT'23 Metrics
Shared Task submission
[MetricX-23: The Google Submission to the WMT 2023 Metrics Shared Task](https://aclanthology.org/2023.wmt-1.63/).
The models were trained in [T5X](https://github.com/google-research/t5x) and
then converted for use in PyTorch.

## Available Models

There are 6 models available on HuggingFace that vary in the number of
parameters and whether the model is reference-based or reference-free (also
known as quality estimation, or QE):

* [MetricX-23-XXL](https://huggingface.co/google/metricx-23-xxl-v2p0)
* [MetricX-23-XL](https://huggingface.co/google/metricx-23-xl-v2p0)
* [MetricX-23-Large](https://huggingface.co/google/metricx-23-large-v2p0)
* [MetricX-23-QE-XXL](https://huggingface.co/google/metricx-23-qe-xxl-v2p0)
* [MetricX-23-QE-XL](https://huggingface.co/google/metricx-23-qe-xl-v2p0)
* [MetricX-23-QE-Large](https://huggingface.co/google/metricx-23-qe-large-v2p0)

We recommend using the XXL model versions for the best agreement with human
judgments of translation quality, the Large versions for the best speed, and
the XL versions for an intermediate use case.

## Changes to the WMT'23 Submission

The models available here are most similar to the primary submission to the
WMT'23 Metrics Shared Task. They are initialized with
[mT5](https://aclanthology.org/2021.naacl-main.41/) and then fine-tuned on a
combination of direct assessment and MQM data. However, we made some changes
that make these models different from the WMT'23 submissions.

First, the models are trained to regress the actual MQM score rather than a
normalized score between 0 and 1. **That means the output from the MetricX-23
models is a score in the range [0, 25] where lower is better (i.e., it predicts
an error score).**

Second, these models were trained with a larger variety of synthetic data that
makes them more robust to translation edge cases like over- and undertranslation,
described in more detail in the following section.

### Synthetic Data

In order for our MetricX models to learn to identify certain types of bad
translations that are not sufficiently (or at all) represented in the regular
training data, we created synthetic examples and mixed them in during training.
The synthetic training data was generated from the DA datasets ranging from
WMT15 to WMT21 (~43 language pairs). In most cases, the synthetic examples have
the candidate translation manipulated so as to turn it into a bad translation
with a specific issue commonly unrecognized by learned metrics.

The table below provides an overview of the various failure modes that we
considered, including brief descriptions of how we prepared the synthetic data
to address them.

| Failure mode | Synthetic example description |
| ----------- | ----------- |
| Undertranslation | Candidate translation with an arbitrary sentence removed (if multi-sentence); alternatively, candidate with a certain proportion of words removed from the end. |
| Overtranslation | Candidate translation duplicated (with space in between). |
| Fluent but unrelated translation | Arbitrary reference of a similar length from the dataset. |
| Gibberish | Text of a similar length as the reference, generated by sampling words from the reference translation vocabulary (built from all references in the data). |
| Missing punctuation | Reference translation with the end punctuation removed (11 punctuation symbols considered). |
| Latin instead of Chinese/Japanese or Hindi/Bengali punctuation | Candidate translation with the language-specific punctuation symbol at the end replaced with the Latin equivalent (e.g., "." instead of "。" or "।"); alternatively, the punctuation symbol is replaced with the Latin equivalent in the reference, keeping the correct one in the candidate. |
| Reference-matching translation | Reference translation copied as the candidate translation (unlike the rest of the synthetic data, these examples are meant to train the metric to predict a perfect score for candidates matching the reference). |

Examples from the first 4 categories were assigned a label corresponding to the
worst score on the given rating scale (e.g., 25 when mixed with MQM training
data), whereas the reference-matching translation examples were assigned the
best score (e.g., 0 when used with MQM data). The missing/incorrect punctuation
examples were labeled with a score slightly worse than perfect.

Note that some of the synthetic datasets are only meaningful in the
reference-based scenario, and we thus excluded them when training a QE variant
of MetricX. These are the Latin-vs-special punctuation and the
reference-matching translation examples.

Most of the synthetic training sets were created using stratified sampling
across target languages, taking 500 examples per target language. One exception
is the missing punctuation set, which used a stratified sample across different
punctuation symbols instead.

When training MetricX, a small proportion of the synthetic examples was mixed
with the regular training examples. During the first-stage fine-tuning on DA
data, each synthetic training set constituted between 0.1% and 1% of all
training examples, whereas in the second-stage fine-tuning on MQM data we used
an even smaller proportion, around 0.05%.

As for evaluating the effect of the synthetic training data on the model's
performance, the DEMETR challenge set, which we originally used to evaluate the
models submitted to the WMT'23 Metrics Shared Task, was no longer adequate. We
therefore created a new DEMETR-style test set based on the WMT22 DA data, with
examples constructed analogously to the synthetic training examples, as
described above. This test set helped us determine the right proportions of
synthetic data for fine-tuning in order to make MetricX robust to the failure
modes under consideration, without sacrificing the system- and segment-level
correlations with human ratings.

## Usage

The code for using the MetricX models can be found at
[https://github.com/google-research/metricx](https://github.com/google-research/metricx).
The repository contains example prediction scripts, described below.

The `metricx23/predict.py` script contains an example of how to run inference
with the models.

### Reference-Based

Example usage for a reference-based model:

```bash
python -m metricx23.predict \
  --tokenizer google/mt5-xl \
  --model_name_or_path google/metricx-23-xl-v2p0 \
  --max_input_length 1024 \
  --batch_size 1 \
  --input_file input.jsonl \
  --output_file output.jsonl
```

`input.jsonl` is expected to have one serialized JSON object per line with
`"reference"` and `"hypothesis"` fields. The output JSONL file will be parallel
to `input.jsonl` but will additionally contain a `"prediction"` field with the
predicted score.

Note that the model was trained with a maximum input length of 1024 tokens, so
significantly increasing that value may lead to unpredictable behavior.
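
As an illustration, here is a minimal sketch of how a reference-based
`input.jsonl` could be created; the sentences are made-up placeholders, not
from any dataset.

```bash
# Create a two-example input file: one JSON object per line with
# "reference" and "hypothesis" fields (placeholder sentences).
cat > input.jsonl << 'EOF'
{"reference": "The cat sat on the mat.", "hypothesis": "The cat sat on the mat."}
{"reference": "The cat sat on the mat.", "hypothesis": "A dog stood near a rug."}
EOF
```

Each line of the resulting `output.jsonl` would then additionally carry a
`"prediction"` field; since MetricX-23 predicts an error score in [0, 25], the
reference-matching first example would be expected to receive a score close to 0.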

### Reference-Free

Example usage for a reference-free model:

```bash
python -m metricx23.predict \
  --tokenizer google/mt5-xl \
  --model_name_or_path google/metricx-23-qe-xl-v2p0 \
  --max_input_length 1024 \
  --batch_size 1 \
  --input_file input.jsonl \
  --output_file output.jsonl \
  --qe
```

`input.jsonl` is expected to have one serialized JSON object per line with
`"source"` and `"hypothesis"` fields. The output JSONL file will be parallel
to `input.jsonl` but will additionally contain a `"prediction"` field with the
predicted score.
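
Analogously, a minimal sketch for a reference-free (QE) `input.jsonl`, again
with made-up placeholder sentences:

```bash
# For the QE models, each line has "source" and "hypothesis" fields
# instead of a "reference" field (placeholder sentences).
cat > input.jsonl << 'EOF'
{"source": "Die Katze saß auf der Matte.", "hypothesis": "The cat sat on the mat."}
{"source": "Die Katze saß auf der Matte.", "hypothesis": "A dog stood near a rug."}
EOF
```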

## Meta-Evaluation

The `metricx23/evaluate.py` script contains code to calculate various
correlations between the MetricX-23 scores and MQM ratings of translation
quality using the
[MT Metrics Eval](https://github.com/google-research/mt-metrics-eval) library.

Example usage:

```bash
python -m metricx23.evaluate \
  --dataset wmt22 \
  --lp en-de \
  --input_file input.jsonl \
  --output_file output.json
```

`input.jsonl` is expected to have one JSON object serialized per line.
Each JSON object is expected to contain 4 fields (an example line is sketched
after this list):

* `"system_id"`: The name of the system that generated the translation.
* `"segment_id"`: The 0-based index of the corresponding segment in the MT
  Metrics Eval data.
* `"label"`: The ground-truth translation quality score (where higher is better).
* `"prediction"`: The model-predicted translation quality score (where lower is
  better; the script negates the scores so that higher is better).
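
Here is a minimal sketch of what one such line might look like. The system name
and numeric values are placeholders for illustration only; in practice, the
`"system_id"` and `"segment_id"` values must correspond to the MT Metrics Eval
data for the chosen `--dataset` and `--lp`.

```bash
# Inspect the first line of a (hypothetical) meta-evaluation input file.
head -n 1 input.jsonl
# {"system_id": "example-mt-system", "segment_id": 0, "label": -1.0, "prediction": 1.2}
```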

The script will calculate the 4 agreement/correlation statistics that were used
in the WMT'23 Shared Task. Below are the results for the MetricX-23 models on
the WMT'22 Metrics Shared Task data:

English-German:

| Model | System-Level Accuracy | System-Level Pearson | Segment-Level Pearson | Segment-Level Pairwise Accuracy |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| MetricX-23-XXL | 0.795 | 0.835 | 0.546 | 0.619 |
| MetricX-23-XL | 0.756 | 0.813 | 0.540 | 0.605 |
| MetricX-23-Large | 0.769 | 0.759 | 0.507 | 0.595 |
| MetricX-23-QE-XXL | 0.769 | 0.830 | 0.490 | 0.606 |
| MetricX-23-QE-XL | 0.718 | 0.684 | 0.421 | 0.594 |
| MetricX-23-QE-Large | 0.744 | 0.671 | 0.387 | 0.579 |

English-Russian:

| Model | System-Level Accuracy | System-Level Pearson | Segment-Level Pearson | Segment-Level Pairwise Accuracy |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| MetricX-23-XXL | 0.905 | 0.943 | 0.477 | 0.609 |
| MetricX-23-XL | 0.876 | 0.906 | 0.498 | 0.589 |
| MetricX-23-Large | 0.876 | 0.841 | 0.474 | 0.569 |
| MetricX-23-QE-XXL | 0.895 | 0.940 | 0.470 | 0.602 |
| MetricX-23-QE-XL | 0.848 | 0.861 | 0.415 | 0.570 |
| MetricX-23-QE-Large | 0.819 | 0.778 | 0.411 | 0.551 |

Chinese-English:

| Model | System-Level Accuracy | System-Level Pearson | Segment-Level Pearson | Segment-Level Pairwise Accuracy |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| MetricX-23-XXL | 0.868 | 0.919 | 0.605 | 0.551 |
| MetricX-23-XL | 0.868 | 0.924 | 0.584 | 0.543 |
| MetricX-23-Large | 0.857 | 0.919 | 0.555 | 0.539 |
| MetricX-23-QE-XXL | 0.857 | 0.928 | 0.573 | 0.544 |
| MetricX-23-QE-XL | 0.802 | 0.879 | 0.546 | 0.529 |
| MetricX-23-QE-Large | 0.758 | 0.904 | 0.522 | 0.529 |

The `metricx23/evaluate_wmt23.py` script re-calculates the average correlation
score that was used to rank submissions from the
[WMT'23 Shared Task](https://www2.statmt.org/wmt23/pdf/2023.wmt-1.51.pdf).

Example usage:

```bash
python -m metricx23.evaluate_wmt23 \
  --en_de predictions_ende.jsonl \
  --he_en predictions_heen.jsonl \
  --zh_en predictions_zhen.jsonl \
  --output_file output.json
```

Each of the 3 input files is expected to be in the same format as described
above. Each file should contain the result of running inference on one of the
language pairs from the WMT'23 dataset.
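
Before running the script, a quick sanity check such as the following
(hypothetical, and assuming `jq` is installed) can confirm that each file is
line-delimited JSON with the 4 fields described in the Meta-Evaluation section:

```bash
# Print the keys of the first record in each prediction file; the file names
# match the example command above.
for f in predictions_ende.jsonl predictions_heen.jsonl predictions_zhen.jsonl; do
  echo "${f}:"
  head -n 1 "${f}" | jq 'keys'
done
```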

The results for each of the models are the following:

| Model | Average Correlation |
| ----------- | ----------- |
| MetricX-23-XXL | 0.812 |
| MetricX-23-XL | 0.813 |
| MetricX-23-Large | 0.794 |
| MetricX-23-QE-XXL | 0.797 |
| MetricX-23-QE-XL | 0.767 |
| MetricX-23-QE-Large | 0.762 |

## Citation

If you use MetricX-23 in your research, please cite the following publication:

```bibtex
@inproceedings{juraska-etal-2023-metricx,
    title = {{MetricX-23: The Google Submission to the WMT 2023 Metrics Shared Task}},
    author = "Juraska, Juraj and
      Finkelstein, Mara and
      Deutsch, Daniel and
      Siddhant, Aditya and
      Mirzazadeh, Mehdi and
      Freitag, Markus",
    editor = "Koehn, Philipp and
      Haddow, Barry and
      Kocmi, Tom and
      Monz, Christof",
    booktitle = "Proceedings of the Eighth Conference on Machine Translation",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.wmt-1.63",
    doi = "10.18653/v1/2023.wmt-1.63",
    pages = "756--767",
}
```