---
license: apache-2.0
---
# MetricX-24

*This is not an officially supported Google product.*

**GitHub repository**: https://github.com/google-research/metricx

The repository contains the code for running inference on MetricX-24 models,
a family of models for the automatic evaluation of translations that were
proposed in the WMT'24 Metrics Shared Task submission
[MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task](https://aclanthology.org/2024.wmt-1.35/).
The models were trained in [T5X](https://github.com/google-research/t5x) and
then converted for use in PyTorch.


## Available Models

There are three MetricX-24 models available on Hugging Face, varying in their
number of parameters. Unlike the MetricX-23 models, the MetricX-24 models are
all hybrid models that can perform both reference-based and reference-free
(also known as quality estimation, or QE) inference:

* [MetricX-24-Hybrid-XXL](https://huggingface.co/google/metricx-24-hybrid-xxl-v2p6)
* [MetricX-24-Hybrid-XL](https://huggingface.co/google/metricx-24-hybrid-xl-v2p6)
* [MetricX-24-Hybrid-Large](https://huggingface.co/google/metricx-24-hybrid-large-v2p6)

We recommend the XXL version for the best agreement with human judgments of
translation quality, the Large version for the best speed, and the XL version
for an intermediate use case.
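
As a minimal sketch of how inference can be run, the snippet below prepares an
input file for the inference script in the GitHub repository. The JSONL field
names (`source`, `hypothesis`, `reference`) and the `metricx24.predict`
invocation in the trailing comment are assumptions based on that repository,
not a verified API; consult the repository for the authoritative usage and
flags.

```python
import json

# Hypothetical input records: with a hybrid model, the same checkpoint can
# score a translation with a reference (reference-based) or without one (QE).
examples = [
    {"source": "Hallo Welt!", "hypothesis": "Hello world!",
     "reference": "Hello, world!"},  # reference-based scoring
    {"source": "Hallo Welt!", "hypothesis": "Hello world!",
     "reference": ""},               # reference-free (QE) scoring, assumed
]

# Write the examples as one JSON object per line (JSONL).
with open("input.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

# Assumed invocation (see the GitHub repository for the exact module and flags):
#   python -m metricx24.predict \
#     --tokenizer google/mt5-xl \
#     --model_name_or_path google/metricx-24-hybrid-xl-v2p6 \
#     --input_file input.jsonl \
#     --output_file output.jsonl
```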


## Changes to the WMT'24 Submission

The MetricX-24 models available here are most similar to the primary submission
to the WMT'24 Metrics Shared Task. They are initialized with
[mT5](https://aclanthology.org/2021.naacl-main.41/) and then fine-tuned on a
combination of direct assessment and MQM data from WMT'15-'22. However, we made
two small changes that distinguish these models from the WMT'24 submissions.

First, the metric scores are automatically clipped at 0 and 25 to ensure they
fall strictly within the [0, 25] range. Because these are regression models,
the raw scores could otherwise occasionally fall outside that range.
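
As a minimal illustration (not the repository's actual code), the clipping is
a simple clamp of the raw regression output:

```python
def clip_score(raw_score: float) -> float:
    """Clamp a raw regression output into the [0, 25] score range."""
    return min(max(raw_score, 0.0), 25.0)

assert clip_score(-1.3) == 0.0   # below-range scores become 0
assert clip_score(25.8) == 25.0  # above-range scores become 25
assert clip_score(12.4) == 12.4  # in-range scores are unchanged
```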

Second, we included one additional type of synthetic training example that was
not ready in time for the official submission: perfect translations of
multi-sentence segments, generated from the WMT'20-'22 MQM data. The purpose of
this category of synthetic data is to reduce the model's bias against longer
translations when the source segment and/or reference is also long.


## Model Performance

For comparison with the submissions to the
[WMT'24 Metrics Shared Task](https://www2.statmt.org/wmt24/pdf/2024.wmt-1.2.pdf),
we provide an overview of the system- and segment-level correlation scores
between the MetricX-24 scores and the MQM ratings of translation quality,
calculated on the shared task's test sets. SPA denotes the shared task's
system-level soft pairwise accuracy, and Acc its segment-level pairwise
accuracy:

| Model | Sys-Level SPA (en-de) | Seg-Level Acc (en-de) | Sys-Level SPA (en-es) | Seg-Level Acc (en-es) | Sys-Level SPA (ja-zh) | Seg-Level Acc (ja-zh) |
| -------------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| MetricX-24-Hybrid-XXL | 0.865 | 0.543 | 0.785 | 0.685 | 0.878 | 0.541 |
| MetricX-24-Hybrid-XL | 0.884 | 0.522 | 0.806 | 0.683 | 0.859 | 0.528 |
| MetricX-24-Hybrid-Large | 0.879 | 0.511 | 0.795 | 0.686 | 0.845 | 0.514 |
| MetricX-24-Hybrid-QE-XXL | 0.884 | 0.525 | 0.789 | 0.685 | 0.863 | 0.527 |
| MetricX-24-Hybrid-QE-XL | 0.879 | 0.502 | 0.774 | 0.683 | 0.849 | 0.509 |
| MetricX-24-Hybrid-QE-Large | 0.809 | 0.490 | 0.762 | 0.684 | 0.847 | 0.508 |

Below are the above correlation scores averaged, as used in the shared task to
determine the final ranking of the submissions:

| Model | Average Correlation |
| -------------------------- | ----- |
| MetricX-24-Hybrid-XXL | 0.716 |
| MetricX-24-Hybrid-XL | 0.714 |
| MetricX-24-Hybrid-Large | 0.705 |
| MetricX-24-Hybrid-QE-XXL | 0.712 |
| MetricX-24-Hybrid-QE-XL | 0.699 |
| MetricX-24-Hybrid-QE-Large | 0.683 |
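
These averages are consistent with an unweighted mean of the six correlation
columns in the first table, which a short check confirms:

```python
# Unweighted mean of the six system- and segment-level columns above,
# shown for the first and last rows of the first table.
scores = {
    "MetricX-24-Hybrid-XXL": [0.865, 0.543, 0.785, 0.685, 0.878, 0.541],
    "MetricX-24-Hybrid-QE-Large": [0.809, 0.490, 0.762, 0.684, 0.847, 0.508],
}
for model, values in scores.items():
    print(model, round(sum(values) / len(values), 3))
# MetricX-24-Hybrid-XXL 0.716
# MetricX-24-Hybrid-QE-Large 0.683
```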

NOTE: Since the MetricX-24 models are hybrid models, MetricX-24-\<size\> and
MetricX-24-QE-\<size\> correspond to the same model, evaluated *with* and
*without* the references, respectively.


## Citation

If you use MetricX-24 in your research, please cite the following publication:

```bibtex
@inproceedings{juraska-etal-2024-metricx,
    title = "{M}etric{X}-24: The {G}oogle Submission to the {WMT} 2024 Metrics Shared Task",
    author = "Juraska, Juraj  and
      Deutsch, Daniel  and
      Finkelstein, Mara  and
      Freitag, Markus",
    editor = "Haddow, Barry  and
      Kocmi, Tom  and
      Koehn, Philipp  and
      Monz, Christof",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.wmt-1.35",
    pages = "492--504",
}
```