---
license: apache-2.0
datasets:
- projecte-aina/CA-ZH_Parallel_Corpus
language:
- zh
- ca
base_model:
- facebook/m2m100_1.2B
---
## Projecte Aina's Chinese-Catalan machine translation model

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)

</details>


## Model description

This machine translation model is built on top of M2M100 1.2B. It was trained on a combination of Catalan-Chinese datasets totalling 94,187,858 sentence pairs: 113,305 pairs were parallel data collected from the web, while the remaining 94,074,553 pairs were synthetic parallel data created using the [Aina Project's Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca) and the [Aina Project's English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca). The model was evaluated on the Flores, NTREX, and Projecte Aina's Catalan-Chinese evaluation datasets.
## Intended uses and limitations

You can use this model for machine translation from Simplified Chinese to Catalan.

## How to use

### Usage

Translate a sentence using Python:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "projecte-aina/aina-translator-zh-ca-v2"

model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

sentence = "欢迎来到 Aina 项目！"

# M2M100 is a multilingual model: set the source language on the
# tokenizer and force the Catalan language token at the start of
# generation.
tokenizer.src_lang = "zh"
input_ids = tokenizer(sentence, return_tensors="pt").input_ids
output_ids = model.generate(
    input_ids,
    forced_bos_token_id=tokenizer.get_lang_id("ca"),
    max_length=200,
    num_beams=5,
)

generated_translation = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print(generated_translation)
# Benvingut al projecte Aina!
```


## Limitations and bias
At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.

## Training

### Training data

The Catalan-Chinese data collected from the web was a combination of the following datasets:

| Dataset | Sentences before cleaning |
|-------------------|---------------:|
| OpenSubtitles | 139,300 |
| WikiMatrix | 90,643 |
| Wikipedia | 68,623 |
| **Total** | **298,566** |

The 94,074,553 sentence pairs of synthetic parallel data were created from the following Spanish-Chinese and English-Chinese datasets:

**Spanish-Chinese:**

| Dataset | Sentences before cleaning |
|-------------------|---------------:|
| NLLB | 24,051,233 |
| UNPC | 17,599,223 |
| MultiUN | 9,847,770 |
| OpenSubtitles | 9,319,658 |
| MultiParaCrawl | 3,410,087 |
| MultiCCAligned | 3,006,694 |
| WikiMatrix | 1,214,322 |
| News Commentary | 375,982 |
| Tatoeba | 9,404 |
| **Total** | **68,834,373** |

**English-Chinese:**

| Dataset | Sentences before cleaning |
|-------------------|---------------:|
| NLLB | 71,383,325 |
| CCAligned | 15,181,415 |
| Paracrawl | 14,170,869 |
| WikiMatrix | 2,595,119 |
| **Total** | **103,330,728** |


### Training procedure

#### Data preparation

The Chinese side of all datasets was first processed using the [Hanzi Identifier](https://github.com/tsroten/hanzidentifier) to detect Traditional Chinese, which was subsequently converted to Simplified Chinese using [OpenCC](https://github.com/BYVoid/OpenCC).

All data was then filtered according to two criteria:

- Alignment: sentence-level alignment scores were calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE), and sentence pairs with a score below 0.75 were discarded.

- Language identification: the probability that each sentence was in the expected language was calculated using [Lingua.py](https://github.com/pemistahl/lingua-py), and sentences with a probability below 0.5 were discarded.

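As an illustration, the two filters above can be combined into a single pass over the corpus. This is a minimal sketch, not the project's actual pipeline: the scoring functions are injected as plain callables standing in for LaBSE similarity and Lingua.py language probabilities, so only the thresholds (0.75 and 0.5) come from the description above.

```python
from typing import Callable, Iterable, List, Tuple

def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    align_score: Callable[[str, str], float],  # stand-in for LaBSE cosine similarity
    lang_prob: Callable[[str], float],         # stand-in for Lingua.py probability
    align_threshold: float = 0.75,
    lang_threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Keep only sentence pairs that pass both the alignment and
    language-identification filters."""
    kept = []
    for src, tgt in pairs:
        if align_score(src, tgt) < align_threshold:
            continue  # poorly aligned pair: discard
        if lang_prob(src) < lang_threshold or lang_prob(tgt) < lang_threshold:
            continue  # sentence probably not in the expected language: discard
        kept.append((src, tgt))
    return kept
```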
Next, the Spanish data was translated into Catalan using the Aina Project's [Spanish-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-es-ca), while the English data was translated into Catalan using the Aina Project's [English-Catalan machine translation model](https://huggingface.co/projecte-aina/aina-translator-en-ca).

The filtered and translated datasets were then concatenated and deduplicated to form a final corpus of 94,187,858 sentence pairs.
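
The concatenation-and-deduplication step can be sketched as a simple exact-match filter. This is a hypothetical helper for illustration, not the script actually used; it keeps the first occurrence of each source-target pair:

```python
def dedup_pairs(*corpora):
    """Concatenate several lists of (source, target) pairs and drop
    exact duplicates, keeping the first occurrence of each pair."""
    seen = set()
    merged = []
    for corpus in corpora:
        for pair in corpus:
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged
```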


#### Training

Training was executed on NVIDIA GPUs using the Hugging Face Transformers framework. The model was trained for 244,500 updates, and weights were saved every 500 updates.


## Evaluation

### Variables and metrics

Below are the evaluation results on [Flores-200](https://github.com/facebookresearch/flores/tree/main/flores200), [NTREX](https://github.com/MicrosoftTranslator/NTREX), and Projecte Aina's Catalan-Chinese test sets, compared to Google Translate for the ZH-CA direction. The evaluation was conducted using [`tower-eval`](https://github.com/deep-spin/tower-eval) with its standard setting (beam search with beam size 5, limiting the translation length to 200 tokens). We report the following metrics:

- BLEU: SacreBLEU implementation, version 2.4.0.
- ChrF: SacreBLEU implementation.
- Comet: model checkpoint "Unbabel/wmt22-comet-da".
- Comet-kiwi: model checkpoint "Unbabel/wmt22-cometkiwi-da".


### Evaluation results

Below are the evaluation results for machine translation from Chinese to Catalan, compared to [Google Translate](https://translate.google.com/):


#### Flores200-dev

| | Bleu ↑ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ |
|:-------------------------|----------:|----------:|---------:|-------------:|
| aina-translator-zh-ca-v2 | 26.74 | 54.49 | **0.86** | **0.82** |
| Google Translate | **27.71** | **55.37** | **0.86** | 0.81 |


#### Flores200-devtest

| | Bleu ↑ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ |
|:-------------------------|----------:|----------:|---------:|-------------:|
| aina-translator-zh-ca-v2 | 27.17 | 55.02 | **0.86** | **0.81** |
| Google Translate | **27.47** | **55.51** | **0.86** | **0.81** |


#### NTREX

| | Bleu ↑ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ |
|:-------------------------|----------:|----------:|---------:|-------------:|
| aina-translator-zh-ca-v2 | 22.43 | 50.65 | **0.83** | **0.79** |
| Google Translate | **23.49** | **51.29** | **0.83** | **0.79** |


#### Projecte Aina's Catalan-Chinese evaluation dataset

| | Bleu ↑ | ChrF ↑ | Comet ↑ | Comet-kiwi ↑ |
|:-------------------------|----------:|----------:|---------:|-------------:|
| aina-translator-zh-ca-v2 | **29.21** | 57.41 | **0.87** | **0.82** |
| Google Translate | 28.86 | **57.73** | **0.87** | **0.82** |


## Additional information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <langtech@bsc.es>.

### Copyright
Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it), or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties.

</details>