fdelucaf committed on
Commit
b1fc2b6
1 Parent(s): 4b74637

Update README.md

Files changed (1)
  1. README.md +49 -45
README.md CHANGED
@@ -1,32 +1,19 @@
1
  ---
2
  license: apache-2.0
 
 
 
 
 
 
3
  ---
4
- ## Aina Project's Catalan-Spanish machine translation model
5
-
6
- ## Table of Contents
7
- - [Model Description](#model-description)
8
- - [Intended Uses and Limitations](#intended-use)
9
- - [How to Use](#how-to-use)
10
- - [Training](#training)
11
- - [Training data](#training-data)
12
- - [Training procedure](#training-procedure)
13
- - [Data Preparation](#data-preparation)
14
- - [Tokenization](#tokenization)
15
- - [Hyperparameters](#hyperparameters)
16
- - [Evaluation](#evaluation)
17
- - [Variable and Metrics](#variable-and-metrics)
18
- - [Evaluation Results](#evaluation-results)
19
- - [Additional Information](#additional-information)
20
- - [Author](#author)
21
- - [Contact Information](#contact-information)
22
- - [Copyright](#copyright)
23
- - [Licensing Information](#licensing-information)
24
- - [Funding](#funding)
25
- - [Disclaimer](#disclaimer)
26
 
27
  ## Model description
28
 
29
- This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Spanish datasets, up to 92 million sentences. Additionally, the model is evaluated on several public datasets comprising 5 different domains (general, administrative, technology, biomedical, and news).
 
 
30
 
31
  ## Intended uses and limitations
32
 
@@ -46,7 +33,7 @@ Translate a sentence using python
46
  import ctranslate2
47
  import pyonmttok
48
  from huggingface_hub import snapshot_download
49
- model_dir = snapshot_download(repo_id="projecte-aina/mt-aina-ca-es", revision="main")
50
 
51
  tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model")
52
  tokenized = tokenizer.tokenize("Benvingut al projecte Aina!")
@@ -56,6 +43,10 @@ translated = translator.translate_batch([tokenized[0]])
56
  print(tokenizer.detokenize(translated[0][0]['tokens']))
57
  ```
58
 
 
 
 
 
59
  ## Training
60
 
61
  ### Training data
@@ -80,14 +71,17 @@ The was trained on a combination of the following datasets:
80
 
81
  ### Data preparation
82
 
83
- All datasets are concatenated and filtered using the [mBERT Gencata parallel filter](https://huggingface.co/projecte-aina/mbert-base-gencata) and cleaned using the clean-corpus-n.pl script from [moses](https://github.com/moses-smt/mosesdecoder), allowing sentences between 5 and 150 words.
 
84
 
85
- Before training, the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
 
86
 
87
 
88
  #### Tokenization
89
 
90
- All data is tokenized using SentencePiece, with a 50,000-token SentencePiece model learned from the combination of all filtered training data. This model is included.
 
91
 
92
  #### Hyperparameters
93
 
@@ -115,19 +109,26 @@ The following hyperparamenters were set on the Fairseq toolkit:
115
  | Dropout | 0.1 |
116
  | Label smoothing | 0.1 |
117
 
118
- The model was trained using shards of 10 million sentences, for a total of 13,000 updates. Weights were saved every 1000 updates and reported results are the average of the last 6 checkpoints.
 
119
 
120
  ## Evaluation
121
 
122
  ### Variable and metrics
123
 
124
- We use the BLEU score for evaluation on test sets: [Flores-101](https://github.com/facebookresearch/flores), [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/), [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0), [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/), [wmt19 biomedical test set](), [wmt13 news test set](https://elrc-share.eu/repository/browse/catalan-wmt2013-machine-translation-shared-task-test-set/84a96139b98611ec9c1a00155d0267061a0aa1b62e2248e89aab4952f3c230fc/)
 
 
 
 
 
125
 
126
  ### Evaluation results
127
 
128
- Below are the evaluation results on the machine translation from Catalan to Spanish compared to [Softcatalà](https://www.softcatala.org/) and [Google Translate](https://translate.google.es/?hl=es):
 
129
 
130
- | Test set | SoftCatalà | Google Translate | mt-aina-ca-es |
131
  |----------------------|------------|------------------|---------------|
132
  | Spanish Constitution | 70.7 | **77.1** | 75.5 |
133
  | United Nations | 78.1 | 84.3 | **86.3** |
@@ -139,34 +140,37 @@ Below are the evaluation results on the machine translation from Catalan to Span
139
  | aina_aapp_ca-es | 80.9 | 81.4 | **82.8** |
140
  | Average | 53.4 | 56.7 | **56.8** |
141
 
142
-
143
  ## Additional information
144
 
145
  ### Author
146
- Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
147
 
148
- ### Contact information
149
- For further information, send an email to aina@bsc.es
150
 
151
  ### Copyright
152
- Copyright (c) 2022 Text Mining Unit at Barcelona Supercomputing Center
153
-
154
 
155
- ### Licensing Information
156
  [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
157
 
158
  ### Funding
159
- This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
160
 
161
- ## Limitations and Bias
162
- At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
163
 
164
- ## Disclaimer
165
  <details>
166
  <summary>Click to expand</summary>
167
 
168
- The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
 
 
 
 
 
 
169
 
170
- When third parties, deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
 
171
 
172
- In no event shall the owner and creator of the models (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.
 
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - ca
5
+ - es
6
+ metrics:
7
+ - bleu
8
+ library_name: fairseq
9
  ---
10
+ ## Aina Project's Catalan-Spanish machine translation model
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
11
 
12
  ## Model description
13
 
14
+ This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Spanish datasets,
15
+ up to 92 million sentences. Additionally, the model is evaluated on several public datasets comprising 5 different domains (general, administrative, technology,
16
+ biomedical, and news).
17
 
18
  ## Intended uses and limitations
19
 
 
33
  import ctranslate2
34
  import pyonmttok
35
  from huggingface_hub import snapshot_download
36
+ model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-es", revision="main")
37
 
38
  tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model")
39
  tokenized = tokenizer.tokenize("Benvingut al projecte Aina!")
 
43
  print(tokenizer.detokenize(translated[0][0]['tokens']))
44
  ```
45
 
46
+ ## Limitations and bias
47
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
48
+ However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
49
+
50
  ## Training
51
 
52
  ### Training data
 
71
 
72
  ### Data preparation
73
 
74
+ All datasets are concatenated and filtered using the [mBERT Gencata parallel filter](https://huggingface.co/projecte-aina/mbert-base-gencata) and
75
+ cleaned using the clean-corpus-n.pl script from [moses](https://github.com/moses-smt/mosesdecoder), allowing sentences between 5 and 150 words.
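
As a rough illustration of the length criterion only, the sketch below keeps a sentence pair when both sides have between 5 and 150 whitespace-separated words. It is not the Moses `clean-corpus-n.pl` script, whose exact behaviour may differ, and it does not reproduce the mBERT Gencata filter. The names `keep_pair` and `filter_corpus` are hypothetical, and the example sentences are made up for illustration.

```python
from typing import Iterable, List, Tuple


def keep_pair(src: str, tgt: str, min_len: int = 5, max_len: int = 150) -> bool:
    """Return True when both sides have between min_len and max_len words.

    Words are counted by splitting on whitespace (an assumption of this sketch).
    """
    for side in (src, tgt):
        n_words = len(side.split())
        if n_words < min_len or n_words > max_len:
            return False
    return True


def filter_corpus(
    pairs: Iterable[Tuple[str, str]], min_len: int = 5, max_len: int = 150
) -> List[Tuple[str, str]]:
    """Keep only the sentence pairs that satisfy the length bounds."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, min_len, max_len)]


# Hypothetical Catalan-Spanish pairs, not taken from the training data.
pairs = [
    ("Benvingut al projecte Aina!", "¡Bienvenido al proyecto Aina!"),
    (
        "Aquesta frase té prou paraules per passar el filtre.",
        "Esta frase tiene suficientes palabras para pasar el filtro.",
    ),
]

# The first pair is dropped because each side has fewer than 5 words.
kept = filter_corpus(pairs)
```

Applying the bounds to both sides means a pair is discarded as soon as either sentence falls outside the range.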
76
 
77
+ Before training, the punctuation is normalized using a modified version of the join-single-file.py script
78
+ from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
79
 
80
 
81
  #### Tokenization
82
 
83
+ All data is tokenized using SentencePiece, with a 50,000-token SentencePiece model learned from the combination of all filtered training data.
84
+ This model is included.
85
 
86
  #### Hyperparameters
87
 
 
109
  | Dropout | 0.1 |
110
  | Label smoothing | 0.1 |
111
 
112
+ The model was trained using shards of 10 million sentences, for a total of 13,000 updates.
113
+ Weights were saved every 1000 updates and reported results are the average of the last 6 checkpoints.
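
The following sketch shows one way to read "the average of the last 6 checkpoints", assuming it refers to element-wise averaging of model parameters across checkpoints. That reading is an assumption, not something the README states. Checkpoints are represented here as plain dictionaries of float lists rather than real Fairseq checkpoint files, and the helper names `last_n_updates` and `average_checkpoints` are hypothetical.

```python
from typing import Dict, Iterable, List

Checkpoint = Dict[str, List[float]]


def last_n_updates(saved_updates: Iterable[int], n: int = 6) -> List[int]:
    """Return the n most recent update numbers at which weights were saved."""
    return sorted(saved_updates)[-n:]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Element-wise mean of every parameter across the given checkpoints."""
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    count = len(checkpoints)
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        # Group the values of the same parameter position across checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(column) / count for column in columns]
    return averaged


# Weights saved every 1000 updates over 13,000 updates in total.
saved = range(1000, 13001, 1000)
selected = last_n_updates(saved)  # updates 8000 through 13000

# Toy parameters standing in for real model weights.
toy = [{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}]
mean_weights = average_checkpoints(toy)  # {"w": [2.0, 4.0]}
```

With a save interval of 1000 updates, the last 6 checkpoints cover updates 8000 to 13000.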
114
 
115
  ## Evaluation
116
 
117
  ### Variable and metrics
118
 
119
+ We use the BLEU score for evaluation on test sets: [Flores-101](https://github.com/facebookresearch/flores),
120
+ [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/),
121
+ [United Nations](https://zenodo.org/record/3888414#.Y33-_tLMIW0),
122
+ [Cybersecurity](https://elrc-share.eu/repository/browse/cyber-mt-test-set/2bd93faab98c11ec9c1a00155d026706b96a490ed3e140f0a29a80a08c46e91e/),
123
+ [wmt19 biomedical test set](http://www.statmt.org/wmt19/biomedical-translation-task.html),
124
+ [wmt13 news test set](https://elrc-share.eu/repository/browse/catalan-wmt2013-machine-translation-shared-task-test-set/84a96139b98611ec9c1a00155d0267061a0aa1b62e2248e89aab4952f3c230fc/)
125
 
126
  ### Evaluation results
127
 
128
+ Below are the evaluation results on the machine translation from Catalan to Spanish compared to [Softcatalà](https://www.softcatala.org/)
129
+ and [Google Translate](https://translate.google.es/?hl=es):
130
 
131
+ | Test set | SoftCatalà | Google Translate | aina-translator-ca-es |
132
  |----------------------|------------|------------------|---------------|
133
  | Spanish Constitution | 70.7 | **77.1** | 75.5 |
134
  | United Nations | 78.1 | 84.3 | **86.3** |
 
140
  | aina_aapp_ca-es | 80.9 | 81.4 | **82.8** |
141
  | Average | 53.4 | 56.7 | **56.8** |
142
 
 
143
  ## Additional information
144
 
145
  ### Author
146
+ The Language Technologies Unit from Barcelona Supercomputing Center.
147
 
148
+ ### Contact
149
+ For further information, please send an email to <langtech@bsc.es>.
150
 
151
  ### Copyright
152
+ Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.
 
153
 
154
+ ### License
155
  [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
156
 
157
  ### Funding
158
+ This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).
159
 
160
+ ### Disclaimer
 
161
 
 
162
  <details>
163
  <summary>Click to expand</summary>
164
 
165
+ The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.
166
+
167
+ Be aware that the model may have biases and/or any other undesirable distortions.
168
+
169
+ When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
170
+ or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
171
+ in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
172
 
173
+ In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
174
+ be liable for any results arising from the use made by third parties.
175
 
176
+ </details>