Fairseq
Catalan
Portuguese
fdelucaf committed on
Commit
2590dd7
1 Parent(s): 0c57d3e

Update README.md

  1. README.md +28 -16
README.md CHANGED
@@ -33,7 +33,8 @@ metrics:
33
 
34
  ## Model description
35
 
36
- This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Portuguese datasets, which after filtering and cleaning comprised 6.159.631 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
 
37
 
38
  ## Intended uses and limitations
39
 
@@ -63,6 +64,11 @@ translated = translator.translate_batch([tokenized[0]])
63
  print(tokenizer.detokenize(translated[0][0]['tokens']))
64
  ```
65
 
66
  ## Training
67
 
68
  ### Training data
@@ -133,11 +139,12 @@ The model was trained for a total of 17.000 updates. Weights were saved every 10
133
 
134
  ### Variable and metrics
135
 
136
- We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets
137
 
138
  ### Evaluation results
139
 
140
- Below are the evaluation results on the machine translation from Catalan to Portuguese compared to [Softcatalà](https://www.softcatala.org/) and [Google Translate](https://translate.google.es/?hl=es):
 
141
 
142
  | Test set | SoftCatalà | Google Translate |mt-aina-ca-pt|
143
  |----------------------|------------|------------------|---------------|
@@ -149,29 +156,34 @@ Below are the evaluation results on the machine translation from Catalan to Port
149
  ## Additional information
150
 
151
  ### Author
152
- Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center
153
 
154
- ### Contact information
155
- For further information, please send an email to langtech@bsc.es.
156
 
157
  ### Copyright
158
- Copyright Language Technologies Unit at Barcelona Supercomputing Center (2023)
159
 
160
- ### Licensing information
161
- This work is licensed under a [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
162
 
163
  ### Funding
164
- This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat
165
-
166
- ## Limitations and Bias
167
- At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
168
 
169
  ### Disclaimer
170
 
171
  <details>
172
  <summary>Click to expand</summary>
173
 
174
- The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
175
- When third parties, deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
176
- In no event shall the owner and creator of the models (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.
177
  </details>
 
33
 
34
  ## Model description
35
 
36
+ This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Portuguese datasets,
37
+ which after filtering and cleaning comprised 6.159.631 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
38
 
39
  ## Intended uses and limitations
40
 
 
64
  print(tokenizer.detokenize(translated[0][0]['tokens']))
65
  ```
66
 
67
+ ## Limitations and bias
68
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
69
+ However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques
70
+ on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
71
+
72
  ## Training
73
 
74
  ### Training data
 
139
 
140
  ### Variable and metrics
141
 
142
+ We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.
143
 
144
  ### Evaluation results
145
 
146
+ Below are the evaluation results on the machine translation from Catalan to Portuguese
147
+ compared to [Softcatalà](https://www.softcatala.org/) and [Google Translate](https://translate.google.es/?hl=es):
148
 
149
  | Test set | SoftCatalà | Google Translate |mt-aina-ca-pt|
150
  |----------------------|------------|------------------|---------------|
 
156
  ## Additional information
157
 
158
  ### Author
159
+ The Language Technologies Unit from Barcelona Supercomputing Center.
160
 
161
+ ### Contact
162
+ For further information, please send an email to <langtech@bsc.es>.
163
 
164
  ### Copyright
165
+ Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.
166
 
167
+ ### License
168
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
169
 
170
  ### Funding
171
+ This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
172
 
173
  ### Disclaimer
174
 
175
  <details>
176
  <summary>Click to expand</summary>
177
 
178
+ The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.
179
+
180
+ Be aware that the model may have biases and/or any other undesirable distortions.
181
+
182
+ When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
183
+ or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
184
+ in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
185
+
186
+ In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
187
+ be liable for any results arising from the use made by third parties.
188
+
189
  </details>