Fairseq
Italian
Catalan
fdelucaf committed on
Commit
d733360
1 Parent(s): 96d0516

Update README.md

  1. README.md +40 -43
README.md CHANGED
@@ -1,32 +1,20 @@
1
  ---
2
- license: apache-2.0
3
  ---
4
  ## Projecte Aina’s Italian-Catalan machine translation model
5
-
6
- ## Table of Contents
7
- - [Model Description](#model-description)
8
- - [Intended Uses and Limitations](#intended-use)
9
- - [How to Use](#how-to-use)
10
- - [Training](#training)
11
- - [Training data](#training-data)
12
- - [Training procedure](#training-procedure)
13
- - [Data Preparation](#data-preparation)
14
- - [Tokenization](#tokenization)
15
- - [Hyperparameters](#hyperparameters)
16
- - [Evaluation](#evaluation)
17
- - [Variable and Metrics](#variable-and-metrics)
18
- - [Evaluation Results](#evaluation-results)
19
- - [Additional Information](#additional-information)
20
- - [Author](#author)
21
- - [Contact Information](#contact-information)
22
- - [Copyright](#copyright)
23
- - [Licensing Information](#licensing-information)
24
- - [Funding](#funding)
25
- - [Disclaimer](#disclaimer)
26
 
27
  ## Model description
28
 
29
- This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Italian datasets, which after filtering and cleaning comprised 9.482.927 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
 
30
 
31
  ## Intended uses and limitations
32
 
@@ -46,7 +34,7 @@ Translate a sentence using python
46
  import ctranslate2
47
  import pyonmttok
48
  from huggingface_hub import snapshot_download
49
- model_dir = snapshot_download(repo_id="projecte-aina/mt-aina-it-ca", revision="main")
50
  tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
51
  tokenized=tokenizer.tokenize("Benvenuto al progetto Aina!")
52
  translator = ctranslate2.Translator(model_dir)
@@ -77,12 +65,16 @@ The model was trained on a combination of the following datasets:
77
 
78
  ### Data preparation
79
 
80
- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75. This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). The filtered datasets are then concatenated to form a final corpus of 9.482.927 and before training the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
81
 
82
 
83
  #### Tokenization
84
 
85
- All data is tokenized using sentencepiece, with a 50 thousand token sentencepiece model learned from the combination of all filtered training data. This model is included.
 
86
 
87
  #### Hyperparameters
88
 
@@ -116,13 +108,14 @@ The model was trained for a total of 19.000 updates. Weights were saved every 10
116
 
117
  ### Variable and metrics
118
 
119
- We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores), and [NTREX](https://github.com/MicrosoftTranslator/NTREX) evaluation datasets.
120
 
121
  ### Evaluation results
122
 
123
- Below are the evaluation results on the machine translation from Italian to Catalan compared to [Softcatalà](https://www.softcatala.org/) and [Google Translate](https://translate.google.es/?hl=es):
 
124
 
125
- | Test set | SoftCatalà | Google Translate |mt-aina-it-ca|
126
  |----------------------|------------|------------------|---------------|
127
  | Flores 101 dev | 25,4 | **30,4** | 27,5 |
128
  | Flores 101 devtest |26,6 | **31,2** | 27,7 |
@@ -132,30 +125,34 @@ Below are the evaluation results on the machine translation from Italian to Cata
132
  ## Additional information
133
 
134
  ### Author
135
- Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center.
136
 
137
- ### Contact information
138
- For further information, send an email to <langtech@bsc.es>
139
 
140
  ### Copyright
141
- Copyright Language Technologies Unit at Barcelona Supercomputing Center (2023)
142
 
143
- ### Licensing information
144
- This work is licensed under a [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
145
 
146
  ### Funding
147
- This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project] (https://projecteaina.cat/).
148
-
149
- ## Limitations and Bias
150
- At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
151
 
152
  ### Disclaimer
153
 
154
  <details>
155
  <summary>Click to expand</summary>
156
- The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
157
- When third parties, deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
158
- In no event shall the owner and creator of the models (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.
159
- </details>
160
 
161
 
1
  ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - projecte-aina/CA-IT_Parallel_Corpus
5
+ language:
6
+ - it
7
+ - ca
8
+ metrics:
9
+ - bleu
10
+ library_name: fairseq
11
  ---
12
  ## Projecte Aina’s Italian-Catalan machine translation model
13
 
14
  ## Model description
15
 
16
+ This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Italian datasets,
17
+ which after filtering and cleaning comprised 9.482.927 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
18
 
19
  ## Intended uses and limitations
20
 
 
34
  import ctranslate2
35
  import pyonmttok
36
  from huggingface_hub import snapshot_download
37
+ model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-it-ca", revision="main")
38
  tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
39
  tokenized=tokenizer.tokenize("Benvenuto al progetto Aina!")
40
  translator = ctranslate2.Translator(model_dir)
 
65
 
66
  ### Data preparation
67
 
68
+ All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
69
+ This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
70
+ The filtered datasets are then concatenated to form a final corpus of 9.482.927 and before training the punctuation is normalized using
71
+ a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
72
 
73
 
74
  #### Tokenization
75
 
76
+ All data is tokenized using sentencepiece, with a 50 thousand token sentencepiece model learned from the combination of all filtered training data.
77
+ This model is included.
78
 
79
  #### Hyperparameters
80
 
 
108
 
109
  ### Variable and metrics
110
 
111
+ We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.
112
 
113
  ### Evaluation results
114
 
115
+ Below are the evaluation results on the machine translation from Italian to Catalan compared to [Softcatalà](https://www.softcatala.org/) and
116
+ [Google Translate](https://translate.google.es/?hl=es):
117
 
118
+ | Test set | SoftCatalà | Google Translate | aina-translator-it-ca |
119
  |----------------------|------------|------------------|---------------|
120
  | Flores 101 dev | 25,4 | **30,4** | 27,5 |
121
  | Flores 101 devtest |26,6 | **31,2** | 27,7 |
 
125
  ## Additional information
126
 
127
  ### Author
128
+ The Language Technologies Unit from Barcelona Supercomputing Center.
129
 
130
+ ### Contact
131
+ For further information, please send an email to <langtech@bsc.es>.
132
 
133
  ### Copyright
134
+ Copyright(c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.
135
 
136
+ ### License
137
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
138
 
139
  ### Funding
140
+ This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).
141
 
142
  ### Disclaimer
143
 
144
  <details>
145
  <summary>Click to expand</summary>
146
 
147
+ The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.
148
+
149
+ Be aware that the model may have biases and/or any other undesirable distortions.
150
+
151
+ When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
152
+ or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
153
+ in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
154
+
155
+ In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
156
+ be liable for any results arising from the use made by third parties.
157
 
158
+ </details>
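
The "Data preparation" text in this diff describes deduplicating the corpus and dropping sentence pairs whose cosine similarity is below 0.75. A minimal sketch of that filtering step follows. It is an illustration under stated assumptions, not the authors' pipeline: the README says the embeddings come from LaBSE, whereas here `embed` is a hypothetical callable and `TOY_VECTORS` holds made-up 2-dimensional vectors so the example runs without any model. Exact-match deduplication is also an assumption, since the README does not say how duplicates are detected.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(
    pairs: Iterable[Pair],
    embed: Callable[[str], Vector],
    threshold: float = 0.75,  # threshold stated in the README
) -> List[Pair]:
    """Drop exact duplicate pairs, then drop pairs scoring below the threshold."""
    seen = set()
    kept: List[Pair] = []
    for src, tgt in pairs:
        if (src, tgt) in seen:  # assumption: duplicates are exact string matches
            continue
        seen.add((src, tgt))
        if cosine_similarity(embed(src), embed(tgt)) >= threshold:
            kept.append((src, tgt))
    return kept


# Hypothetical toy embeddings standing in for LaBSE sentence vectors.
TOY_VECTORS = {
    "Buongiorno": (1.0, 0.0),
    "Bon dia": (0.9, 0.1),
    "Grazie": (0.0, 1.0),
    "Bona nit": (1.0, 0.0),
}


def toy_embed(sentence: str) -> Vector:
    return TOY_VECTORS[sentence]


pairs = [
    ("Buongiorno", "Bon dia"),  # well aligned, kept
    ("Buongiorno", "Bon dia"),  # exact duplicate, dropped
    ("Grazie", "Bona nit"),     # misaligned, dropped
]
kept = filter_pairs(pairs, toy_embed)
```

In a real run, `embed` would wrap a sentence-embedding model such as LaBSE, and batching the embedding calls would matter far more for speed than anything in this loop.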