Fairseq
Catalan
Italian
fdelucaf committed on
Commit
3979675
1 Parent(s): 4115540

Update README.md

Files changed (1)
  1. README.md +24 -32
README.md CHANGED
@@ -1,32 +1,20 @@
1
  ---
2
- license: apache-2.0
3
  ---
4
  ## Projecte Aina’s Catalan-Italian machine translation model
5
 
6
- ## Table of Contents
7
- - [Model Description](#model-description)
8
- - [Intended Uses and Limitations](#intended-use)
9
- - [How to Use](#how-to-use)
10
- - [Training](#training)
11
- - [Training data](#training-data)
12
- - [Training procedure](#training-procedure)
13
- - [Data Preparation](#data-preparation)
14
- - [Tokenization](#tokenization)
15
- - [Hyperparameters](#hyperparameters)
16
- - [Evaluation](#evaluation)
17
- - [Variable and Metrics](#variable-and-metrics)
18
- - [Evaluation Results](#evaluation-results)
19
- - [Additional Information](#additional-information)
20
- - [Author](#author)
21
- - [Contact Information](#contact-information)
22
- - [Copyright](#copyright)
23
- - [Licensing Information](#licensing-information)
24
- - [Funding](#funding)
25
- - [Disclaimer](#disclaimer)
26
-
27
  ## Model description
28
 
29
- This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Italian datasets, which after filtering and cleaning comprised 9.482.927 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
 
30
 
31
  ## Intended uses and limitations
32
 
@@ -46,7 +34,7 @@ Translate a sentence using python
46
  import ctranslate2
47
  import pyonmttok
48
  from huggingface_hub import snapshot_download
49
- model_dir = snapshot_download(repo_id="projecte-aina/mt-aina-ca-it", revision="main")
50
 
51
  tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
52
  tokenized=tokenizer.tokenize("Benvingut al projecte Aina!")
@@ -56,6 +44,10 @@ translated = translator.translate_batch([tokenized[0]])
56
  print(tokenizer.detokenize(translated[0][0]['tokens']))
57
  ```
58
 
59
  ## Training
60
 
61
  ### Training data
@@ -79,7 +71,10 @@ The model was trained on a combination of the following datasets:
79
 
80
  ### Data preparation
81
 
82
- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75. This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). The filtered datasets are then concatenated to form a final corpus of 9.482.927 and before training the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
83
 
84
 
85
  #### Tokenization
@@ -124,7 +119,7 @@ We use the BLEU score for evaluation on the [Flores-101](https://github.com/face
124
 
125
  Below are the evaluation results on the machine translation from Catalan to Italian compared to [Softcatalà](https://www.softcatala.org/) and [Google Translate](https://translate.google.es/?hl=es):
126
 
127
- | Test set | SoftCatalà | Google Translate |mt-aina-ca-it|
128
  |----------------------|------------|------------------|---------------|
129
  | Flores 101 dev | 24,3 | **28,5** | 26,1 |
130
  | Flores 101 devtest |24,7 | **29,1** | 26,3 |
@@ -136,20 +131,17 @@ Below are the evaluation results on the machine translation from Catalan to Ital
136
  ### Author
137
  Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center.
138
 
139
- ### Contact information
140
  For further information, send an email to <langtech@bsc.es>
141
 
142
  ### Copyright
143
  Copyright Language Technologies Unit at Barcelona Supercomputing Center (2023)
144
 
145
- ### Licensing information
146
  This work is licensed under an [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
147
 
148
  ### Funding
149
- This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project] (https://projecteaina.cat/).
150
-
151
- ## Limitations and Bias
152
- At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
153
 
154
  ### Disclaimer
155
 
 
1
  ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - projecte-aina/CA-IT_Parallel_Corpus
5
+ language:
6
+ - ca
7
+ - it
8
+ metrics:
9
+ - bleu
10
+ library_name: fairseq
11
  ---
12
  ## Projecte Aina’s Catalan-Italian machine translation model
13
 
14
  ## Model description
15
 
16
+ This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination
17
+ of Catalan-Italian datasets, which after filtering and cleaning comprised 9.482.927 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
18
 
19
  ## Intended uses and limitations
20
 
 
34
  import ctranslate2
35
  import pyonmttok
36
  from huggingface_hub import snapshot_download
37
+ model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-it", revision="main")
38
 
39
  tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
40
  tokenized=tokenizer.tokenize("Benvingut al projecte Aina!")
 
44
  print(tokenizer.detokenize(translated[0][0]['tokens']))
45
  ```
46
 
47
+ ## Limitations and bias
48
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
49
+ However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
50
+
51
  ## Training
52
 
53
  ### Training data
 
71
 
72
  ### Data preparation
73
 
74
+ All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less
75
+ than 0.75. This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
76
+ The filtered datasets are then concatenated to form a final corpus of 9.482.927 and before training the punctuation is normalized
77
+ using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
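
The deduplication and similarity filter described in this paragraph can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: `dedup_and_filter` and `cosine_similarity` are hypothetical helper names, exact-match deduplication is an assumption, and `embed` stands in for a real sentence encoder such as LaBSE (here it is replaced by a toy lookup table so the snippet runs without downloading a model). Only the 0.75 threshold comes from the README.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedup_and_filter(
    pairs: Iterable[Pair],
    embed: Callable[[str], Vector],
    threshold: float = 0.75,  # threshold stated in the README
) -> List[Pair]:
    """Drop exact duplicate pairs, then drop pairs whose source/target
    embeddings have cosine similarity below `threshold`."""
    seen = set()
    kept: List[Pair] = []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue  # exact duplicate (assumed dedup strategy)
        seen.add((src, tgt))
        if cosine_similarity(embed(src), embed(tgt)) >= threshold:
            kept.append((src, tgt))
    return kept


# Toy embeddings standing in for LaBSE output (hypothetical values).
toy_embeddings = {
    "Bon dia": [1.0, 0.0],
    "Buongiorno": [0.9, 0.1],
    "Gràcies": [0.0, 1.0],
    "Il gatto": [1.0, 0.0],
}

pairs = [
    ("Bon dia", "Buongiorno"),
    ("Bon dia", "Buongiorno"),  # duplicate, removed
    ("Gràcies", "Il gatto"),    # misaligned pair, filtered out
]

kept = dedup_and_filter(pairs, toy_embeddings.__getitem__)
print(kept)
```

In a real setting, `embed` would be replaced by a call to a multilingual sentence encoder, and embeddings would normally be computed in batches rather than one sentence at a time.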
78
 
79
 
80
  #### Tokenization
 
119
 
120
  Below are the evaluation results on the machine translation from Catalan to Italian compared to [Softcatalà](https://www.softcatala.org/) and [Google Translate](https://translate.google.es/?hl=es):
121
 
122
+ | Test set | SoftCatalà | Google Translate | aina-translator-ca-it |
123
  |----------------------|------------|------------------|---------------|
124
  | Flores 101 dev | 24,3 | **28,5** | 26,1 |
125
  | Flores 101 devtest |24,7 | **29,1** | 26,3 |
 
131
  ### Author
132
  Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center.
133
 
134
+ ### Contact
135
  For further information, send an email to <langtech@bsc.es>
136
 
137
  ### Copyright
138
  Copyright Language Technologies Unit at Barcelona Supercomputing Center (2023)
139
 
140
+ ### License
141
  This work is licensed under an [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
142
 
143
  ### Funding
144
+ This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).
145
 
146
  ### Disclaimer
147