fdelucaf committed
Commit c37aa46 (1 parent: d12f0a2)

Update README.md

Files changed (1): README.md (+30 -35)
@@ -1,32 +1,20 @@
---
license: apache-2.0
---
## Projecte Aina’s Catalan-German machine translation model

- ## Table of Contents
- - [Model Description](#model-description)
- - [Intended Uses and Limitations](#intended-use)
- - [How to Use](#how-to-use)
- - [Training](#training)
- - [Training data](#training-data)
- - [Training procedure](#training-procedure)
- - [Data Preparation](#data-preparation)
- - [Tokenization](#tokenization)
- - [Hyperparameters](#hyperparameters)
- - [Evaluation](#evaluation)
- - [Variable and Metrics](#variable-and-metrics)
- - [Evaluation Results](#evaluation-results)
- - [Additional Information](#additional-information)
- - [Author](#author)
- - [Contact Information](#contact-information)
- - [Copyright](#copyright)
- - [Licensing Information](#licensing-information)
- - [Funding](#funding)
- - [Disclaimer](#disclaimer)
-

## Model description

- This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets, which after filtering and cleaning comprised 6.258.272 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.

## Intended uses and limitations

@@ -46,7 +34,7 @@ Translate a sentence using python
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download
- model_dir = snapshot_download(repo_id="projecte-aina/mt-aina-ca-de", revision="main")

tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
tokenized=tokenizer.tokenize("Benvingut al projecte Aina!")
@@ -56,6 +44,10 @@ translated = translator.translate_batch([tokenized[0]])
print(tokenizer.detokenize(translated[0][0]['tokens']))
```

## Training

### Training data
@@ -78,19 +70,24 @@ The model was trained on a combination of the following datasets:
| Tilde | 3.434.091 | 3.434.091 |
| **Total** | **7.427.843** | **6.258.272** |

- All corpora except Europarl and Tilde were collected from [Opus](https://opus.nlpl.eu/). The Europarl and Tilde corpora are synthetic parallel corpora created from the original Spanish-Catalan corpora by [SoftCatalà](https://github.com/Softcatala).

### Training procedure

### Data preparation

- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75. This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). The filtered datasets are then concatenated to form a final corpus of 6.159.631 and before training the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)

#### Tokenization

- All data is tokenized using sentencepiece, with a 50 thousand token sentencepiece model learned from the combination of all filtered training data. This model is included.

#### Hyperparameters

@@ -124,13 +121,14 @@ The model was trained for a total of 22.000 updates. Weights were saved every 10

### Variable and metrics

- We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets

### Evaluation results

- Below are the evaluation results on the machine translation from Catalan to German compared to [Softcatalà](https://www.softcatala.org/) and [Google Translate](https://translate.google.es/?hl=es):

- | Test set | SoftCatalà | Google Translate |mt-aina-ca-de|
|----------------------|------------|------------------|---------------|
| Flores 101 dev | 26,2 | **34,8** | 27,5 |
| Flores 101 devtest |26,3 | **34,0** | 26,9 |
@@ -142,20 +140,17 @@ Below are the evaluation results on the machine translation from Catalan to Germ
### Author
Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center

- ### Contact information
For further information, please send an email to langtech@bsc.es.

### Copyright
Copyright Language Technologies Unit at Barcelona Supercomputing Center (2023)

- ### Licensing information
This work is licensed under a [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
- This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project] (https://projecteaina.cat/).
-
- ## Limitations and Bias
- At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.

### Disclaimer
 
 
---
license: apache-2.0
+ datasets:
+ - projecte-aina/CA-DE_Parallel_Corpus
+ language:
+ - ca
+ - de
+ metrics:
+ - bleu
+ library_name: fairseq
---
## Projecte Aina’s Catalan-German machine translation model

## Model description

+ This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets,
+ which after filtering and cleaning comprised 6.258.272 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.

## Intended uses and limitations
 
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download
+ model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-de", revision="main")

tokenizer=pyonmttok.Tokenizer(mode="none", sp_model_path = model_dir + "/spm.model")
tokenized=tokenizer.tokenize("Benvingut al projecte Aina!")

print(tokenizer.detokenize(translated[0][0]['tokens']))
```
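
For reference, a runnable version of the snippet above would look roughly like the sketch below. Only fragments of the original code block appear in this diff, so the `ctranslate2.Translator(...)` construction and the `device` argument are assumptions not shown here.

```python
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download

# Download the CTranslate2 weights and the bundled SentencePiece model from the Hub
model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-ca-de", revision="main")

# Tokenize the Catalan source sentence; tokenize() returns (tokens, features)
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model")
tokenized = tokenizer.tokenize("Benvingut al projecte Aina!")

# Assumption: the translator is loaded from the same directory (device="cpu" is illustrative)
translator = ctranslate2.Translator(model_dir, device="cpu")
translated = translator.translate_batch([tokenized[0]])

# Detokenize the best hypothesis back into plain German text
print(tokenizer.detokenize(translated[0][0]['tokens']))
```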

+ ## Limitations and bias
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model.
+ However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.
+
## Training

### Training data
 
| Tilde | 3.434.091 | 3.434.091 |
| **Total** | **7.427.843** | **6.258.272** |

+ All corpora except Europarl and Tilde were collected from [Opus](https://opus.nlpl.eu/).
+ The Europarl and Tilde corpora are synthetic parallel corpora created from the original Spanish-Catalan corpora by [SoftCatalà](https://github.com/Softcatala).

### Training procedure

### Data preparation

+ All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
+ This is done using sentence embeddings calculated with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
+ The filtered datasets are then concatenated to form a final corpus of 6.159.631 sentence pairs. Before training, punctuation is normalized
+ using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py). A sketch of the similarity filtering step is shown below.
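
As an illustration of the filtering described above, the following sketch keeps only sentence pairs whose LaBSE embeddings have a cosine similarity of at least 0.75. The actual scripts and batching used for this model are not part of this diff, so function and file names here are hypothetical.

```python
# Minimal sketch of the deduplication + LaBSE similarity filter (illustrative only).
from sentence_transformers import SentenceTransformer

def filter_parallel(pairs, threshold=0.75):
    """Keep (ca, de) pairs whose LaBSE embeddings have cosine similarity >= threshold."""
    model = SentenceTransformer("sentence-transformers/LaBSE")
    pairs = list(dict.fromkeys(pairs))  # drop exact duplicate pairs, keep order
    ca_emb = model.encode([ca for ca, _ in pairs], convert_to_tensor=True, normalize_embeddings=True)
    de_emb = model.encode([de for _, de in pairs], convert_to_tensor=True, normalize_embeddings=True)
    sims = (ca_emb * de_emb).sum(dim=1)  # cosine similarity of aligned rows (embeddings are normalized)
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]

kept = filter_parallel([("Benvingut al projecte Aina!", "Willkommen beim Aina-Projekt!")])
print(len(kept))
```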

#### Tokenization

+ All data is tokenized with SentencePiece, using a SentencePiece model with a vocabulary of 50 thousand tokens learned from the combination of all the filtered training data.
+ This SentencePiece model is included with the translation model (the spm.model file loaded in the usage example above); a training sketch is shown below.
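
Illustratively, such a vocabulary could be learned with the SentencePiece Python API as sketched below. The exact training options behind the released spm.model are not part of this diff, and the input file name is a placeholder.

```python
# Hedged sketch: learning a 50k SentencePiece model from the filtered, concatenated corpus.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="filtered_corpus.ca-de.txt",  # placeholder: all filtered training sentences, one per line
    model_prefix="spm",                 # writes spm.model and spm.vocab
    vocab_size=50000,
)

# The resulting spm.model is what pyonmttok loads via sp_model_path in the usage example.
sp = spm.SentencePieceProcessor(model_file="spm.model")
print(sp.encode("Benvingut al projecte Aina!", out_type=str))
```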

#### Hyperparameters

### Variable and metrics

+ We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.
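
As an illustration of how such scores can be computed, the sketch below evaluates a file of system outputs against a reference file with sacreBLEU; the file names are placeholders and the exact evaluation setup is not part of this diff.

```python
# Hedged sketch: corpus-level BLEU with sacreBLEU (file names are placeholders).
import sacrebleu

with open("hypotheses.de", encoding="utf-8") as f:   # model translations, one per line
    hypotheses = [line.strip() for line in f]
with open("references.de", encoding="utf-8") as f:   # reference translations, same order
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```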

### Evaluation results

+ Below are the evaluation results for machine translation from Catalan to German, compared to [Softcatalà](https://www.softcatala.org/)
+ and [Google Translate](https://translate.google.es/?hl=es):

+ | Test set | SoftCatalà | Google Translate | aina-translator-ca-de |
|----------------------|------------|------------------|---------------|
| Flores 101 dev | 26,2 | **34,8** | 27,5 |
| Flores 101 devtest | 26,3 | **34,0** | 26,9 |
 
### Author
Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center

+ ### Contact
For further information, please send an email to langtech@bsc.es.

### Copyright
Copyright Language Technologies Unit at Barcelona Supercomputing Center (2023)

+ ### License
This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Funding
+ This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

### Disclaimer