Fairseq
Catalan
Portuguese
AudreyVM committed on
Commit
6f6dcb4
1 Parent(s): ee0590b

Update README.md

Files changed (1)
  1. README.md +31 -21
README.md CHANGED
@@ -13,8 +13,9 @@ library_name: fairseq

  ## Model description

- This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Portuguese datasets,
- which after filtering and cleaning comprised 6.159.631 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.

  ## Intended uses and limitations

@@ -54,28 +55,36 @@ However, we are well aware that our models may be biased. We intend to conduct r

  The model was trained on a combination of the following datasets:

- | Dataset | Sentences | Sentences after Cleaning|
- |-------------------|----------------|-------------------|
- | CCMatrix v1 | 12.674.684 | 3.765.459|
- | WikiMatrix | 358.873 | 317.649 |
- | GNOME | 5.211 | 1.752|
- | KDE4 | 166.208 | 117.828 |
- | OpenSubtitles | 384.142 | 235.604 |
- | GlobalVoices| 4.035 | 3.430|
- | Tatoeba | 754 | 723 |
- | Europarl | 1.692.106 | 1.631.989 |
-
- All corpora except Europarl were collected from [Opus](https://opus.nlpl.eu/).
  The Europarl corpus is a synthetic parallel corpus created from the original Spanish-Catalan corpus by [SoftCatalà](https://github.com/Softcatala/Europarl-catalan).


  ### Training procedure

  ### Data preparation

- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
- The filtered datasets are then concatenated to form a final corpus of 6.159.631 and before training the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)


  #### Tokenization
@@ -115,7 +124,7 @@ The model was trained for a total of 17.000 updates. Weights were saved every 10

  ### Variable and metrics

- We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.

  ### Evaluation results

@@ -124,10 +133,11 @@ compared to [Softcatalà](https://www.softcatala.org/) and [Google Translate](ht

  | Test set | SoftCatalà | Google Translate | aina-translator-ca-pt |
  |----------------------|------------|------------------|---------------|
- | Flores 101 dev | 30,9 | **41,4** | 34,3 |
- | Flores 101 devtest |31,6 | **41,3** | 35,2 |
- | NTREX | 27,9 | **30,1** | 28,0 |
- | Average | 30,1 | **37,6** | 32,5 |

  ## Additional information

 

  ## Model description

+ This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of datasets comprising
+ both Catalan-Portuguese data sourced from Opus, and additional datasets where synthetic Catalan was generated from the Spanish side of Spanish-Portuguese corpora using [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca).
+ This gave a total of approximately 100 million sentence pairs. The model is evaluated on the Flores, NTEU and NTREX evaluation sets.

  ## Intended uses and limitations

 

  The model was trained on a combination of the following datasets:

+ | Datasets       |
+ |----------------------|
+ | DGT |
+ |EU Bookshop |
+ | Europarl |
+ |Global Voices |
+ | GNOME |
+ |KDE 4 |
+ | Multi CCAligned |
+ | Multi Paracrawl |
+ | Multi UN |
+ | NLLB    |
+ | NTEU |
+ | Open Subtitles |
+ |Tatoeba |
+ |UNPC |
+ | WikiMatrix |
+
+ All corpora except Europarl were collected from [Opus](https://opus.nlpl.eu/) and [ELRC](https://www.elrc-share.eu/).
  The Europarl corpus is a synthetic parallel corpus created from the original Spanish-Catalan corpus by [SoftCatalà](https://github.com/Softcatala/Europarl-catalan).
+ After all Catalan-Portuguese data had been collected, Spanish-Portuguese data was collected and the Spanish data translated to Catalan using [Projecte Aina’s Spanish-Catalan model.](https://huggingface.co/projecte-aina/aina-translator-es-ca)


  ### Training procedure

  ### Data preparation

+ All datasets are deduplicated, filtered for language identification, and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
+ The filtered datasets are then concatenated to form the final corpus. Before training the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)


  #### Tokenization
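
Note on the data-preparation lines in this hunk: the README describes deduplication followed by removal of sentence pairs whose cosine similarity is below 0.75, computed on LaBSE sentence embeddings. The following is a minimal sketch of that filtering logic, not the authors' pipeline. It assumes the embeddings were computed elsewhere and are passed in as plain lists of floats. The names `cosine_similarity` and `dedup_and_filter` are hypothetical, exact-match deduplication is an assumption, and the language-identification filter is not covered.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedup_and_filter(
    pairs: List[Tuple[str, str]],
    src_vecs: List[Vector],
    tgt_vecs: List[Vector],
    threshold: float = 0.75,
) -> List[Tuple[str, str]]:
    """Drop exact duplicate pairs, then drop pairs scoring below the threshold."""
    seen = set()
    kept = []
    for pair, src_vec, tgt_vec in zip(pairs, src_vecs, tgt_vecs):
        if pair in seen:
            continue  # assumption: duplicates are exact string matches
        seen.add(pair)
        # Pairs with similarity of less than the threshold are removed.
        if cosine_similarity(src_vec, tgt_vec) >= threshold:
            kept.append(pair)
    return kept
```

In a real run the vectors would come from a sentence-embedding model such as LaBSE; toy two-dimensional vectors are enough to exercise the threshold logic.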
 

  ### Variable and metrics

+ We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores), NTEU (unpublished), and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.

  ### Evaluation results

 

  | Test set | SoftCatalà | Google Translate | aina-translator-ca-pt |
  |----------------------|------------|------------------|---------------|
+ | Flores 101 dev | 30,8 | **39,7** | 38,5 |
+ | Flores 101 devtest |31,5 | **39** | 39 |
+ | NTEU| 41,7 | 47,1 | **57,4** |
+ | NTREX | 27,9 | **30,2** | 28,9 |
+ | **Average** | 33 | 39 |**41** |

  ## Additional information
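
Note on the added Average row: it appears to be the unweighted mean of the four test-set scores in each column, rounded to the nearest integer. That reading is an assumption, since the README does not say how the average is computed. The sketch below recomputes it from the added (+) rows; the helper names `parse_score` and `column_average` are hypothetical.

```python
def parse_score(cell: str) -> float:
    """Parse a table cell such as '**39,7**' or '57,4' (decimal comma) into a float."""
    return float(cell.replace("*", "").strip().replace(",", "."))


def column_average(cells):
    """Unweighted mean of the scores in one table column."""
    scores = [parse_score(c) for c in cells]
    return sum(scores) / len(scores)


# Scores copied from the added (+) rows, in the order:
# Flores 101 dev, Flores 101 devtest, NTEU, NTREX.
new_scores = {
    "SoftCatalà": ["30,8", "31,5", "41,7", "27,9"],
    "Google Translate": ["**39,7**", "**39**", "47,1", "**30,2**"],
    "aina-translator-ca-pt": ["38,5", "39", "**57,4**", "28,9"],
}

for system, cells in new_scores.items():
    print(system, column_average(cells))
```

The exact means are 32.975, 39.0 and 40.95, which agree with the 33, 39 and 41 reported in the table once rounded.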