Fairseq
Portuguese
Catalan
AudreyVM committed on
Commit 7a91206
1 Parent(s): 5eb40f0

Update README.md

Files changed (1)
  1. README.md +26 -21
README.md CHANGED
@@ -13,8 +13,7 @@ library_name: fairseq
 
 ## Model description
 
- This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Portuguese datasets,
- which after filtering and cleaning comprised 6.159.631 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
 
 ## Intended uses and limitations
 
@@ -54,26 +53,31 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
 The model was trained on a combination of the following datasets:
 
- | Dataset           | Sentences  | Sentences after cleaning |
- |-------------------|------------|--------------------------|
- | CCMatrix v1       | 12.674.684 | 3.765.459                |
- | WikiMatrix        | 358.873    | 317.649                  |
- | GNOME             | 5.211      | 1.752                    |
- | KDE4              | 166.208    | 117.828                  |
- | OpenSubtitles     | 384.142    | 235.604                  |
- | GlobalVoices      | 4.035      | 3.430                    |
- | Tatoeba           | 754        | 723                      |
- | Europarl          | 1.692.106  | 1.631.989                |
-
- All corpora except Europarl were collected from [Opus](https://opus.nlpl.eu/).
- The Europarl corpus is a synthetic parallel corpus created from the original Spanish-Catalan corpus by [SoftCatalà](https://github.com/Softcatala/Europarl-catalan).
-
 
 ### Training procedure
 
 ### Data preparation
 
- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
 This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
 The filtered datasets are then concatenated to form a final corpus of 6.159.631 sentence pairs, and before training the punctuation is normalized using a
 modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
@@ -126,10 +130,11 @@ Below are the evaluation results on the machine translation from Portuguese to C
 
 | Test set           | SoftCatalà | Google Translate | aina-translator-pt-ca |
 |--------------------|------------|------------------|-----------------------|
- | Flores 101 dev     | 31,9       | **37,8**         | 34,4                  |
- | Flores 101 devtest | 33,6       | **38,5**         | 35,7                  |
- | NTREX              | 28,9       | **33,6**         | 31,0                  |
- | Average            | 31,5       | **36,6**         | 33,7                  |
 
 ## Additional information
 
 
 
 ## Model description
 
+ This model was trained from scratch using the Fairseq toolkit on a combination of datasets comprising both Catalan-Portuguese data sourced from Opus, and additional datasets where synthetic Catalan was generated from the Spanish side of Spanish-Portuguese corpora using Projecte Aina’s Spanish-Catalan model. This gave a total of approximately 100 million sentence pairs. The model is evaluated on the Flores, NTEU and NTREX evaluation sets.
 
 ## Intended uses and limitations
 
 
 
 The model was trained on a combination of the following datasets:
 
+ | Datasets        |
+ |-----------------|
+ | DGT             |
+ | EU Bookshop     |
+ | Europarl        |
+ | Global Voices   |
+ | GNOME           |
+ | KDE 4           |
+ | Multi CCAligned |
+ | Multi Paracrawl |
+ | Multi UN        |
+ | NLLB            |
+ | NTEU            |
+ | Open Subtitles  |
+ | Tatoeba         |
+ | UNPC            |
+ | WikiMatrix      |
+
+ All data was sourced from [OPUS](https://opus.nlpl.eu/) and [ELRC](https://www.elrc-share.eu/). After all Catalan-Portuguese data had been collected, Spanish-Portuguese data was collected and the Spanish side was translated to Catalan using [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca).
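
The pivot step described above can be sketched as follows. This is a minimal illustration, not the actual pipeline code: `translate_es_to_ca` is a hypothetical stand-in for a call to the aina-translator-es-ca model, and the toy lexicon is invented purely for the example.

```python
# Sketch of the pivot step: the Spanish side of a Spanish-Portuguese corpus
# is machine-translated into Catalan, yielding synthetic Catalan-Portuguese
# sentence pairs for training.

def translate_es_to_ca(sentence: str) -> str:
    # Placeholder: a real pipeline would invoke the es-ca NMT model here.
    # The word-level lexicon below is a toy stand-in for illustration only.
    lexicon = {"hola": "hola", "mundo": "món"}
    return " ".join(lexicon.get(tok, tok) for tok in sentence.split())

def make_synthetic_pairs(es_pt_pairs):
    """Turn (Spanish, Portuguese) pairs into synthetic (Catalan, Portuguese) pairs."""
    return [(translate_es_to_ca(es), pt) for es, pt in es_pt_pairs]

pairs = make_synthetic_pairs([("hola mundo", "olá mundo")])
# → [("hola món", "olá mundo")]
```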
 
 ### Training procedure
 
 ### Data preparation
 
+ All datasets are deduplicated, filtered by language identification, and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
 This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
 The filtered datasets are then concatenated to form a final corpus of 6.159.631 sentence pairs, and before training the punctuation is normalized using a
 modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
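
The similarity filter above can be sketched as follows. This is a minimal illustration, not the card's actual filtering code: in the real pipeline the embeddings come from LaBSE (e.g. via the sentence-transformers library), whereas the toy vectors below are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def filter_pairs(pairs, embed, threshold=0.75):
    """Keep only pairs whose embedding cosine similarity meets the threshold."""
    return [(src, tgt) for src, tgt in pairs
            if cosine(embed(src), embed(tgt)) >= threshold]

# Toy embedding table standing in for LaBSE sentence embeddings.
toy = {"bon dia": [1.0, 0.1], "bom dia": [0.9, 0.2], "adeu": [0.0, 1.0]}
kept = filter_pairs([("bon dia", "bom dia"), ("bon dia", "adeu")], toy.get)
# → [("bon dia", "bom dia")]  (the mismatched pair falls below 0.75)
```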
 
 | Test set           | SoftCatalà | Google Translate | aina-translator-pt-ca |
 |--------------------|------------|------------------|-----------------------|
+ | Flores 101 dev     | 32,0       | **38,3**         | 35,8                  |
+ | Flores 101 devtest | 33,4       | **39,0**         | 37,1                  |
+ | NTEU               | 41,6       | 44,9             | **48,3**              |
+ | NTREX              | 28,8       | **33,6**         | 32,1                  |
+ | **Average**        | 33,9       | **38,9**         | 38,3                  |
 
 ## Additional information