projecte-aina/aina-translator-ca-pt

Fairseq

Catalan

Portuguese


AudreyVM committed on Nov 6

Commit 6f6dcb4 • 1 Parent(s): ee0590b

Update README.md


Files changed (1)

README.md +31 -21


@@ -13,8 +13,9 @@ library_name: fairseq
 ## Model description
-This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Portuguese datasets,
-which after filtering and cleaning comprised 6.159.631 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
 ## Intended uses and limitations
@@ -54,28 +55,36 @@ However, we are well aware that our models may be biased. We intend to conduct r
 The model was trained on a combination of the following datasets:
-| Dataset       	| Sentences  	| Sentences after Cleaning|
-|-------------------|----------------|-------------------|
-| CCMatrix  v1  	| 12.674.684  	| 	3.765.459|
-| WikiMatrix  	| 358.873 	| 317.649 	|
-| GNOME	| 5.211	|	1.752|
-| KDE4    	| 166.208   	|  117.828 	|
-| OpenSubtitles	| 384.142	| 235.604	|
-| GlobalVoices| 4.035 	|	3.430|
-| Tatoeba | 754 | 723 |
-| Europarl | 1.692.106 | 1.631.989 |
-All corpora except Europarl were collected from [Opus](https://opus.nlpl.eu/).
 The Europarl corpus is a synthetic parallel corpus created from the original Spanish-Catalan corpus by [SoftCatalà](https://github.com/Softcatala/Europarl-catalan).
 ### Training procedure
 ### Data preparation
- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
- The filtered datasets are then concatenated to form a final corpus of 6.159.631 and before training the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
 #### Tokenization
@@ -115,7 +124,7 @@ The model was trained for a total of 17.000 updates. Weights were saved every 10
 ### Variable and metrics
-We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.
 ### Evaluation results
@@ -124,10 +133,11 @@ compared to [Softcatalà](https://www.softcatala.org/) and [Google Translate](ht
 | Test set         	| SoftCatalà | Google Translate | aina-translator-ca-pt |
 |----------------------|------------|------------------|---------------|
-| Flores 101 dev   	| 30,9     	| **41,4**     	| 34,3     	|
-| Flores 101 devtest   |31,6   	| **41,3**     	| 35,2     	|
-| NTREX | 27,9 | **30,1** | 28,0 |
-| Average          	| 30,1   	| **37,6**     	| 32,5      	|
 ## Additional information

 ## Model description
+This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of datasets comprising
+both Catalan-Portuguese data sourced from Opus, and additional datasets where synthetic Catalan was generated from the Spanish side of Spanish-Portuguese corpora using [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca).
+This gave a total of approximately 100 million sentence pairs. The model is evaluated on the Flores, NTEU and NTREX evaluation sets.
 ## Intended uses and limitations
 The model was trained on a combination of the following datasets:
+| Datasets       |
+|----------------------|
+| DGT |
+|EU Bookshop |
+| Europarl |
+|Global Voices |
+| GNOME |
+|KDE 4 |
+| Multi CCAligned |
+| Multi Paracrawl |
+| Multi UN |
+| NLLB    |
+| NTEU |
+| Open Subtitles |
+|Tatoeba |
+|UNPC |
+| WikiMatrix |
+All corpora except Europarl were collected from [Opus](https://opus.nlpl.eu/) and [ELRC](https://www.elrc-share.eu/).
 The Europarl corpus is a synthetic parallel corpus created from the original Spanish-Catalan corpus by [SoftCatalà](https://github.com/Softcatala/Europarl-catalan).
+After all Catalan-Portuguese data had been collected, Spanish-Portuguese data was collected and the Spanish data translated to Catalan using [Projecte Aina’s Spanish-Catalan model.](https://huggingface.co/projecte-aina/aina-translator-es-ca)
 ### Training procedure
 ### Data preparation
+ All datasets are deduplicated, filtered for language identification, and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
+ The filtered datasets are then concatenated to form the final corpus. Before training the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
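
Reviewer note: the deduplication and similarity-filtering step described in the added lines can be sketched as follows. This is a minimal illustration, not the project's actual pipeline. It assumes the sentence embeddings have already been computed (the README names LaBSE for this), and the function names `cosine` and `filter_pairs` are hypothetical. The language-identification filter and the punctuation normalization are left out.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(
    pairs: Sequence[Pair],
    src_vecs: Sequence[Sequence[float]],
    tgt_vecs: Sequence[Sequence[float]],
    threshold: float = 0.75,  # value stated in the README
) -> List[Pair]:
    """Drop exact duplicate pairs, then drop pairs below the threshold."""
    seen = set()
    kept = []
    for pair, u, v in zip(pairs, src_vecs, tgt_vecs):
        if pair in seen:
            continue  # deduplication: keep only the first occurrence
        seen.add(pair)
        if cosine(u, v) >= threshold:
            kept.append(pair)
    return kept


# Toy data: the vectors are made up and stand in for real embeddings.
pairs = [("Bon dia", "Bom dia"), ("Bon dia", "Bom dia"), ("El gat", "A mesa")]
src_vecs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
tgt_vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
kept = filter_pairs(pairs, src_vecs, tgt_vecs)
```

With the toy vectors above, the duplicate pair is dropped first and the unrelated pair is removed by the threshold, so only one pair remains in `kept`.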
 #### Tokenization
 ### Variable and metrics
+We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores), NTEU (unpublished), and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.
 ### Evaluation results
 | Test set         	| SoftCatalà | Google Translate | aina-translator-ca-pt |
 |----------------------|------------|------------------|---------------|
+| Flores 101 dev   	| 30,8     	| **39,7**     	| 38,5     	|
+| Flores 101 devtest   |31,5   	| **39**     	| 39     	|
+| NTEU| 41,7 | 47,1 | **57,4** |
+| NTREX | 27,9 | **30,2** | 28,9 |
+| **Average**          	| 33   	| 39     	|**41**      	|
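
Reviewer note: the new Average row appears to be the unweighted mean of the four test-set rows, rounded to the nearest integer. The short check below recomputes it from the BLEU scores in the added rows; the decimal commas of the table are written as decimal points, and the variable names are only illustrative.

```python
# BLEU scores copied from the added table rows, in the order
# Flores 101 dev, Flores 101 devtest, NTEU, NTREX.
scores = {
    "SoftCatalà": [30.8, 31.5, 41.7, 27.9],
    "Google Translate": [39.7, 39.0, 47.1, 30.2],
    "aina-translator-ca-pt": [38.5, 39.0, 57.4, 28.9],
}

# Unweighted mean per system, then rounded to match the table.
averages = {system: sum(vals) / len(vals) for system, vals in scores.items()}
rounded = {system: round(avg) for system, avg in averages.items()}

for system, avg in averages.items():
    print(f"{system}: {avg:.2f} -> {rounded[system]}")
```

The rounded means come out as 33, 39 and 41, which agrees with the Average row in the diff.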
 ## Additional information