Fairseq
Portuguese
Catalan
AudreyVM committed on
Commit 7a91206
1 Parent(s): 5eb40f0

Update README.md

Files changed (1)
  1. README.md +26 -21
README.md CHANGED
@@ -13,8 +13,7 @@ library_name: fairseq
 
 ## Model description
 
- This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-Portuguese datasets,
- which after filtering and cleaning comprised 6.159.631 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
 
 ## Intended uses and limitations
 
@@ -54,26 +53,31 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
 The model was trained on a combination of the following datasets:
 
- | Dataset           | Sentences  | Sentences after cleaning |
- |-------------------|------------|--------------------------|
- | CCMatrix v1       | 12.674.684 | 3.765.459                |
- | WikiMatrix        | 358.873    | 317.649                  |
- | GNOME             | 5.211      | 1.752                    |
- | KDE4              | 166.208    | 117.828                  |
- | OpenSubtitles     | 384.142    | 235.604                  |
- | GlobalVoices      | 4.035      | 3.430                    |
- | Tatoeba           | 754        | 723                      |
- | Europarl          | 1.692.106  | 1.631.989                |
-
- All corpora except Europarl were collected from [Opus](https://opus.nlpl.eu/).
- The Europarl corpus is a synthetic parallel corpus created from the original Spanish-Catalan corpus by [SoftCatalà](https://github.com/Softcatala/Europarl-catalan).
-
 
 ### Training procedure
 
 ### Data preparation
 
- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
 This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
 The filtered datasets are then concatenated to form a final corpus of 6.159.631 sentence pairs, and before training the punctuation is normalized using a
 modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
@@ -126,10 +130,11 @@ Below are the evaluation results on the machine translation from Portuguese to C
 
 | Test set           | SoftCatalà | Google Translate | aina-translator-pt-ca |
 |--------------------|------------|------------------|-----------------------|
- | Flores 101 dev     | 31,9       | **37,8**         | 34,4                  |
- | Flores 101 devtest | 33,6       | **38,5**         | 35,7                  |
- | NTREX              | 28,9       | **33,6**         | 31,0                  |
- | Average            | 31,5       | **36,6**         | 33,7                  |
 
 ## Additional information
 
 
 
 ## Model description
 
+ This model was trained from scratch using the Fairseq toolkit on a combination of datasets comprising both Catalan-Portuguese data sourced from Opus, and additional datasets where synthetic Catalan was generated from the Spanish side of Spanish-Portuguese corpora using Projecte Aina’s Spanish-Catalan model. This gave a total of approximately 100 million sentence pairs. The model is evaluated on the Flores, NTEU and NTREX evaluation sets.
 
 ## Intended uses and limitations
 
 
 
 The model was trained on a combination of the following datasets:
 
+ | Datasets        |
+ |-----------------|
+ | DGT             |
+ | EU Bookshop     |
+ | Europarl        |
+ | Global Voices   |
+ | GNOME           |
+ | KDE 4           |
+ | Multi CCAligned |
+ | Multi Paracrawl |
+ | Multi UN        |
+ | NLLB            |
+ | NTEU            |
+ | Open Subtitles  |
+ | Tatoeba         |
+ | UNPC            |
+ | WikiMatrix      |
+
+ All data was sourced from [OPUS](https://opus.nlpl.eu/) and [ELRC](https://www.elrc-share.eu/). After all Catalan-Portuguese data had been collected, Spanish-Portuguese data was collected and the Spanish side was translated to Catalan using [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca).
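
The pivot step described above can be sketched as follows. This is a minimal illustration, not the actual pipeline code: `translate_es_to_ca` is a hypothetical stand-in for a call to the aina-translator-es-ca model, and the toy lexicon is invented purely for the example.

```python
# Sketch of the pivot step: the Spanish side of a Spanish-Portuguese corpus
# is machine-translated into Catalan, yielding synthetic Catalan-Portuguese
# sentence pairs for training.

def translate_es_to_ca(sentence: str) -> str:
    # Placeholder: a real pipeline would invoke the es-ca NMT model here.
    # The word-level lexicon below is a toy stand-in for illustration only.
    lexicon = {"hola": "hola", "mundo": "món"}
    return " ".join(lexicon.get(tok, tok) for tok in sentence.split())

def make_synthetic_pairs(es_pt_pairs):
    """Turn (Spanish, Portuguese) pairs into synthetic (Catalan, Portuguese) pairs."""
    return [(translate_es_to_ca(es), pt) for es, pt in es_pt_pairs]

pairs = make_synthetic_pairs([("hola mundo", "olá mundo")])
# → [("hola món", "olá mundo")]
```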
 
 ### Training procedure
 
 ### Data preparation
 
+ All datasets are deduplicated, filtered by language identification, and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
 This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
 The filtered datasets are then concatenated to form a final corpus of 6.159.631 sentence pairs, and before training the punctuation is normalized using a
 modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
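
The similarity filter above can be sketched as follows. This is a minimal illustration, not the card's actual filtering code: in the real pipeline the embeddings come from LaBSE (e.g. via the sentence-transformers library), whereas the toy vectors below are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def filter_pairs(pairs, embed, threshold=0.75):
    """Keep only pairs whose embedding cosine similarity meets the threshold."""
    return [(src, tgt) for src, tgt in pairs
            if cosine(embed(src), embed(tgt)) >= threshold]

# Toy embedding table standing in for LaBSE sentence embeddings.
toy = {"bon dia": [1.0, 0.1], "bom dia": [0.9, 0.2], "adeu": [0.0, 1.0]}
kept = filter_pairs([("bon dia", "bom dia"), ("bon dia", "adeu")], toy.get)
# → [("bon dia", "bom dia")]  (the mismatched pair falls below 0.75)
```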
 
 | Test set           | SoftCatalà | Google Translate | aina-translator-pt-ca |
 |--------------------|------------|------------------|-----------------------|
+ | Flores 101 dev     | 32,0       | **38,3**         | 35,8                  |
+ | Flores 101 devtest | 33,4       | **39,0**         | 37,1                  |
+ | NTEU               | 41,6       | 44,9             | **48,3**              |
+ | NTREX              | 28,8       | **33,6**         | 32,1                  |
+ | **Average**        | 33,9       | **38,9**         | 38,3                  |
 
 ## Additional information