Update README.md

README.md CHANGED

@@ -1,5 +1,12 @@
 ---
 license: apache-2.0
+datasets:
+- projecte-aina/CA-PT_Parallel_Corpus
+language:
+- ca
+- pt
+metrics:
+- bleu
 ---
 ## Projecte Aina’s Catalan-Portuguese machine translation model

@@ -76,19 +83,23 @@ The model was trained on a combination of the following datasets:
 | Europarl | 1.692.106 | 1.631.989 |
 | **Total** | **15.391.745** | **6.159.631** |

-All corpora except Europarl were collected from [Opus](https://opus.nlpl.eu/).
+All corpora except Europarl were collected from [Opus](https://opus.nlpl.eu/).
+The Europarl corpus is a synthetic parallel corpus created from the original Spanish-Catalan corpus by [SoftCatalà](https://github.com/Softcatala/Europarl-catalan).

 ### Training procedure

 ### Data preparation

-All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
+All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
+This is done using sentence embeddings calculated with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
+The filtered datasets are then concatenated to form a final corpus of 6.159.631 sentence pairs, and before training the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
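The deduplication and LaBSE similarity filter added above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the helper names are hypothetical, and in the real setup the embeddings would come from the LaBSE model (e.g. via `sentence-transformers`, with `normalize_embeddings=True`) over the full corpus.

```python
import numpy as np

def dedupe(pairs):
    """Remove duplicate (src, tgt) sentence pairs, keeping first occurrences."""
    return list(dict.fromkeys(pairs))

def cosine_filter(pairs, src_emb, tgt_emb, threshold=0.75):
    """Keep pairs whose source/target embeddings have cosine similarity
    >= threshold.  src_emb and tgt_emb are row-normalized 2D arrays, as
    returned e.g. by SentenceTransformer("sentence-transformers/LaBSE")
    with normalize_embeddings=True.
    """
    sims = np.sum(src_emb * tgt_emb, axis=1)  # dot product of unit vectors = cosine
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]
```

The idea behind the 0.75 threshold: a misaligned or mistranslated pair embeds far apart in LaBSE's shared multilingual space, so its cosine similarity drops below the cutoff and the pair is discarded.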
 #### Tokenization

-All data is tokenized using sentencepiece, with a 50 thousand token sentencepiece model learned from the combination of all filtered training data.
+All data is tokenized using sentencepiece, with a 50,000-token sentencepiece model learned from the combination of all filtered training data.
+The trained sentencepiece model is included with this release.

 #### Hyperparameters