Update README.md

README.md CHANGED

@@ -1,5 +1,12 @@
 ---
 license: apache-2.0
+datasets:
+- projecte-aina/CA-PT_Parallel_Corpus
+language:
+- ca
+- pt
+metrics:
+- bleu
 ---
 ## Projecte Aina’s Catalan-Portuguese machine translation model

@@ -76,19 +83,23 @@ The model was trained on a combination of the following datasets:
 | Europarl | 1.692.106 | 1.631.989 |
 | **Total** | **15.391.745** | **6.159.631** |

-All corpora except Europarl were collected from [Opus](https://opus.nlpl.eu/).
+All corpora except Europarl were collected from [Opus](https://opus.nlpl.eu/).
+The Europarl corpus is a synthetic parallel corpus created from the original Spanish-Catalan corpus by [SoftCatalà](https://github.com/Softcatala/Europarl-catalan).

 ### Training procedure

 ### Data preparation

-All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
+All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
+This is done using sentence embeddings calculated with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
+The filtered datasets are then concatenated to form a final corpus of 6.159.631 sentence pairs, and before training the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
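The deduplication and LaBSE similarity filter added above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the helper names are hypothetical, and in the real setup the embeddings would come from the LaBSE model (e.g. via `sentence-transformers`, with `normalize_embeddings=True`) over the full corpus.

```python
import numpy as np

def dedupe(pairs):
    """Remove duplicate (src, tgt) sentence pairs, keeping first occurrences."""
    return list(dict.fromkeys(pairs))

def cosine_filter(pairs, src_emb, tgt_emb, threshold=0.75):
    """Keep pairs whose source/target embeddings have cosine similarity
    >= threshold.  src_emb and tgt_emb are row-normalized 2D arrays, as
    returned e.g. by SentenceTransformer("sentence-transformers/LaBSE")
    with normalize_embeddings=True.
    """
    sims = np.sum(src_emb * tgt_emb, axis=1)  # dot product of unit vectors = cosine
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]
```

The idea behind the 0.75 threshold: a misaligned or mistranslated pair embeds far apart in LaBSE's shared multilingual space, so its cosine similarity drops below the cutoff and the pair is discarded.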
 #### Tokenization

-All data is tokenized using sentencepiece, with a 50 thousand token sentencepiece model learned from the combination of all filtered training data.
+All data is tokenized using sentencepiece, with a 50,000-token sentencepiece model learned from the combination of all filtered training data.
+The trained sentencepiece model is included with this release.

 #### Hyperparameters