fdelucaf committed on
Commit 63f7285
1 Parent(s): afd55ad

Update README.md

Files changed (1)
  1. README.md +14 -3
README.md CHANGED
@@ -1,5 +1,12 @@
  ---
  license: apache-2.0
+ datasets:
+ - projecte-aina/CA-PT_Parallel_Corpus
+ language:
+ - ca
+ - pt
+ metrics:
+ - bleu
  ---
  ## Projecte Aina’s Catalan-Portuguese machine translation model

@@ -76,19 +83,23 @@ The model was trained on a combination of the following datasets:
  | Europarl | 1.692.106 | 1.631.989 |
  | **Total** | **15.391.745** | **6.159.631** |

- All corpora except Europarl were collected from [Opus](https://opus.nlpl.eu/). The Europarl corpus is a synthetic parallel corpus created from the original Spanish-Catalan corpus by [SoftCatalà](https://github.com/Softcatala/Europarl-catalan).
+ All corpora except Europarl were collected from [Opus](https://opus.nlpl.eu/).
+ The Europarl corpus is a synthetic parallel corpus created from the original Spanish-Catalan corpus by [SoftCatalà](https://github.com/Softcatala/Europarl-catalan).


  ### Training procedure

  ### Data preparation

- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75. This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). The filtered datasets are then concatenated to form a final corpus of 6.159.631 and before training the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
+ All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
+ This is done using sentence embeddings calculated with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
+ The filtered datasets are then concatenated to form a final corpus of 6.159.631 sentence pairs; before training, punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).


  #### Tokenization

- All data is tokenized using sentencepiece, with a 50 thousand token sentencepiece model learned from the combination of all filtered training data. This model is included.
+ All data is tokenized using SentencePiece, with a 50,000-token SentencePiece model learned from the combination of all filtered training data.
+ This SentencePiece model is included.

  #### Hyperparameters
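
For illustration, the deduplication and LaBSE-based filtering described in the Data preparation hunk above could be sketched as follows. This is a minimal sketch, assuming the parallel data is available as (Catalan, Portuguese) sentence-pair tuples and using the sentence-transformers LaBSE checkpoint linked in the README; the function name, batching, and I/O are hypothetical and do not reproduce the project's actual pipeline (which also normalizes punctuation with a modified join-single-file.py).

```python
# Minimal sketch (not the project's actual pipeline): deduplicate parallel
# sentence pairs and drop pairs whose LaBSE cosine similarity is below 0.75.
from sentence_transformers import SentenceTransformer


def filter_parallel_pairs(pairs, threshold=0.75, batch_size=256):
    """`pairs` is an iterable of (catalan_sentence, portuguese_sentence) tuples."""
    # Exact deduplication, preserving order.
    unique_pairs = list(dict.fromkeys(pairs))

    model = SentenceTransformer("sentence-transformers/LaBSE")
    ca_sents = [ca for ca, _ in unique_pairs]
    pt_sents = [pt for _, pt in unique_pairs]

    # With normalized embeddings, cosine similarity reduces to a dot product.
    ca_emb = model.encode(ca_sents, batch_size=batch_size, normalize_embeddings=True)
    pt_emb = model.encode(pt_sents, batch_size=batch_size, normalize_embeddings=True)
    sims = (ca_emb * pt_emb).sum(axis=1)

    return [pair for pair, sim in zip(unique_pairs, sims) if sim >= threshold]


if __name__ == "__main__":
    sample = [
        ("Bon dia, com estàs?", "Bom dia, como estás?"),
        ("Bon dia, com estàs?", "Bom dia, como estás?"),       # duplicate: removed
        ("El gat dorm al sofà.", "O comboio chega às nove."),  # mismatched: filtered out
    ]
    kept = filter_parallel_pairs(sample)
    print(f"kept {len(kept)} of {len(sample)} pairs")
```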
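
Similarly, the shared 50,000-token SentencePiece model described in the Tokenization hunk could be learned and applied roughly like this; the file names and any training options other than the vocabulary size are assumptions.

```python
# Minimal sketch (file names are assumptions): learn a shared 50,000-token
# SentencePiece model on the combined filtered data and use it to tokenize.
import sentencepiece as spm

# Train one model on the concatenation of the filtered Catalan and Portuguese
# sides, one sentence per line.
spm.SentencePieceTrainer.train(
    input="filtered.ca-pt.txt",      # hypothetical combined training file
    model_prefix="spm_ca_pt_50k",
    vocab_size=50_000,
)

# Apply the learned model to raw text before it goes to the translation system.
sp = spm.SentencePieceProcessor(model_file="spm_ca_pt_50k.model")
print(sp.encode("Bon dia, com estàs?", out_type=str))
```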