Update README.md
README.md CHANGED
@@ -53,40 +53,19 @@ However, we are well aware that our models may be biased. We intend to conduct r

### Training data

-The model was trained on a combination of the following datasets:
-
-| Dataset            | Sentences      |
-|--------------------|----------------|
-| Global Voices      | 21.342         |
-| Memories Lluires   | 1.173.055      |
-| Wikimatrix         | 1.205.908      |
-| TED Talks          | 50.979         |
-| Tatoeba            | 5.500          |
-| CoVost 2 ca-en     | 79.633         |
-| CoVost 2 en-ca     | 263.891        |
-| Europarl           | 1.965.734      |
-| jw300              | 97.081         |
-| Crawled Generalitat| 38.595         |
-| Opus Books         | 4.580          |
-| CC Aligned         | 5.787.682      |
-| COVID_Wikipedia    | 1.531          |
-| EuroBooks          | 3.746          |
-| Gnome              | 2.183          |
-| KDE 4              | 144.153        |
-| OpenSubtitles      | 427.913        |
-| QED                | 69.823         |
-| Ubuntu             | 6.781          |
-| Wikimedia          | 208.073        |
-|--------------------|----------------|
-| **Total**          | **11.558.183** |

### Training procedure

### Data preparation

-All datasets are
-
-

#### Tokenization

### Training data

+The model was trained on a combination of several datasets, including data collected from [Opus](https://opus.nlpl.eu/),
+[HPLT](https://hplt-project.org/), an internally created [CA-EN Parallel Corpus](https://huggingface.co/datasets/projecte-aina/CA-EN_Parallel_Corpus),
+and other sources.
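
As a hedged illustration of how the referenced data can be accessed, the sketch below loads the CA-EN Parallel Corpus with the Hugging Face `datasets` library; the `split` name is an assumption and may differ from the dataset card's actual configuration.

```python
# Illustrative sketch only: pull the CA-EN Parallel Corpus from the Hub.
# The split name "train" is assumed; check the dataset card for the real layout.
from datasets import load_dataset

ca_en = load_dataset("projecte-aina/CA-EN_Parallel_Corpus", split="train")
print(ca_en[0])  # inspect a single Catalan-English sentence pair
```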

### Training procedure

### Data preparation

+All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
+This is done using sentence embeddings calculated with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
+The filtered datasets are then concatenated to form a final corpus of 30.023.034 parallel sentences, and before training
+the punctuation is normalized using a modified version of the join-single-file.py script from
+[SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
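
The filtering step described above can be sketched roughly as follows. This is an illustrative outline, not the project's actual pipeline: the function name, batching, and exact deduplication strategy are assumptions. It embeds both sides of each pair with LaBSE and keeps pairs whose cosine similarity is at least 0.75.

```python
# Illustrative sketch of LaBSE-based filtering, not the project's pipeline.
# Pairs are deduplicated, embedded with LaBSE, and kept only if the cosine
# similarity between source and target embeddings is at least 0.75.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_pairs(pairs, threshold=0.75):
    """pairs: iterable of (ca_sentence, en_sentence) tuples."""
    # Exact deduplication while preserving order.
    pairs = list(dict.fromkeys(pairs))

    ca_side = [ca for ca, _ in pairs]
    en_side = [en for _, en in pairs]

    # Normalized embeddings make the row-wise dot product equal cosine similarity.
    ca_emb = model.encode(ca_side, normalize_embeddings=True)
    en_emb = model.encode(en_side, normalize_embeddings=True)
    sims = np.sum(ca_emb * en_emb, axis=1)

    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]
```

A production pipeline would stream and batch the corpus from disk rather than holding all pairs and embeddings in memory.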

#### Tokenization