Update README.md
README.md CHANGED
@@ -53,40 +53,19 @@ However, we are well aware that our models may be biased. We intend to conduct r

### Training data

-The model was trained on a combination of the following datasets:
-
-| Dataset            | Sentences      |
-|--------------------|----------------|
-| Global Voices      | 21.342         |
-| Memories Lluires   | 1.173.055      |
-| Wikimatrix         | 1.205.908      |
-| TED Talks          | 50.979         |
-| Tatoeba            | 5.500          |
-| CoVost 2 ca-en     | 79.633         |
-| CoVost 2 en-ca     | 263.891        |
-| Europarl           | 1.965.734      |
-| jw300              | 97.081         |
-| Crawled Generalitat| 38.595         |
-| Opus Books         | 4.580          |
-| CC Aligned         | 5.787.682      |
-| COVID_Wikipedia    | 1.531          |
-| EuroBooks          | 3.746          |
-| Gnome              | 2.183          |
-| KDE 4              | 144.153        |
-| OpenSubtitles      | 427.913        |
-| QED                | 69.823         |
-| Ubuntu             | 6.781          |
-| Wikimedia          | 208.073        |
-|--------------------|----------------|
-| **Total**          | **11.558.183** |

### Training procedure

### Data preparation

-All datasets are
-
-

#### Tokenization

### Training data

+The model was trained on a combination of several datasets, including data collected from [Opus](https://opus.nlpl.eu/),
+[HPLT](https://hplt-project.org/), an internally created [CA-EN Parallel Corpus](https://huggingface.co/datasets/projecte-aina/CA-EN_Parallel_Corpus),
+and other sources.
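
As a hedged illustration of how the referenced data can be accessed, the sketch below loads the CA-EN Parallel Corpus with the Hugging Face `datasets` library; the `split` name is an assumption and may differ from the dataset card's actual configuration.

```python
# Illustrative sketch only: pull the CA-EN Parallel Corpus from the Hub.
# The split name "train" is assumed; check the dataset card for the real layout.
from datasets import load_dataset

ca_en = load_dataset("projecte-aina/CA-EN_Parallel_Corpus", split="train")
print(ca_en[0])  # inspect a single Catalan-English sentence pair
```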

### Training procedure

### Data preparation

+All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
+This is done using sentence embeddings calculated with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
+The filtered datasets are then concatenated to form a final corpus of 30.023.034 parallel sentences, and before training
+the punctuation is normalized using a modified version of the join-single-file.py script from
+[SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
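
The filtering step described above can be sketched roughly as follows. This is an illustrative outline, not the project's actual pipeline: the function name, batching, and exact deduplication strategy are assumptions. It embeds both sides of each pair with LaBSE and keeps pairs whose cosine similarity is at least 0.75.

```python
# Illustrative sketch of LaBSE-based filtering, not the project's pipeline.
# Pairs are deduplicated, embedded with LaBSE, and kept only if the cosine
# similarity between source and target embeddings is at least 0.75.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_pairs(pairs, threshold=0.75):
    """pairs: iterable of (ca_sentence, en_sentence) tuples."""
    # Exact deduplication while preserving order.
    pairs = list(dict.fromkeys(pairs))

    ca_side = [ca for ca, _ in pairs]
    en_side = [en for _, en in pairs]

    # Normalized embeddings make the row-wise dot product equal cosine similarity.
    ca_emb = model.encode(ca_side, normalize_embeddings=True)
    en_emb = model.encode(en_side, normalize_embeddings=True)
    sims = np.sum(ca_emb * en_emb, axis=1)

    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]
```

A production pipeline would stream and batch the corpus from disk rather than holding all pairs and embeddings in memory.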

#### Tokenization