Fairseq
Catalan
English
fdelucaf committed on
Commit
036dd02
1 Parent(s): bc181f2

Update README.md

Files changed (1)
  1. README.md +8 -29
README.md CHANGED
@@ -53,40 +53,19 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
 ### Training data
 
- The model was trained on a combination of the following datasets:
-
- | Dataset             | Sentences      |
- |---------------------|----------------|
- | Global Voices       | 21.342         |
- | Memories Lliures    | 1.173.055      |
- | Wikimatrix          | 1.205.908      |
- | TED Talks           | 50.979         |
- | Tatoeba             | 5.500          |
- | CoVost 2 ca-en      | 79.633         |
- | CoVost 2 en-ca      | 263.891        |
- | Europarl            | 1.965.734      |
- | jw300               | 97.081         |
- | Crawled Generalitat | 38.595         |
- | Opus Books          | 4.580          |
- | CC Aligned          | 5.787.682      |
- | COVID_Wikipedia     | 1.531          |
- | EuroBooks           | 3.746          |
- | Gnome               | 2.183          |
- | KDE 4               | 144.153        |
- | OpenSubtitles       | 427.913        |
- | QED                 | 69.823         |
- | Ubuntu              | 6.781          |
- | Wikimedia           | 208.073        |
- | **Total**           | **11.558.183** |
 
 ### Training procedure
 
 ### Data preparation
 
- All datasets are concatenated and filtered using the [mBERT Gencata parallel filter](https://huggingface.co/projecte-aina/mbert-base-gencata).
- Before training, the punctuation is normalized using a modified version of the join-single-file.py script from
- [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
 
 #### Tokenization
 
 
 ### Training data
 
+ The model was trained on a combination of several datasets, including data collected from [Opus](https://opus.nlpl.eu/),
+ [HPLT](https://hplt-project.org/), an internally created [CA-EN Parallel Corpus](https://huggingface.co/datasets/projecte-aina/CA-EN_Parallel_Corpus),
+ and other sources.
 
 ### Training procedure
 
 ### Data preparation
 
+ All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
+ This is done using sentence embeddings computed with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
+ The filtered datasets are then concatenated to form a final corpus of 30.023.034 parallel sentences. Before training,
+ the punctuation is normalized using a modified version of the join-single-file.py script from
+ [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
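
For illustration, a minimal sketch of this kind of deduplication and LaBSE similarity filtering, assuming the [sentence-transformers](https://www.sbert.net/) package (the function name and data handling here are hypothetical, not the project's actual pipeline):

```python
# Sketch of deduplication + LaBSE cosine-similarity filtering of a parallel
# corpus. Illustrative only; not the exact pipeline used to train this model.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_parallel(ca_sents, en_sents, threshold=0.75):
    """Drop duplicate pairs, then keep pairs with cosine similarity >= threshold."""
    # Deduplicate sentence pairs while preserving their original order.
    pairs = list(dict.fromkeys(zip(ca_sents, en_sents)))
    ca_emb = model.encode([ca for ca, _ in pairs], normalize_embeddings=True)
    en_emb = model.encode([en for _, en in pairs], normalize_embeddings=True)
    # With L2-normalized embeddings, cosine similarity is a row-wise dot product.
    sims = (ca_emb * en_emb).sum(axis=1)
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]
```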
 
 #### Tokenization