Update README.md
README.md
CHANGED
@@ -287,7 +287,7 @@ and the rest of the languages were kept as is, resulting in the following distri
 This highly multilingual corpus is predominantly composed of data from Colossal OSCAR,
 which contributes a significant 66.06% of the total tokens.
 Following this, Starcoder provides 11.91%, and Spanish Crawling adds 3.34%.
-The next largest sources are French
+The next largest sources are French PD at 3.12% and Proof Pile at 1.98%.
 Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing around 1.5% to 1.3%.
 These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
 The remaining 10% comes from smaller sources in various languages.
@@ -301,7 +301,6 @@ Feel free to click the expand button below to see the full list of sources.
 |-----------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
 | Parlamint corpus | at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si | Erjavec et al., 2021 |
 | Bulgarian National Corpus | bg | [Link](http://old.dcl.bas.bg/dataset/BulNC.7z) |
-| Crawl of Bulgarian news websites | bg | [Link](http://old.dcl.bas.bg/dataset/Bulgarian_news.7z) |
 | Colossal OSCAR 1.0 | bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk | Brack et al., 2024 |
 | Wikimedia dumps | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk | [Link](https://dumps.wikimedia.org/) |
 | OpenSubtitlesv2016 | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk | Lison & Tiedemann, 2016 |
@@ -331,7 +330,7 @@ Feel free to click the expand button below to see the full list of sources.
 | proof-pile | en | [Link](https://huggingface.co/datasets/hoskinson-center/proof-pile) |
 | RedPajama-Data T1 (StackExchange subset) | en | Computer, 2023 |
 | The Pile (PhilPapers subset) | en | Gao et al., 2021 |
-| Biomedical | es | Internally generated
+| Biomedical | es | Internally generated biomedical dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler |
 | HPLTDatasets v1 - Spanish | es | de Gibert et al., 2024 |
 | Legal | es | Internally generated legal dataset: BOE, BORME, Senado, Congreso, Spanish court orders, DOGC |
 | Scientific | es | Internally generated scientific dataset: Dialnet, Scielo, CSIC, TDX, BSC, UCM |