jsaizant committed on
Commit
737aed8
1 Parent(s): 0e6a3ec

Update README.md

Files changed (1)
  1. README.md +2 -3
README.md CHANGED
@@ -287,7 +287,7 @@ and the rest of the languages were kept as is, resulting in the following distri
  This highly multilingual corpus is predominantly composed of data from Colossal OSCAR,
  which contributes a significant 66.06% of the total tokens.
  Following this, Starcoder provides 11.91%, and Spanish Crawling adds 3.34%.
- The next largest sources are French FR at 3.12% and Proof Pile at 1.98%.
+ The next largest sources are French PD at 3.12% and Proof Pile at 1.98%.
  Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing around 1.5% to 1.3%.
  These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
  The remaining 10% comes from smaller sources in various languages.
@@ -301,7 +301,6 @@ Feel free to click the expand button below to see the full list of sources.
  |-----------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
  | Parlamint corpus | at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si | Erjavec et al., 2021 |
  | Bulgarian National Corpus | bg | [Link](http://old.dcl.bas.bg/dataset/BulNC.7z) |
- | Crawl of Bulgarian news websites | bg | [Link](http://old.dcl.bas.bg/dataset/Bulgarian_news.7z) |
  | Colossal OSCAR 1.0 | bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk | Brack et al., 2024 |
  | Wikimedia dumps | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk | [Link](https://dumps.wikimedia.org/) |
  | OpenSubtitlesv2016 | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk | Lison & Tiedemann, 2016 |
@@ -331,7 +330,7 @@ Feel free to click the expand button below to see the full list of sources.
  | proof-pile | en | [Link](https://huggingface.co/datasets/hoskinson-center/proof-pile) |
  | RedPajama-Data T1 (StackExchange subset) | en | Computer, 2023 |
  | The Pile (PhilPapers subset) | en | Gao et al., 2021 |
- | Biomedical | es | Internally generated scientific dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler |
+ | Biomedical | es | Internally generated biomedical dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler |
  | HPLTDatasets v1 - Spanish | es | de Gibert et al., 2024 |
  | Legal | es | Internally generated legal dataset: BOE, BORME, Senado, Congreso, Spanish court orders, DOGC |
  | Scientific | es | Internally generated scientific dataset: Dialnet, Scielo, CSIC, TDX, BSC, UCM |
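
As a rough sanity check on the token shares quoted in the README paragraph touched by this commit, the sketch below adds up the named sources and compares the result with the "remaining 10%" figure. The five exact percentages are taken from the text. Macocu, Pile of Law, and Eurlex are only given as a 1.3% to 1.5% range, so treating each as lying anywhere in that range is an assumption, and the variable names are illustrative only.

```python
# Exact shares (percent of total tokens) quoted in the README paragraph.
named_shares = {
    "Colossal OSCAR": 66.06,
    "Starcoder": 11.91,
    "Spanish Crawling": 3.34,
    "French PD": 3.12,
    "Proof Pile": 1.98,
}

# Assumption: the README gives only a rough range for these three sources,
# so each one is bounded by the quoted 1.3% to 1.5% interval.
approx_sources = ["Macocu", "Pile of Law", "Eurlex"]
APPROX_LOW, APPROX_HIGH = 1.3, 1.5

named_total = sum(named_shares.values())
major_low = named_total + len(approx_sources) * APPROX_LOW
major_high = named_total + len(approx_sources) * APPROX_HIGH

# Whatever the major sources do not cover is left for the smaller sources.
remainder_low = 100.0 - major_high
remainder_high = 100.0 - major_low

print(f"Sources with exact shares: {named_total:.2f}%")
print(f"All major sources: {major_low:.2f}% to {major_high:.2f}%")
print(f"Left for smaller sources: {remainder_low:.2f}% to {remainder_high:.2f}%")
```

Under these assumptions the smaller sources account for a little over 9% of the tokens, which is consistent with the README's rounded "remaining 10%".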