dataset to measure perplexity
#2, opened by robbiemu
Do you guys have a multilingual dataset handy that I can use to measure perplexity when I try to quantize this model? Something like wikitext-2, though I believe that one is English only: https://live.european-language-grid.eu/catalogue/corpus/5169 (P.S. I speak a little Spanish/Portuguese, if that would be more comfortable for you.)
Hi @robbiemu! For a dataset representative of the Salamandra training corpus, you can use all dumps from Colossal Oscar 1.0 and Wikipedia. All 36 languages are available in both datasets. Other handy multilingual datasets that we used are OpenSubtitlesv2016, EurLEX-Resources and MC4-Legal, but these are not available for all 36 languages and are not as rich.
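For comparing a quantized model against the original across languages, a minimal sketch of the perplexity arithmetic might look like the following. It assumes you have already collected per-token natural-log probabilities from each model on held-out text per language (for example, samples drawn from the datasets above). The function names and the sample numbers are hypothetical and purely illustrative; they are not part of any Salamandra tooling.

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_by_language(logprobs: Dict[str, List[float]]) -> Dict[str, float]:
    """Compute perplexity separately for each language code."""
    return {lang: perplexity(values) for lang, values in logprobs.items()}


# Hypothetical log-probabilities, standing in for real model outputs.
baseline = {"es": [-1.0, -2.0, -3.0], "ca": [-0.5, -1.5]}
quantized = {"es": [-1.1, -2.1, -3.1], "ca": [-0.6, -1.6]}

base_ppl = perplexity_by_language(baseline)
quant_ppl = perplexity_by_language(quantized)

for lang in base_ppl:
    # A ratio above 1 means the quantized model is worse on that language.
    ratio = quant_ppl[lang] / base_ppl[lang]
    print(f"{lang}: baseline={base_ppl[lang]:.3f} "
          f"quantized={quant_ppl[lang]:.3f} ratio={ratio:.3f}")
```

Reporting the ratio per language rather than a single pooled number makes it easier to spot a quantization that holds up in English but degrades in lower-resource languages.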
robbiemu changed discussion status to closed