dataset to measure perplexity

#2
by robbiemu - opened

Do you have a multilingual dataset handy that I can use to measure perplexity when I quantize this model? Something like wikitext-2, though I believe that one is English-only: https://live.european-language-grid.eu/catalogue/corpus/5169 (P.S. I speak a little Spanish/Portuguese, if that would be more comfortable for you.)

Language Technologies Unit @ Barcelona Supercomputing Center org

Hi @robbiemu! For a dataset representative of the Salamandra training corpus, you can use all dumps from Colossal Oscar 1.0 and Wikipedia; all 36 languages are available in both. Other handy multilingual datasets that we used are OpenSubtitlesv2016, EurLEX-Resources and MC4-Legal, but these do not cover all 36 languages and are not as rich.
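
For anyone comparing a quantized model against the original on these corpora, here is a minimal sketch of the perplexity arithmetic itself, computed per language so that a regression in one language is not averaged away. It assumes you have already collected natural-log token probabilities from each model on the same held-out text. The function names, the language codes, and the toy numbers are hypothetical and only illustrate the calculation; they are not measurements from Salamandra.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def per_language_perplexity(logprobs_by_lang):
    """Map each language code to the perplexity of its token log-probs."""
    return {lang: perplexity(lps) for lang, lps in logprobs_by_lang.items()}


# Toy, made-up log-probabilities standing in for real model outputs.
base_logprobs = {
    "es": [math.log(0.5)] * 4,
    "ca": [math.log(0.25)] * 4,
}
quant_logprobs = {
    "es": [math.log(0.4)] * 4,
    "ca": [math.log(0.2)] * 4,
}

base_ppl = per_language_perplexity(base_logprobs)
quant_ppl = per_language_perplexity(quant_logprobs)

# Report the relative perplexity increase per language after quantization.
for lang in sorted(base_ppl):
    increase = quant_ppl[lang] / base_ppl[lang] - 1.0
    print(f"{lang}: base={base_ppl[lang]:.3f} "
          f"quant={quant_ppl[lang]:.3f} (+{increase:.1%})")
```

Reporting the ratio per language rather than one pooled number makes it easier to spot lower-resource languages that degrade more under quantization.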

robbiemu changed discussion status to closed
