dataset to measure perplexity

by robbiemu - opened Oct 10, 2024

Oct 10, 2024 • edited Oct 10, 2024

Do you have a multilingual dataset handy that I can use to measure perplexity when I quantize this model? Something like wikitext-2, which I believe is English-only: https://live.european-language-grid.eu/catalogue/corpus/5169 (P.S. I speak a little Spanish/Portuguese, if that would be more comfortable for you.)

jsaizant

Language Technologies Unit @ Barcelona Supercomputing Center org Oct 10, 2024

Hi @robbiemu! For a dataset representative of the Salamandra training corpus, you can use all dumps from Colossal Oscar 1.0 and Wikipedia; both cover all 36 languages. Other handy multilingual datasets that we used are OpenSubtitlesv2016, EurLEX-Resources and MC4-Legal, but these are not available for all 36 languages and are not as rich.
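For anyone comparing a quantized model against the original on these corpora, here is a minimal sketch of the perplexity calculation itself, kept per language so that a regression in one language is not averaged away by the others. It assumes you have already collected per-token natural-log probabilities from each model on held-out text; the function names and the numbers are hypothetical and only illustrate the arithmetic, they are not real Salamandra results.

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_by_language(logprobs: Dict[str, List[float]]) -> Dict[str, float]:
    """Compute perplexity separately for each language code."""
    return {lang: perplexity(lps) for lang, lps in logprobs.items()}


# Made-up log-probabilities, for illustration only (not real model output).
baseline = {"es": [math.log(0.5)] * 4, "ca": [math.log(0.25)] * 4}
quantized = {"es": [math.log(0.4)] * 4, "ca": [math.log(0.2)] * 4}

base_ppl = perplexity_by_language(baseline)
quant_ppl = perplexity_by_language(quantized)

# Print baseline and quantized perplexity side by side for each language.
for lang in base_ppl:
    print(lang, round(base_ppl[lang], 3), round(quant_ppl[lang], 3))
```

In practice the log-probabilities would come from running both models over the same sample of text in each language, using the same tokenizer, so the two perplexities are directly comparable.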

robbiemu changed discussion status to closed Oct 10, 2024
