TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages Paper • 2502.11020 • Published Feb 16, 2025 • 3
Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers Paper • 2503.00865 • Published Mar 2, 2025 • 61
MLGym: A New Framework and Benchmark for Advancing AI Research Agents Paper • 2502.14499 • Published Feb 20, 2025 • 188
Automatic Speech Recognition of Low-Resource Languages Based on Chukchi Paper • 2210.05726 • Published Oct 11, 2022 • 1
Dialectal and Low Resource Machine Translation for Aromanian Paper • 2410.17728 • Published Oct 23, 2024 • 1
Teaching Llama a New Language Through Cross-Lingual Knowledge Transfer Paper • 2404.04042 • Published Apr 5, 2024 • 2
LLMs for Extremely Low-Resource Finno-Ugric Languages Paper • 2410.18902 • Published Oct 24, 2024 • 3
Zerpal Collection The largest open-source Udmurt monolingual corpora and pre-trained language models • 14 items • Updated Jun 14, 2024 • 1
Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling Paper • 2311.00430 • Published Nov 1, 2023 • 59
SberQuAD -- Russian Reading Comprehension Dataset: Description and Analysis Paper • 1912.09723 • Published Dec 20, 2019 • 2
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset Paper • 2309.04662 • Published Sep 9, 2023 • 23