Still relying on human intuition to mix corpora from different sources for pre-training? Everyone says that the data mixture has a big impact on model performance, but how - and why? Did you know that web corpora are actually highly impactful for downstream tasks?
Check out our preprint "RegMix: Data Mixture as Regression for Language Model Pre-training"!
In this paper, we propose RegMix, an automatic data mixture method that achieves a 6.3% improvement over human selection on the widely used HellaSwag benchmark - and it only needs 2% extra training FLOPs!
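To give a flavor of the "data mixture as regression" idea, here is a minimal sketch: train small proxy models on randomly sampled mixture weights, fit a regression from mixture weights to a target metric, then pick the mixture the regressor predicts to be best for the large run. The domain names, proxy evaluation, and choice of regressor below are illustrative assumptions, not the exact RegMix setup - please refer to the paper for details.

```python
# Hypothetical sketch of data mixture as regression (assumed domains and regressor).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
domains = ["web", "books", "code", "academic"]  # assumed domain split

def sample_mixtures(n, k):
    """Draw n random mixture weight vectors over k domains (each sums to 1)."""
    return rng.dirichlet(np.ones(k), size=n)

def train_proxy_and_eval(weights):
    """Placeholder for training a small proxy model on the given mixture and
    returning its validation loss; replaced here by a synthetic toy function."""
    made_up_effects = np.array([-1.2, -0.3, -0.5, -0.4])  # not real measurements
    return 3.0 + weights @ made_up_effects + rng.normal(0, 0.01)

# 1) Collect (mixture, metric) pairs from small proxy runs.
X = sample_mixtures(n=64, k=len(domains))
y = np.array([train_proxy_and_eval(w) for w in X])

# 2) Fit a regression from mixture weights to the target metric.
reg = LinearRegression().fit(X, y)

# 3) Search many candidate mixtures and keep the one with the lowest predicted loss.
candidates = sample_mixtures(n=100_000, k=len(domains))
best = candidates[np.argmin(reg.predict(candidates))]
print({d: round(float(w), 3) for d, w in zip(domains, best)})
```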
Introducing the Sailor-14B Models and the Sailor2 Project
We're thrilled to announce the release of the Sailor-14B models, including the Base and the Chat versions!
- Built upon the Qwen1.5-14B model, the Base version follows a similar training procedure to our Sailor-7B model.
- The Chat version is optimized with DPO on our in-house human preference dataset, yielding a better experience than our previous Chat models.
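For readers unfamiliar with DPO, here is a minimal sketch of the objective, assuming per-sequence log-probabilities from the policy and a frozen reference model are already computed. The tensor names and the beta value are illustrative assumptions, not the Sailor training configuration.

```python
# Minimal sketch of the Direct Preference Optimization (DPO) loss (assumed setup).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss for a batch of (chosen, rejected) preference pairs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with dummy log-probabilities for a batch of 4 preference pairs.
logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*logps).item())
```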
We're also excited to introduce the Sailor2 project, an open collaboration opportunity for the entire community!
The Sailor2 project aims to build an LLM with ~30B parameters, optimized for multiple South-East Asian languages, including Cebuano, Indonesian, Khmer, Lao, Minangkabau, Malay, Burmese, Sundanese, Javanese, Thai, and Vietnamese.
The model will undergo continual pre-training from a base model proficient in both Chinese and English, using nearly 800B SEA tokens, with expected performance comparable to the most advanced commercial models for the above SEA languages.
Contribute your data, expertise, and ideas to shape the future of open-source LLMs for the SEA region.
Everyone passionate about the SEA region is welcome aboard! Join the party and get involved by scanning the QR code!