Qian Liu

SivilTaram

AI & ML interests

Cooking cool things

Organizations

Microsoft, Spaces-explorers, Multimodal Art Projection, Sea AI Lab, Table Research Lab, XLang NLP Lab, BigCode, OpenLemur, Sea Language Team, LoraHub, StarPEFT, Qwen, S3Eval, GAIR-ProX, Learning UnkNown librAry, code-world-model, Social Post Explorers, Sailor2, Sea AI Lab-Sailor, OpenCoder, Sailor2 Evaluation, ml-fw-prerelease, Data Is Better Together Contributor

Posts 4

Post
Still following your human intuition to mix corpora from different sources for pre-training 🧠? Everyone says that data mixture has a big impact on model performance, but how - and why 🕵️? Did you know that web corpora are actually highly impactful for downstream tasks 🏆?

Check out our preprint "RegMix: Data Mixture as Regression for Language Model Pre-training" 📄

🔬 In this paper, we propose RegMix, an automatic data mixture method that achieves a 6.3% improvement over human selection on the widely used HellaSwag benchmark, and it needs only 2% extra training FLOPs! 📈
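
For readers wondering what "data mixture as regression" could look like in practice, here is a minimal sketch of one reading of the idea: run a few cheap proxy experiments on random mixtures, fit a regression from mixture weights to loss, then pick the candidate mixture with the lowest predicted loss. It is not the paper's implementation (see the linked code for that). The domain names, the synthetic linear losses, and every function name below are assumptions made up for illustration.

```python
import random

# Hypothetical domains and made-up per-domain loss contributions,
# used only to generate synthetic data for this sketch.
DOMAINS = ["web", "code", "books"]
TRUE_COEF = [3.0, 2.0, 4.0]


def sample_mixture(rng, k):
    """Draw k non-negative weights that sum to 1."""
    raw = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


def predict(coef, mixture):
    """Linear model without intercept (weights sum to 1, so an intercept is redundant)."""
    return sum(c * w for c, w in zip(coef, mixture))


def fit_linear(X, y):
    """Least squares via the normal equations, solved by Gauss-Jordan elimination."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)]
         + [sum(row[i] * t for row, t in zip(X, y))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


rng = random.Random(0)

# Step 1: stand-in for training small proxy models on random mixtures.
# Here the "measured" loss is synthetic; in practice it would come from real runs.
train_mixtures = [sample_mixture(rng, len(DOMAINS)) for _ in range(20)]
train_losses = [predict(TRUE_COEF, m) for m in train_mixtures]

# Step 2: fit a regression from mixture weights to loss.
coef = fit_linear(train_mixtures, train_losses)

# Step 3: score many candidate mixtures and keep the lowest predicted loss.
candidates = [sample_mixture(rng, len(DOMAINS)) for _ in range(1000)]
best = min(candidates, key=lambda m: predict(coef, m))

print({d: round(w, 3) for d, w in zip(DOMAINS, best)})
```

Because the synthetic loss here is purely linear, the search simply drifts toward the domain with the smallest coefficient; real losses are not that simple, so treat this only as an illustration of the fit-then-search loop.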

📄 Paper: RegMix: Data Mixture as Regression for Language Model Pre-training (2407.01492)
💻 Code: https://github.com/sail-sg/regmix
📊 Collection: sail/regmix-data-mixture-as-regression-6682b6caab37b9442877f0ce
🎮 Demo: https://huggingface.co/spaces/sail/RegMix
Post
Introducing Sailor-14B Model and Sailor2 Project 🚢

We're thrilled to announce the release of the Sailor-14B models, including the Base and the Chat versions!

✅ Built upon the Qwen1.5-14B model, the Base version follows a procedure similar to that of our Sailor-7B model.
✅ The Chat version is optimized using DPO on our in-house human preference dataset, yielding a better experience than our previous Chat models.

🏠 Home: https://sailorllm.github.io
🤗 Model: sail/Sailor-14B-Chat
💻 Demo: sail/Sailor-14B-Chat

We're also excited to introduce the Sailor2 project, ✨ an open collaboration opportunity for the entire community! ✨

🌏 The Sailor2 project aims to build an LLM with ~30B parameters, optimized for multiple South-East Asian languages, including Cebuano, Indonesian, Khmer, Lao, Minangkabau, Malay, Burmese, Sundanese, Javanese, Thai, and Vietnamese.

🎯 The model will undergo continual pre-training from a base model proficient in both Chinese and English, using nearly 800B SEA tokens, with expected performance comparable to the most advanced commercial models for the above SEA languages.

🤝 Contribute your data, expertise, and ideas to shape the future of open-source LLMs for the SEA region.

🌏 Everyone passionate about the SEA region is welcome aboard! Join the party and get involved by scanning the QR code! 🔍

Let's sail together and enjoy the journey! ⚓