SivilTaram
posted an update Jul 11
Still following your human intuition to mix corpora from different sources for pre-training 🧠? Everyone says that data mixture has a big impact on model performance, but how - and why 🕵️? Did you know that web corpora are actually highly impactful for downstream tasks 🏆?

Check out our preprint "RegMix: Data Mixture as Regression for Language Model Pre-training" 📄

🔬 In this paper, we propose RegMix, an automatic data mixture method that achieves a 6.3% improvement over human selection on the widely used HellaSwag benchmark, and it needs only 2% extra training FLOPs! 📈

📄 Paper: RegMix: Data Mixture as Regression for Language Model Pre-training (2407.01492)
💻 Code: https://github.com/sail-sg/regmix
📊 Collection: sail/regmix-data-mixture-as-regression-6682b6caab37b9442877f0ce
🎮 Demo: https://huggingface.co/spaces/sail/RegMix