pretraining dataset mixture

#22
by bpwl0121 - opened

hi,

thanks for your work!
why did you choose these particular datasets, and how did you decide on the mixing ratios?


best

It's quite amusing to see Apple still daring to call their model "state-of-the-art" in the latest revision from May 2024, especially when they seem fixated on competing with Meta's OPT from May 2022. They might have missed the massive strides made since then, with Llama 2 setting new benchmarks in July 2023 and Llama 3 raising the bar even higher in April 2024.

Apple would likely respond with their typical flair: "OpenELM represents a revolutionary leap in on-device AI processing, prioritizing our unmatched commitment to user privacy and performance. Our exquisitely curated data sources, such as RefinedWeb, RedPajama, PILE, and Dolma, were meticulously selected to ensure a rich and diverse dataset. The ratios were carefully balanced to achieve the perfect harmony of information diversity and model optimization. This groundbreaking approach allows us to deliver unparalleled AI capabilities directly on devices, embodying our core ethos of privacy, innovation, and seamless integration."
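For anyone wondering what a "mixture ratio" actually means here, a minimal sketch: during pretraining, each document is sampled from one of the source corpora with a fixed probability. The corpus names below just mirror the sources mentioned above, and the weights are made-up placeholders for illustration, not Apple's actual ratios.

```python
# Sketch of weighted sampling across pretraining corpora.
# The weights are hypothetical, NOT OpenELM's published mixture.
import random

mixture = {          # corpus -> sampling probability (placeholder values)
    "RefinedWeb": 0.50,
    "RedPajama":  0.25,
    "PILE":       0.15,
    "Dolma":      0.10,
}

def sample_corpus(rng: random.Random) -> str:
    """Pick which corpus the next training document is drawn from."""
    names, weights = zip(*mixture.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_corpus(rng) for _ in range(10_000)]
for name in mixture:
    # Empirical fraction should be close to the target ratio above.
    print(f"{name:10s} {draws.count(name) / len(draws):.3f}")
```

In practice the probabilities are usually tuned, or derived from per-corpus token counts and quality filtering, so that large web corpora dominate while smaller curated sources still add diversity; that trade-off is presumably what the original question is asking Apple to justify.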

But let's be honest, celebrating OpenELM as "state-of-the-art" feels like bragging about the original 2001 iPod in the era of music streaming services. It's clear Apple is playing catch-up, focusing on outdated benchmarks while the rest of the world has moved on to bigger and better things.
