We are proud to announce HuggingFaceFW/fineweb-2: a sparkling update to HuggingFaceFW/fineweb with 1000s of languages.
We applied the same data-driven approach that led to SOTA English performance in FineWeb to thousands of languages.
FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.
The dataset is released under the permissive ODC-By 1.0 license, and the code to reproduce it and our evaluations is public.
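If you want a quick look at the data, here is a minimal sketch of how you might stream one language subset with the datasets library; the subset name "fra_Latn" and the "text" field are assumptions for illustration, not confirmed by this post:

from datasets import load_dataset

# Stream a single (assumed) language subset so nothing is downloaded up front
fw2 = load_dataset("HuggingFaceFW/fineweb-2", name="fra_Latn", split="train", streaming=True)

for row in fw2.take(3):
    # "text" is the assumed column holding the document body; print a short preview
    print(row["text"][:200])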
We will very soon announce a big community project, and we are working on a blog post walking you through the entire dataset creation process. Stay tuned!
In the meantime, come ask us questions in our discussion space: HuggingFaceFW/discussion
H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi