47 106 392

Thomas Wolf PRO

thomwolf

https://thomwolf.io

AI & ML interests

NLP and open-source :-)

Recent Activity

liked a Space 1 day ago

ECCV/ECCV2024-papers

liked a model 1 day ago

IamCreateAI/Ruyi-Mini-7B

liked a dataset 1 day ago

data-is-better-together/fineweb-c

View all activity

Articles

LeMaterial: an open source initiative to accelerate materials discovery and research

13 days ago

• 30

Organizations

Posts 12

Post

4286

We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️languages.

We applied the same data-driven approach that led to SOTA English performance in🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the mean time come ask us question on our chat place: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi

Post

859

Exponentially growing number of open-source AI models over the course of the past 30 months – from a few thousands to over 1 million and more

Interactive data viz: huggingface/open-source-ai-year-in-review-2024

View all posts