LeMaterial

Enterprise
non-profit

AI & ML interests

AI4Science

Recent Activity

inelgnu  updated a dataset 4 days ago
LeMaterial/LeMat-BulkUnique
inelgnu  updated a dataset 4 days ago
LeMaterial/LeMat-Bulk
msiron  updated a dataset 6 days ago
LeMaterial/LeMat-BulkUnique

LeMaterial's activity

msiron 
updated a Space 8 days ago
inelgnu 
updated a Space 10 days ago
msiron 
updated a Space 12 days ago
msiron 
in LeMaterial/admin 13 days ago
lritchie 
in LeMaterial/admin 13 days ago
thomwolf 
posted an update 14 days ago
We are proud to announce HuggingFaceFW/fineweb-2: a sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️ languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the meantime, come ask us questions in our chat space: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi
thomwolf 
posted an update 17 days ago
thomwolf 
posted an update 19 days ago
thomwolf 
posted an update 28 days ago
thomwolf 
posted an update about 1 month ago