HuggingFaceFW-Dev

Recent Activity

davanstrien posted an update 6 days ago

Introducing FineWeb-C 🌍🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu, the community is labelling the educational quality of texts in many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c

nataliaElv posted an update 9 days ago

If you are still wondering how the FineWeb2 annotations are done, how to follow the guidelines or how Argilla works, this is your video!

I go through a few samples of the FineWeb2 dataset and classify them based on their educational content. Check it out!

https://www.youtube.com/watch?v=_-ORB4WAVGU

lhoestq posted an update 14 days ago

Made an HF Dataset editor à la Google Sheets here: lhoestq/dataset-spreadsheets

With Dataset Spreadsheets:
✏️ Edit datasets in the UI
🔗 Share link with collaborators
🐍 Use locally in DuckDB or Python

Available for the 100,000+ parquet datasets on HF :)
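For the local route, here is a minimal sketch of querying a dataset's Parquet files from Python via DuckDB's hf:// path support; the repo name below is a placeholder, so swap in the dataset you actually edited:

```python
# Sketch: query a Hub dataset's Parquet files locally with DuckDB.
# "username/my-dataset" is a placeholder repo, not a real dataset.
import duckdb

con = duckdb.connect()
df = con.sql(
    "SELECT * FROM 'hf://datasets/username/my-dataset/**/*.parquet' LIMIT 10"
).df()
print(df)
```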

nataliaElv posted an update 15 days ago

How do your annotations for FineWeb2 compare to your teammates'?

I started contributing some annotations to the FineWeb2 collaborative annotation sprint and I wanted to know if my labelling trends were similar to those of my teammates.

I did some analysis and I wasn't surprised to see that I'm a bit harsher in my evaluations than my teammates 😂


Do you want to see how your annotations compare to others?
👉 Go to this Gradio space: nataliaElv/fineweb2_compare_my_annotations
✍️ Enter the dataset that you've contributed to and your Hugging Face username.

How were your results?
- Contribute some annotations: data-is-better-together/fineweb-c
- Join your language channel in Rocket.Chat: HuggingFaceFW/discussion

thomwolf posted an update 17 days ago

We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️ languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the meantime, come ask us questions in our chat space: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi
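If you want to poke at the data right away, here is a minimal sketch using the 🤗 datasets library in streaming mode; the language config name is an assumption for illustration, so check the dataset card for the actual config list:

```python
# Sketch: stream a single language from FineWeb2 instead of downloading 8TB.
# "fra_Latn" is an assumed config name; see the dataset card for the real list.
from datasets import load_dataset

fw2 = load_dataset(
    "HuggingFaceFW/fineweb-2",
    name="fra_Latn",
    split="train",
    streaming=True,
)
for doc in fw2.take(3):
    print(doc["text"][:200])
```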

garrethlee posted an update 20 days ago

The latest o1 model from OpenAI still can't correctly answer whether 9.11 > 9.9 🤔

A possible explanation? Tokenization - and our latest work investigates how it affects a model's ability to do math!

In this blog post, we discuss:
🔢 The different ways numbers are tokenized in modern LLMs
🧪 Our detailed approach to comparing these various methods
🥪 How we got a free boost in arithmetic performance by adding a few lines of code to the base Llama 3 tokenizer
👑 and a definitive, best tokenization method for math in LLMs!

Check out our work here: huggingface/number-tokenization-blog
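To see the tokenization angle concretely, here is a quick sketch that inspects how a BPE tokenizer splits the two numbers; GPT-2's tokenizer is used purely as an illustration, while the blog compares several schemes (single-digit, three-digit, BPE, right-to-left grouping):

```python
# Sketch: inspect how a BPE tokenizer splits "9.9" vs "9.11".
# GPT-2's tokenizer is only an illustration of BPE number splitting.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for number in ["9.9", "9.11"]:
    print(number, "->", tok.tokenize(number))
# The pieces a model sees need not line up with place value, which is one
# hypothesis for why decimal comparisons like this can go wrong.
```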

dvilasuero posted an update 20 days ago

🌐 Announcing Global-MMLU: an improved MMLU Open dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.

Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Técnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.

🏷️ +200 contributors used Argilla to flag MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!

Thanks to this annotation process, the open dataset contains two subsets:

1. 🗽 Culturally Agnostic: no specific regional or cultural knowledge is required.
2. ⚖️ Culturally Sensitive: requires dialect, cultural, or geographic knowledge to answer correctly.

Moreover, we provide high-quality translations for 25 of the 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.

I hope this will ensure a better understanding of the limitations and challenges of making open AI useful for many languages.

Dataset: CohereForAI/Global-MMLU
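A minimal sketch of loading one language and checking the subset labels with 🤗 datasets; the "en" config, the "test" split, and the column name below are assumptions for illustration, so check the dataset card for the actual schema:

```python
# Sketch: load one language of Global-MMLU and inspect its columns.
# Config "en", split "test", and the label column name are assumptions.
from datasets import load_dataset

gmmlu = load_dataset("CohereForAI/Global-MMLU", "en", split="test")
print(gmmlu.column_names)
# Hypothetical column separating Culturally Agnostic from Culturally Sensitive rows:
if "cultural_sensitivity_label" in gmmlu.column_names:
    print(gmmlu.unique("cultural_sensitivity_label"))
```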

thomwolf posted an update 20 days ago

thomwolf posted an update 22 days ago

nataliaElv posted an update 23 days ago

We're so close to reaching 100 languages! Can you help us cover the remaining 200? Check if we're still looking for language leads for your language: nataliaElv/language-leads-dashboard

davanstrien posted an update 27 days ago

Increasingly, LLMs are becoming very useful for helping scale annotation tasks, i.e. labelling and filtering. When combined with structured generation, this can be a very scalable way of doing some pre-annotation without requiring a large team of human annotators.

However, there are quite a few cases where it still doesn't work well. This is a nice paper looking at the limitations of LLMs as annotators for low-resource languages: On Limitations of LLM as Annotator for Low Resource Languages (2411.17637).

Humans will still have an important role in the loop to help improve models for all languages (and domains).
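For context on the structured-generation pre-annotation mentioned above, here is a minimal sketch of constraining an LLM to a fixed label set; it assumes the outlines library's choice-constrained API and an illustrative small instruct model, and it is not the setup used in the paper:

```python
# Sketch: pre-annotate texts by constraining an LLM to a fixed label set.
# Assumes the outlines library's choice-constrained generation; the model
# name and labels are illustrative.
import outlines

model = outlines.models.transformers("HuggingFaceTB/SmolLM2-1.7B-Instruct")
labeler = outlines.generate.choice(model, ["educational", "not_educational"])

texts = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "BUY NOW!!! Limited offer, click here!!!",
]
for text in texts:
    label = labeler(f"Label the educational quality of this text.\nText: {text}\nLabel:")
    print(label, "|", text)
```

Constraining the output to a known label set is what makes this kind of pre-annotation easy to filter and review at scale.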

garrethlee posted an update 28 days ago

Does tokenizing numbers into single digits outperform three-digit or BPE tokenization for arithmetic tasks? We explore various tokenization methods in our upcoming blog (releasing next week 👀)!

🔹 Bringing objectivity to comparisons

Existing comparisons of number tokenization methods often ignore differences in models' compute budgets: larger tokenizer vocabularies naturally lead to more parameters, which makes comparisons of model performance less objective because more "learning" is being done by these bigger models.

We addressed this by keeping architectures consistent but adjusting the number of hidden layers to produce roughly equal parameter counts.
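As a rough illustration of that accounting (a simplified parameter formula and made-up sizes, not the blog's actual configurations):

```python
# Rough parameter accounting: a bigger vocabulary adds embedding parameters,
# so the bigger-vocab model gets fewer layers to keep totals comparable.
# Formula ignores biases, norms, and positional embeddings; sizes are made up.
def approx_params(vocab_size: int, d_model: int, n_layers: int) -> int:
    embeddings = vocab_size * d_model          # token embedding matrix
    per_layer = 12 * d_model ** 2              # attention + MLP blocks, roughly
    return embeddings + n_layers * per_layer

small_vocab = approx_params(vocab_size=32_000, d_model=2048, n_layers=25)
large_vocab = approx_params(vocab_size=64_000, d_model=2048, n_layers=24)
print(f"{small_vocab:,} vs {large_vocab:,}")   # roughly matched totals
```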

🔹 Key results

We trained models on the same data mix and evaluated their performance on various arithmetic tasks (digits, operations, floats vs. ints):

- When splitting evals based on operators, single-digit tokenization consistently outperformed other methods.
- Right-to-left tokenization (which I covered in a previous post) matched or exceeded left-to-right approaches in all tasks.

All in all, single-digit tokenization performs best compared to the other methods, and, echoing our previous post's finding, R2L works better than L2R tokenization, although the gap is not as large as the one between single-digit and the rest!
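To make the R2L/L2R distinction concrete, here is a small sketch of the two ways of chunking a number into three-digit groups before tokenization (an illustration of the idea, not the blog's actual pre-tokenization code):

```python
# Sketch: left-to-right vs right-to-left three-digit grouping of a number.
import re

def group_l2r(digits: str) -> list[str]:
    # "1234567" -> ["123", "456", "7"]
    return re.findall(r"\d{1,3}", digits)

def group_r2l(digits: str) -> list[str]:
    # "1234567" -> ["1", "234", "567"]  (groups align with thousands)
    reversed_chunks = re.findall(r"\d{1,3}", digits[::-1])
    return [chunk[::-1] for chunk in reversed(reversed_chunks)]

print(group_l2r("1234567"))  # ['123', '456', '7']
print(group_r2l("1234567"))  # ['1', '234', '567']
```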

The wait is almost over 🤗, the full report is coming next week - stay tuned!

nataliaElv posted an update 29 days ago

Would you like to get a high-quality dataset to pre-train LLMs in your language? 🌏

At Hugging Face we're preparing a collaborative annotation effort to build an open-source multilingual dataset as part of the Data is Better Together initiative.

Follow the link below, check if your language is listed and sign up to be a Language Lead!

https://forms.gle/s9nGajBh6Pb9G72J6

davanstrien posted an update about 1 month ago

First dataset for the new Hugging Face Bluesky community organisation: bluesky-community/one-million-bluesky-posts 🦋

📊 1M public posts from Bluesky's firehose API
🔍 Includes text, metadata, and language predictions
🔬 Perfect for experimenting with ML for Bluesky 🤗

Excited to see people build more open tools for a more open social media platform!
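If you want to take a look, here is a minimal sketch of streaming a few records with 🤗 datasets; the split name is an assumption and the column names aren't spelled out here, so inspect the first record before relying on a schema:

```python
# Sketch: stream a few records from the Bluesky posts dataset and inspect the schema.
# The "train" split is assumed; check the dataset viewer for the actual layout.
from datasets import load_dataset

posts = load_dataset(
    "bluesky-community/one-million-bluesky-posts",
    split="train",
    streaming=True,
)
first = next(iter(posts))
print(sorted(first.keys()))   # the actual columns (text, metadata, ...)
print(first)
```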

davanstrien posted an update about 1 month ago

The Bluesky AT Protocol unlocks exciting possibilities:
- Building custom feeds using ML
- Creating dashboards for data exploration
- Developing custom models for Bluesky

To gather Bluesky resources on the Hub, I've created a community org: https://huggingface.co/bluesky-community

My first rather modest contribution is a dashboard that shows the number of posts every second. Drinking straight from the firehose API 🚰

bluesky-community/bluesky-posts-over-time
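A minimal sketch of the same idea with the atproto SDK's firehose client, counting events per second; it counts raw repo-commit messages as a rough proxy rather than decoding each commit into individual posts, and it is not the space's actual code:

```python
# Sketch: count Bluesky firehose events per second with the atproto SDK.
# Repo-commit messages are used as a rough proxy for post volume; the
# dashboard above may decode commits to count created posts only.
import time
from atproto import FirehoseSubscribeReposClient, parse_subscribe_repos_message

client = FirehoseSubscribeReposClient()
state = {"window_start": time.time(), "count": 0}

def on_message(message) -> None:
    parse_subscribe_repos_message(message)  # decode/validate the frame
    state["count"] += 1
    now = time.time()
    if now - state["window_start"] >= 1.0:
        print(f'{state["count"]} events/s')
        state["window_start"], state["count"] = now, 0

client.start(on_message)  # blocks; stop with Ctrl+C
```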

nataliaElv posted an update about 1 month ago

You can now add your Bluesky handle to your Hugging Face profile! 🦋
Have you noticed?

thomwolf posted an update about 1 month ago

davanstrien posted an update about 1 month ago