Bluesky Community

AI & ML interests

Tools for Bluesky 🦋

Recent Activity

bluesky-community's activity

davanstrien 
posted an update 5 days ago
Introducing FineWeb-C 🌐🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu, the community is labelling the educational quality of texts in many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c
clem 
posted an update 8 days ago
Coming back to Paris Friday to open our new Hugging Face office!

We're at capacity for the party, but add your name to the waiting list, as we're trying to reserve the Passage du Caire for extra space for robots 🤖🦾🦿

https://t.co/enkFXjWndJ
nataliaElv 
posted an update 9 days ago
If you are still wondering how the FineWeb2 annotations are done, how to follow the guidelines or how Argilla works, this is your video!

I go through a few samples of the FineWeb2 dataset and classify them based on their educational content. Check it out!

https://www.youtube.com/watch?v=_-ORB4WAVGU
nataliaElv 
posted an update 15 days ago
How do your annotations for FineWeb2 compare to your teammates'?

I started contributing some annotations to the FineWeb2 collaborative annotation sprint and I wanted to know if my labelling trends were similar to those of my teammates.

I did some analysis and I wasn't surprised to see that I'm a bit harsher in my evaluations than my teammates 😂


Do you want to see how your annotations compare to others?
👉 Go to this Gradio space: nataliaElv/fineweb2_compare_my_annotations
✍️ Enter the dataset that you've contributed to and your Hugging Face username.

How were your results?
- Contribute some annotations: data-is-better-together/fineweb-c
- Join your language channel in Rocket.Chat: HuggingFaceFW/discussion
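
For anyone who would rather check this kind of comparison offline, here is a minimal sketch. It is not how the Gradio space is implemented. It assumes the annotations are available as `(username, score)` pairs with numeric scores, which is a simplification made for illustration. The function name `compare_to_team` and the sample data are hypothetical.

```python
from statistics import mean

def compare_to_team(annotations, username):
    """Compare one annotator's average score with everyone else's.

    `annotations` is a list of (username, score) pairs (assumed format).
    A negative difference means `username` scores lower (harsher) on average.
    """
    mine = [score for user, score in annotations if user == username]
    others = [score for user, score in annotations if user != username]
    if not mine or not others:
        raise ValueError("need annotations from the user and from teammates")
    return {
        "mine": mean(mine),
        "team": mean(others),
        "difference": mean(mine) - mean(others),
    }

# Hypothetical sample data, not taken from any real dataset.
sample = [("alice", 1), ("alice", 2), ("bob", 3), ("bob", 4), ("carol", 2)]
result = compare_to_team(sample, "alice")
```

With this sample, `alice` averages lower than her teammates, so the difference is negative.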
cfahlgren1 
posted an update 22 days ago
You can just ask things 🗣️

"show me messages in the coding category that are in the top 10% of reward model scores"

Download really high quality instructions from the Llama3.1 405B synthetic dataset 🔥

argilla/magpie-ultra-v1.0
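
As a rough illustration of what a request like the one quoted above amounts to, the filter can be sketched in plain Python. The field names `category` and `score`, the sample rows, and the function name are assumptions for illustration, not the actual schema of argilla/magpie-ultra-v1.0. The sketch also assumes "top 10%" is measured within the category rather than across the whole dataset.

```python
import math

def top_percent_in_category(rows, category, percent=10):
    """Return the rows of `category` whose score ranks in the top `percent`."""
    in_category = [r for r in rows if r["category"] == category]
    if not in_category:
        return []
    # Highest scores first, then keep the leading slice (at least one row).
    ranked = sorted(in_category, key=lambda r: r["score"], reverse=True)
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:keep]

# Hypothetical rows: 20 coding messages plus one from another category.
rows = [{"category": "coding", "score": s / 10, "text": f"msg {s}"} for s in range(20)]
rows.append({"category": "math", "score": 5.0, "text": "other"})

best = top_percent_in_category(rows, "coding")
```

Here `best` holds the two highest-scoring coding messages, since 10% of 20 rows is 2.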

nataliaElv 
posted an update 23 days ago
We're so close to reaching 100 languages! Can you help us cover the remaining 200? Check if we're still looking for language leads for your language: nataliaElv/language-leads-dashboard
clem 
posted an update 24 days ago
Six predictions for AI in 2025 (and a review of how my 2024 predictions turned out):

- There will be the first major public protest related to AI
- A big company will see its market cap divided by two or more because of AI
- At least 100,000 personal AI robots will be pre-ordered
- China will start to lead the AI race (as a consequence of leading the open-source AI race).
- There will be big breakthroughs in AI for biology and chemistry.
- We will begin to see the economic and employment growth potential of AI, with 15M AI builders on Hugging Face.

How my predictions for 2024 turned out:

- A hyped AI company will go bankrupt or get acquired for a ridiculously low price
✅ (Inflection, Adept AI, ...)

- Open-source LLMs will reach the level of the best closed-source LLMs
✅ with QwQ and dozens of others

- Big breakthroughs in AI for video, time-series, biology and chemistry
✅ for video 🔴 for time-series, biology and chemistry

- We will talk much more about the cost (monetary and environmental) of AI
✅ Monetary 🔴 Environmental (😢)

- A popular piece of media will be mostly AI-generated
✅ with NotebookLM by Google

- 10 million AI builders on Hugging Face, leading to no increase in unemployment
🔜 currently 7M AI builders on Hugging Face
cfahlgren1 
posted an update 24 days ago
We just dropped an LLM inside the SQL Console 🤯

The amazing, new Qwen/Qwen2.5-Coder-32B-Instruct model can now write SQL for any Hugging Face dataset ✨

It's 2025, you shouldn't be hand-writing SQL! This is a big step toward making it possible for anyone to do in-depth analysis on a dataset. Let us know what you think 🤗
clem 
posted an update 26 days ago
Hugging Face is becoming the best place to share the most viral AI apps with Spaces.

Kolors Virtual Try-On just crossed 6,000,000 unique visitors and is now the #5 most popular Space. Congrats to the Kwai Kolors team!

Kwai-Kolors/Kolors-Virtual-Try-On
davanstrien 
posted an update 27 days ago
LLMs are becoming increasingly useful for scaling annotation tasks, i.e. labelling and filtering. Combined with structured generation, this can be a very scalable way of doing some pre-annotation without requiring a large team of human annotators.

However, there are quite a few cases where it still doesn't work well. This paper looks at the limitations of LLMs as annotators for low-resource languages: On Limitations of LLM as Annotator for Low Resource Languages (2411.17637).

Humans will still have an important role in the loop to help improve models for all languages (and domains).
nataliaElv 
posted an update 29 days ago
Would you like to get a high-quality dataset to pre-train LLMs in your language? 🌏

At Hugging Face we're preparing a collaborative annotation effort to build an open-source multilingual dataset as part of the Data is Better Together initiative.

Follow the link below, check if your language is listed and sign up to be a Language Lead!

https://forms.gle/s9nGajBh6Pb9G72J6