smol-explorers

community

AI & ML interests

None defined yet.

Recent Activity

hlarcher authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

plaguss authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

andito authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

View all activity

smol-explorers's activity

andito

posted an update 4 days ago

Post

2348

Extremely bullish on @CohereForAI 's Aya Vision (8B & 32B) - new SOTA open-weight VLMs

- 8B wins up to 81% of the time in its class, better than Gemini Flash
- 32B beats Llama 3.2 90B!
- Covers 23 languages, excels in image captioning, VQA & more
- Integrated on transformers from Day 0!

Efficient multimodal models are here to stay!!🔥
Check out their blog! https://huggingface.co/blog/aya-vision

hlarcher

authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 199

plaguss

authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 199

andito

authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 199

eliebak

authored a paper about 1 month ago

INTELLECT-1 Technical Report

Paper • 2412.01152 • Published Dec 2, 2024 • 1

loubnabnl

authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 199

eliebak

authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 199

anton-l

authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 199

thomwolf

authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 199

lvwerra

authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 199

andito

posted an update about 1 month ago

Post

1630

𝗜𝗻𝘁𝗿𝗼𝗱𝘂𝗰𝗶𝗻𝗴 𝘁𝗵𝗲 𝘄𝗼𝗿𝗹𝗱'𝘀 𝘀𝗺𝗮𝗹𝗹𝗲𝘀𝘁 𝘃𝗶𝘀𝗶𝗼𝗻 𝗹𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗺𝗼𝗱𝗲𝗹!

We’re thrilled to share 𝗦𝗺𝗼𝗹𝗩𝗟𝗠 (256M & 500M)—the smallest Visual Language Models ever built. Think: running on <1GB of GPU memory—you can fine-tune it on your laptop and run it on your toaster!

Why It’s Game-Changing:
- 𝗢𝘂𝘁𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝘀 𝗟𝗮𝗿𝗴𝗲𝗿 𝗠𝗼𝗱𝗲𝗹𝘀: Even the 256M model surpasses our SOTA 80B-parameter model from just 17 months ago. Over 300x reduction!
𝗠𝗶𝗴𝗵𝘁𝘆 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆: The 256M version delivers 80% of our 2.2B model’s performance, and the 500M version hits 90%
𝗟𝗶𝗴𝗵𝘁𝗻𝗶𝗻𝗴-𝗙𝗮𝘀𝘁 𝗦𝗲𝗮𝗿𝗰𝗵: SmolVLM integrates with ColiPali for state-of-the-art retrieval speeds—on par with models 10x bigger. That means cheaper, faster indexing and real-world impact.

What’s New Under the Hood:
- 𝗡𝗲𝘄 𝗩𝗶𝘀𝗶𝗼𝗻 𝗘𝗻𝗰𝗼𝗱𝗲𝗿: Smaller overall size (400M -> 93M), but with higher resolution.
- 𝗛𝗶𝗴𝗵𝗲𝗿 𝗣𝗶𝘅𝗲𝗹𝘀/𝗧𝗼𝗸𝗲𝗻: 4096 vs. 1820—more efficient image processing.
- 𝗦𝗺𝗮𝗿𝘁 𝗧𝗼𝗸𝗲𝗻𝗶𝘇𝗮𝘁𝗶𝗼𝗻: Faster training and a performance boost.

Check our blog: https://huggingface.co/blog/smolervlm
The models: HuggingFaceTB/smolvlm-256m-and-500m-6791fafc5bb0ab8acc960fb0
The demo: HuggingFaceTB/SmolVLM-256M-Demo

1 reply

·

andito

updated 2 models about 2 months ago

HuggingFaceTB/SmolVLM-500M-Base

Image-Text-to-Text • Updated Jan 20 • 408 • 8

HuggingFaceTB/SmolVLM-256M-Base

Image-Text-to-Text • Updated Jan 20 • 2.14k • 9

orrzohar

authored a paper about 2 months ago

Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

Paper • 2501.09755 • Published Jan 16 • 34

lvwerra

authored a paper about 2 months ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14 • 56

thomwolf

authored a paper about 2 months ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14 • 56

hlarcher

posted an update about 2 months ago

Post

1077

We are introducing multi-backend support in Hugging Face Text Generation Inference!
With new TGI architecture we are now able to plug new modeling backends to get best performances according to selected model and available hardware. This first step will very soon be followed by the integration of new backends (TRT-LLM, llama.cpp, vLLM, Neuron and TPU).

We are polishing the TensorRT-LLM backend which achieves impressive performances on NVIDIA GPUs, stay tuned 🤗 !

Check out the details: https://huggingface.co/blog/tgi-multi-backend

anton-l

posted an update 3 months ago

Post

2555

Introducing 📐𝐅𝐢𝐧𝐞𝐌𝐚𝐭𝐡: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

We build the dataset by:
🛠️ carefully extracting math data from Common Crawl;
🔎 iteratively filtering and recalling high quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.

We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observe notable gains compared to the baseline model and other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! 🚀
We’re also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2

thomwolf

posted an update 3 months ago

Post

5998

We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️languages.

We applied the same data-driven approach that led to SOTA English performance in🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the mean time come ask us question on our chat place: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi

2 replies

·

thomwolf

posted an update 3 months ago

Post

1595

Exponentially growing number of open-source AI models over the course of the past 30 months – from a few thousands to over 1 million and more

Interactive data viz: huggingface/open-source-ai-year-in-review-2024