All HF Hub posts

Norod78
posted an update 2 days ago
Multilingual Tokenization Showdown
Analyzing 12 LLM Tokenizers Across 204 Languages.

First, I've created a dataset with Wikipedia's "Cat" article text in 272 languages:
Norod78/WikiCat-Multilingual

For each language entry with at least 100 words, I tokenized the text using 12 tokenizers and calculated the "Characters per token" and "Words per token" ratios. The higher these ratios are, the more information each token represents on average for that language (which may in turn let an LLM learn more per parameter when trained on a dataset in that language).
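
As a rough illustration of the two ratios described above, here is a minimal sketch. It is not the code from the linked repo: the function names, the whitespace-based word count, and the toy tokenizer are all assumptions made for the example. Any callable that maps a string to a list of tokens can be passed in, for example the `tokenize` method of a Hugging Face tokenizer.

```python
from typing import Callable, Dict, List


def tokenization_ratios(text: str, tokenize: Callable[[str], List[str]]) -> Dict[str, float]:
    """Return characters-per-token and words-per-token for one text."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    # Assumption: words are counted by splitting on whitespace.
    words = text.split()
    return {
        "chars_per_token": len(text) / len(tokens),
        "words_per_token": len(words) / len(tokens),
    }


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into pieces of up to 3 characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return pieces


ratios = tokenization_ratios("the cat sat on the mat", toy_tokenize)
print(ratios)
```

Note that a whitespace word count is only a rough proxy, and it breaks down for languages written without spaces between words, so the "Words per token" figure should be read with that in mind.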

You can see a slideshow summary of the results here:
https://norod.github.io/wikicat-tokenizer-eval/tokenizer-slideshow.html

I hope I interpreted the results correctly. I've made the code available on GitHub, so you can re-create the raw results JSONL with this repo:
https://github.com/Norod/wikicat-tokenizer-eval

Post on X:
https://x.com/Norod78/status/1984366900550266999

DavidAU
posted an update 3 days ago
nouamanetazi
posted an update 4 days ago
After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. 🔥

Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running at 60% efficiency, the problem isn't your model. It's most probably a misuse of the hardware. 🛠️

Questions that seemed simple but had no clear answers: Why is MoE training slower than dense models? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?

That's why we built The Smol Training Playbook 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the infrastructure layer that most teams get wrong.

We validated real vs theoretical bandwidth across the entire stack: HBM3 hitting 3 TB/s, NVLink 4.0 reaching 786 GB/s, PCIe Gen4 at 14.2 GB/s. Then we ran collective operations across 128 GPUs (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from 480 GB/s on a single node to 320-350 GB/s across 16 nodes.
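
To put the all-reduce degradation in proportion, here is a small arithmetic sketch. It reads the figures above as 480 GB/s on a single node and 320 to 350 GB/s across 16 nodes; the helper name `retained_fraction` is hypothetical and not taken from the playbook.

```python
def retained_fraction(single_node_gbps: float, multi_node_gbps: float) -> float:
    """Fraction of single-node all-reduce bandwidth still available at larger scale."""
    if single_node_gbps <= 0:
        raise ValueError("single-node bandwidth must be positive")
    return multi_node_gbps / single_node_gbps


single_node = 480.0  # GB/s, single-node all-reduce figure quoted in the post
for multi_node in (320.0, 350.0):  # GB/s, range quoted for 16 nodes
    frac = retained_fraction(single_node, multi_node)
    print(f"{multi_node:.0f} GB/s is {frac:.1%} of {single_node:.0f} GB/s")
```

Under that reading, the 16-node runs keep roughly 67% to 73% of the single-node all-reduce bandwidth.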

If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.

The Smol Training Playbook: https://lnkd.in/e5MKXUHS

Shared with ❤️ by the HuggingFace team
Shivansh000
posted an update 2 days ago
I am dedicating this weekend to practicing/reading the latest b(ook)log from Hugging Face. It is meant to be a guide for anyone trying to go from “we have a great dataset and GPUs” to “we built a really strong model.” Will share thoughts upon completion.

Thanks for the treat @eliebak @ThomasWolf and HF team!

HuggingFaceTB/smol-training-playbook
sergiopaniego
posted an update 3 days ago
Kseniase
posted an update about 18 hours ago
11 Fascinating new Policy Optimization techniques

Policy optimization (PO) algorithms are central to training AI models with preference-based feedback. In recent weeks, numerous new PO methods have emerged that build on or replace the popular PPO and GRPO, solving their issues. Here are 11 of them:

1. BAlanced Policy Optimization (BAPO) → BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping (2510.18927)
Dynamically adjusts the clipping bounds in PPO-style updates to balance positive and negative gradients and prevent entropy collapse

2. Training-Free GRPO → Training-Free Group Relative Policy Optimization (2510.08191)
Instead of using numeric rewards, it compares rollouts semantically to distill useful knowledge as a token prior, which is then applied during inference to guide the model's behavior

3. Asymmetric Importance Sampling Policy Optimization (ASPO) → ASPO: Asymmetric Importance Sampling Policy Optimization (2510.06062)
Fixes imbalanced token weighting in LLM training. It flips the importance sampling ratios for positive tokens to correct over- and under-updates, and adds a soft dual-clipping step to keep gradients stable

4. In-Context Steered Policy Optimization (ICPO) → https://arxiv.org/abs/2510.26519
Uses a model's own in-context learning ability to guide training with existing data. It combines Mixed-Policy GRPO with Implicit Expert Forcing to expand exploration and adds Expert Region Reject Sampling and Annealed Expert-Bonus Reward Shaping to ensure stability and balanced expert influence

5. Graph-Enhanced Policy Optimization (GEPO) → https://arxiv.org/abs/2510.26270
Builds a graph of an agent's experiences to understand how different states connect, guide exploration and assign rewards more effectively

6. Information Gain-based Policy Optimization (IGPO) → Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents (2510.14967)
Uses the model's own belief updates to create dense, informative feedback for smoother multi-turn learning
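
Several of the methods above modify the clipping step of PPO-style updates. As background, here is a minimal sketch of the standard PPO clipped surrogate with the lower and upper clip bounds exposed as separate parameters. It is not an implementation of BAPO, ASPO, or any other method in the list: the bounds here are fixed rather than adapted during training, and the function and parameter names are hypothetical.

```python
import math
from typing import Sequence


def clipped_surrogate(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    advantages: Sequence[float],
    clip_low: float = 0.2,
    clip_high: float = 0.2,
) -> float:
    """Mean PPO-style clipped objective with separate lower and upper clip bounds."""
    if not advantages:
        raise ValueError("need at least one token")
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Importance ratio between the new and old policy for this token.
        ratio = math.exp(lp_new - lp_old)
        # Clamp the ratio into [1 - clip_low, 1 + clip_high].
        clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
        # Pessimistic choice between the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# A token whose probability doubled, with a positive advantage:
# the contribution is capped by the upper bound.
symmetric = clipped_surrogate([math.log(2.0)], [0.0], [1.0])
wider_upper = clipped_surrogate([math.log(2.0)], [0.0], [1.0], clip_high=0.5)
print(symmetric, wider_upper)
```

Raising `clip_high` alone lets positive-advantage tokens contribute more before being capped, which is the kind of asymmetry that adaptive-clipping methods tune automatically.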

Read further below ⬇️
If you like this, also subscribe to the Turing Post: https://www.turingpost.com/subscribe
ronantakizawa
posted an update 2 days ago
Introducing the Medical-o1-Reasoning-SFT-Japanese dataset 🎉

This is a Japanese dataset consisting of questions, reasoning, and answers for complex medical topics.

#japanese #medical #dataset


ronantakizawa/Medical-o1-Reasoning-SFT-Japanese
unmodeled-tyler
posted an update 2 days ago
New Datasets Published:
vanta-research/poetic-imagery-small
vanta-research/excitement-small

We are open sourcing two of our datasets today, which were used in the training of Apollo Astralis 8B and 4B.

The first dataset, poetic-imagery-small, is designed to give the model's responses a bit of "depth" in order to encourage curiosity and thought from the user.

Additionally, the excitement-small dataset is designed to teach the model how to use "excited" language conversationally. This dataset was used on both Apollo Astralis models, which effectively demonstrate general excitement during user interaction.

VANTA Research is an AI safety project which aims to research and develop language models aligned for all types of thinking. These datasets were created in line with that mission and with rigorous AI safety standards.
DmitryRyumin
posted an update 2 days ago
🚀👌🌟 New Research Alert - ICCV 2025 (Oral)! 🌟🤌🚀
📄 Title: Understanding Co-speech Gestures in-the-wild 🔍

📝 Description: JEGAL is a tri-modal model that learns from gestures, speech and text simultaneously, enabling devices to interpret co-speech gestures in the wild.

👥 Authors: @sindhuhegde , K R Prajwal, Taein Kwon, and Andrew Zisserman

📅 Conference: ICCV, 19 – 23 Oct, 2025 | Honolulu, Hawai'i, USA 🇺🇸

📄 Paper: Understanding Co-speech Gestures in-the-wild (2503.22668)

🌐 Web Page: https://www.robots.ox.ac.uk/~vgg/research/jegal
📁 Repository: https://github.com/Sindhu-Hegde/jegal
📺 Video: https://www.youtube.com/watch?v=TYFOLKfM-rM

🚀 ICCV-2023-25-Papers: https://github.com/DmitryRyumin/ICCV-2023-25-Papers

🚀 Added to the Human Modeling Section: https://github.com/DmitryRyumin/ICCV-2023-25-Papers/blob/main/sections/2025/main/human-modeling.md

📚 More Papers: more cutting-edge research presented at other conferences in the DmitryRyumin/NewEraAI-Papers curated by @DmitryRyumin

🔍 Keywords: #CoSpeechGestures #GestureUnderstanding #TriModalRepresentation #MultimodalLearning #AI #ICCV2025 #ResearchHighlight
prithivMLmods
posted an update 3 days ago
A small blog post titled "Hall of Multimodal OCR VLMs and Demonstrations" has been published at ↗️ https://huggingface.co/blog/prithivMLmods/multimodal-ocr-vlms on behalf of strangervisionhf.

It discusses the latest trends in OCR models, the multilingual support offered by modern OCR systems, their unique capabilities, OCR benchmark model comparisons, transformer-based implementations, and strategies for streamlining transformers compatibility.