Nanotron Research

community

AI & ML interests

Large scale distributed AI model training, model parallelisation, low-level GPU acceleration, make GPUs go brrrrr

nanotron's activity

thomwolf

authored a paper 5 days ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published 6 days ago • 154

nouamanetazi

authored a paper 5 days ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published 6 days ago • 154

eliebak

authored a paper 5 days ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published 6 days ago • 154

lvwerra

authored a paper 5 days ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published 6 days ago • 154

loubnabnl

authored a paper 5 days ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published 6 days ago • 154

thomwolf

authored a paper 5 days ago

YourBench: Easy Custom Evaluation Sets for Everyone

Paper • 2504.01833 • Published 11 days ago • 16

thomwolf

posted an update 14 days ago

The new DeepSite space is really insane for vibe-coders
enzostvs/deepsite

With the wave of vibe-coding-optimized LLMs like the latest open-source DeepSeek model (version V3-0324), you can basically prompt out of the box and create any app or game in one shot.

It feels so powerful to me, no more complex framework or under-the-hood prompt engineering to have a working text-to-app tool.

AI is eating the world and *open-source* AI is eating AI itself!

PS: even more meta, the DeepSite app and the DeepSeek model are both fully open-source => time to start recursively improving?

PPS: you still need some inference hosting unless you're running the 600B param model at home, so check the very nice list of HF Inference Providers for this model: deepseek-ai/DeepSeek-V3-0324

nouamanetazi

updated a dataset about 1 month ago

nanotron/ultrascale-playbook-data

Updated Mar 12 • 4.83k • 5

nouamanetazi

updated a Space about 1 month ago

Predict Memory

🧮

Calculate memory usage from model configurations

thomwolf

posted an update about 1 month ago

We've kept pushing our Open-R1 project, an open initiative to replicate and extend the techniques behind DeepSeek-R1.

And even we were mind-blown by the results we got with this latest model we're releasing: ⚡️OlympicCoder (open-r1/OlympicCoder-7B and open-r1/OlympicCoder-32B)

It's beating Claude 3.7 on (competitive) programming, a domain Anthropic has historically been really strong at, and it's getting close to o1-mini/R1 on olympiad-level coding with just 7B parameters!

And the best part is that we're open-sourcing everything about its training dataset, the new IOI benchmark, and more in our Open-R1 progress report #3: https://huggingface.co/blog/open-r1/update-3

Datasets we are releasing:
- open-r1/codeforces
- open-r1/codeforces-cots
- open-r1/ioi
- open-r1/ioi-test-cases
- open-r1/ioi-sample-solutions
- open-r1/ioi-cots
- open-r1/ioi-2024-model-solutions

eliebak

posted an update about 1 month ago

Google just dropped an exciting technical report for the brand-new Gemma3 model! 🚀 Here are my personal notes highlighting the most intriguing architectural innovations, design choices, and insights from this release:

1) Architecture choices:
> No more softcapping, replaced by QK-Norm
> Both Pre AND Post Norm
> Wider MLP than Qwen2.5, ~same depth
> SWA with a 5:1 local-to-global layer ratio and a 1024-token window (very small and cool ablation in the paper!)
> No MLA to save KV cache, SWA does the job!
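
To make the SWA point concrete, here is a minimal sketch of how a 5:1 local-to-global layout with a 1024-token window shrinks the KV cache. The function names, the 12-layer depth, and the exact position of the global layer inside each group of six are assumptions for illustration, not details taken from the report.

```python
def layer_kinds(num_layers: int, local_per_global: int = 5) -> list[str]:
    """Assumed layout: `local_per_global` sliding-window layers, then one global layer."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]


def can_attend(query_pos: int, key_pos: int, kind: str, window: int = 1024) -> bool:
    """Causal mask; local layers additionally only see the last `window` tokens."""
    if key_pos > query_pos:
        return False
    return kind == "global" or query_pos - key_pos < window


def kv_cache_entries(num_layers: int, context_len: int,
                     window: int = 1024, local_per_global: int = 5) -> int:
    """Count cached (layer, token) pairs: global layers keep everything,
    local layers keep at most `window` tokens."""
    total = 0
    for kind in layer_kinds(num_layers, local_per_global):
        total += context_len if kind == "global" else min(window, context_len)
    return total


if __name__ == "__main__":
    # Hypothetical 12-layer model at a 32k context.
    swa = kv_cache_entries(12, 32_768)
    full = 12 * 32_768
    print(f"SWA cache is {swa / full:.1%} of a full-attention cache")
```

In this toy setting only the two global layers pay the full context cost, which is why the cache grows so slowly with context length.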

2) Long context
> Only increase the RoPE base frequency in the global layers (to 1M)
> Confirmation that it's harder to do long context for smol models, no 128k for the 1B
> Pretrained with 32k context? Seems very high
> No YaRN or Llama 3-like RoPE extension
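
A rough sketch of why raising the RoPE base helps the global layers: the slowest-rotating dimension pair gets a much longer wavelength, so distant positions stay distinguishable. The head dimension of 128 and the 10,000 base used for comparison are assumptions for illustration; only the 1M value comes from the notes above.

```python
import math


def rope_inv_freq(head_dim: int, base: float) -> list[float]:
    """Standard RoPE inverse frequencies: base ** (-2i / head_dim) for each dimension pair."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Wavelength, in tokens, of the slowest-rotating dimension pair."""
    return 2 * math.pi / rope_inv_freq(head_dim, base)[-1]


if __name__ == "__main__":
    for base in (10_000.0, 1_000_000.0):  # assumed comparison base vs. the 1M from the notes
        print(f"base={base:>9.0f}  longest wavelength ~ {longest_wavelength(128, base):,.0f} tokens")
```

With these assumed numbers, the longest wavelength at base 10,000 is shorter than a 128k context, while at base 1M it comfortably exceeds it.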

3) Distillation
> Only keep the first 256 logits for the teacher
> Ablation on the teacher gap (tl;dr you need some "patience" to see that using a small teacher is better)
> On-policy distillation yeahh (by @agarwl_ et al), not sure if the teacher gap behaves the same here, curious if someone has more info?
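
As a sketch of what distilling from a truncated teacher can look like: keep only k teacher logits per token, renormalise them, and train the student with cross-entropy against that sparse target. Selecting the k largest logits is an assumption made here for simplicity (the report may select them differently), and all function names are hypothetical.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def truncated_teacher(teacher_logits: list[float], k: int) -> dict[int, float]:
    """Keep the k largest teacher logits (assumed selection rule) and renormalise.
    Tokens outside the kept set implicitly get probability 0."""
    top = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]
    probs = softmax([teacher_logits[i] for i in top])
    return dict(zip(top, probs))


def distill_loss(student_logits: list[float], teacher_logits: list[float], k: int = 256) -> float:
    """Cross-entropy of the student against the truncated teacher distribution."""
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in student_logits))
    target = truncated_teacher(teacher_logits, k)
    return -sum(p * (student_logits[i] - log_z) for i, p in target.items())


if __name__ == "__main__":
    teacher = [2.0, 0.0, 1.0, -1.0]
    print(distill_loss(teacher, teacher, k=2))                # student agrees with teacher
    print(distill_loss([-1.0, 1.0, 0.0, 2.0], teacher, k=2))  # student disagrees
```

Storing only k values per token instead of the full vocabulary is what makes it practical to precompute teacher targets.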

4) Others
> Checkpoint with QAT, that's very cool
> RL using improved versions of BOND, WARM, and WARP: a good excuse to look at @ramealexandre's papers
> Only use ZeRO-3, no TP/PP if I understand correctly?
> Training budget relatively similar to Gemma 2
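
For the ZeRO-3 point, a back-of-the-envelope sketch of per-GPU model-state memory under the usual mixed-precision Adam accounting (2 bytes per parameter for weights, 2 for gradients, 12 for optimizer state). The model size and data-parallel degree below are hypothetical, and activations and temporary buffers are ignored.

```python
def model_state_bytes_per_gpu(num_params: int, dp_size: int, zero_stage: int = 3) -> float:
    """Bytes of weights + gradients + optimizer state held by each data-parallel rank.

    Assumes mixed-precision Adam: 2 B weights, 2 B gradients, 12 B optimizer
    state (fp32 master weights, momentum, variance) per parameter.
    """
    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params
    if zero_stage >= 1:   # ZeRO-1 shards optimizer state
        optim /= dp_size
    if zero_stage >= 2:   # ZeRO-2 also shards gradients
        grads /= dp_size
    if zero_stage >= 3:   # ZeRO-3 also shards the weights
        params /= dp_size
    return params + grads + optim


if __name__ == "__main__":
    n, dp = 1_000_000_000, 8  # hypothetical 1B-parameter model on 8 GPUs
    for stage in range(4):
        gib = model_state_bytes_per_gpu(n, dp, stage) / 2**30
        print(f"ZeRO-{stage}: {gib:.1f} GiB per GPU")
```

At stage 3 the whole 16 bytes per parameter is divided by the data-parallel degree, which is why model states alone may not force TP/PP at moderate model sizes.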