vLLM is one of the most popular local inference solutions, and the community had been asking us to integrate it: after a heavy refactoring of our LLM classes, we've just released smolagents 1.11.0, with a brand new VLLMModel class.
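If you want to try it, here's a minimal sketch. It assumes the `VLLMModel` constructor mirrors the other smolagents model classes (a `model_id` pointing to a Hugging Face checkpoint), so check the release notes for the exact signature.

```python
# Minimal sketch: running a CodeAgent on a locally served vLLM model.
# Assumes smolagents >= 1.11.0; the exact VLLMModel arguments may differ.
from smolagents import CodeAgent, VLLMModel

model = VLLMModel(model_id="Qwen/Qwen2.5-7B-Instruct")  # model is loaded through vLLM
agent = CodeAgent(tools=[], model=model)

agent.run("What is the 10th Fibonacci number?")
```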
It's beating Claude 3.7 on (competitive) programming, a domain where Anthropic has historically been really strong, and it's getting close to o1-mini/R1 on olympiad-level coding with just 7B parameters!
And the best part is that we're open-sourcing everything about it: its training dataset, the new IOI benchmark, and more, in our Open-R1 progress report #3: https://huggingface.co/blog/open-r1/update-3
The Gemma 3 family is out! I was reading the tech report, and this section was really interesting to me from a methods/scientific-fairness point of view.
Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**. (Which everybody does, but people usually don't say so.)
For a tech report, it makes a lot of sense to report model performance when the model is used optimally! On leaderboards, on the other hand, the comparison is apples to apples, but potentially suboptimal for a given model family (just as some users interact sub-optimally with models).
The report also contains a cool section (6) on training-data memorization rates! It's important to check whether your model will output training data it has seen verbatim: always an issue for privacy/copyright/... but also very much for evaluation!
Because if your model knows its evals by heart, you're not testing for generalization.
If you've ever wondered which LLM is best for powering agents, we've just made a leaderboard that ranks them all! Built with @albertvillanova, it ranks LLMs powering a smolagents CodeAgent on subsets of various benchmarks.
GPT-4.5 comes out on top, even beating reasoning models like DeepSeek-R1 or o1. And Claude-3.7-Sonnet is a close second!
The leaderboard also lets you show the scores of vanilla LLMs (without any agentic setup) on the same benchmarks: this highlights the huge improvements brought by agentic setups.
(Note that results will be added manually, so the leaderboard might not always have the latest LLMs)
New smolagents update: Safer Local Python Execution!
With the latest release, we've added security checks to the local Python interpreter: every evaluation is now analyzed for dangerous builtins, modules, and functions.
Here's why this matters and what you need to know!
1️⃣ Why is local execution risky? AI agents that run arbitrary Python code can unintentionally (or maliciously) access system files, run unsafe commands, or exfiltrate data.
2️⃣ New safety layer in smolagents: we now inspect every return value during execution (see the usage sketch after this thread).
- Allowed: safe built-in types (e.g., numbers, strings, lists)
- Blocked: dangerous functions/modules (e.g., os.system, subprocess, exec, shutil)
4️⃣ Security disclaimer: despite these improvements, local Python execution is NEVER 100% safe. If you need true isolation, use a remote sandboxed executor like Docker or E2B.
5️⃣ The best practice: use sandboxed execution. For production-grade AI agents, we strongly recommend running code in a Docker or E2B sandbox to ensure complete isolation.
6️⃣ Upgrade now and stay safe! Check out the latest smolagents release and start building safer AI agents today.
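Here's the usage sketch mentioned above: a minimal view of the guarded local interpreter from the user side. It assumes the documented `additional_authorized_imports` argument of `CodeAgent` and uses `HfApiModel` as a stand-in model; exact names and behaviour may vary across smolagents versions.

```python
# Minimal sketch of the local-execution guardrails (names per the smolagents docs).
# By default, only a small whitelist of imports is allowed, and dangerous
# builtins/modules are blocked at evaluation time.
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(
    tools=[],
    model=HfApiModel(),
    additional_authorized_imports=["math", "statistics"],  # opt in to extra safe modules
)

# If the generated code tries `import os`, `subprocess.run(...)`, or `exec(...)`,
# the local interpreter raises an error instead of touching your system.
agent.run("Compute the standard deviation of [1, 2, 3, 4, 5].")
```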
Big news for AI agents! With the latest release of smolagents, you can now securely execute Python code in sandboxed Docker or E2B environments.
Here's why this is a game-changer for agent-based systems:
1️⃣ Security first: running AI agents in unrestricted Python environments is risky! With sandboxing, your agents are isolated, preventing unintended file access, network abuse, or system modifications.
2️⃣ Deterministic and reproducible runs: by running agents in containerized environments, you ensure that every execution happens in a controlled and predictable setting, with no more environment mismatches or dependency issues!
3️⃣ Resource control and limits: Docker and E2B allow you to enforce CPU, memory, and execution-time limits, so rogue or inefficient agents don't spiral out of control.
4️⃣ Safer code execution in production: deploy AI agents confidently, knowing that any generated code runs in an ephemeral, isolated environment, protecting your host machine and infrastructure.
5️⃣ Easy to integrate: with smolagents, you can simply configure your agent to use Docker or E2B as its execution backend (see the sketch after this list), with no need for complex security setups!
6️⃣ Perfect for autonomous AI agents: if your AI agents generate and execute code dynamically, this is a must-have to avoid security pitfalls while enabling advanced automation.
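Here's the configuration sketch mentioned above, assuming the `executor_type` argument from the smolagents docs; E2B additionally requires an `E2B_API_KEY` in your environment, and Docker requires a running Docker daemon.

```python
# Minimal sketch: swapping the local interpreter for a sandboxed executor.
# Assumes the documented `executor_type` argument ("docker" or "e2b").
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(
    tools=[],
    model=HfApiModel(),
    executor_type="docker",  # or "e2b": generated code runs in an isolated sandbox
)

agent.run("Parse the CSV string '1,2,3\n4,5,6' and sum each column.")
```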
Getting WebRTC and WebSockets right in Python is very tricky. If you've tried to wrap an LLM in a real-time audio layer, then you know what I'm talking about.
That's where FastRTC comes in! It makes WebRTC and WebSocket streams super easy, with minimal code and overhead.
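For a taste of the API, here's the kind of minimal echo handler FastRTC is built around (based on its quickstart; check the docs for the exact handler signature):

```python
# Minimal FastRTC sketch: stream microphone audio to the server and echo it back.
# The handler receives (sample_rate, numpy_array) chunks and yields audio to play.
from fastrtc import ReplyOnPause, Stream

def echo(audio):
    # A real app would transcribe, call an LLM, then synthesize speech here.
    yield audio

stream = Stream(handler=ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.ui.launch()  # launches a UI that handles the WebRTC plumbing for you
```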
We now have a Deep Research for academia: SurveyX automatically writes academic surveys nearly indistinguishable from human-written ones!
Researchers from Beijing and Shanghai just published the first application of a deep research system to academia: their algorithm, given a question, can give you a survey of all papers on the subject.
To make a research survey, you generally follow two steps: preparation (collect and organize papers) and writing (outline creation, writing, polishing). The researchers followed the same two steps and automated them.
For the preparation part, the key step is finding all the important references on the given subject. The researchers first cast a wide net over all relevant papers. But then finding the really important ones is like distilling knowledge from a haystack of information. To solve this challenge, they built an "AttributeTree" object that structures the key information from citations. Ablating these AttributeTrees significantly decreased structure and synthesis scores, so they were really useful!
For the writing part, the key was to get a synthesis that's both short and accurate. This is not easy to get from LLMs! So they used methods like LLM-based deduplication to shorten the overly verbose listings made by LLMs, and RAG to grab original quotes instead of made-up ones.
As a result, their system outperforms previous approaches by far!
As assessed by LLM judges, the quality score of SurveyX even approaches that of human experts, at 4.59/5 vs 4.75/5!
Less is More for Reasoning (LIMO): a 32B model fine-tuned with 817 examples can beat o1-preview on math reasoning!
Do we really need o1's huge RL procedure to see reasoning emerge? It seems not. Researchers from Shanghai Jiao Tong University just demonstrated that carefully selected examples can boost math performance in large language models using SFT, with no huge datasets or RL procedures needed.
Their procedure allows Qwen2.5-32B-Instruct to jump from 6.5% to 57% on AIME and from 59% to 95% on MATH, while using only 1% of the data used in previous approaches.
The Less-is-More Reasoning Hypothesis:
- Minimal but precise examples that showcase optimal reasoning patterns matter more than sheer quantity
- Pre-training knowledge plus sufficient computational resources at inference time levels up math skills
Core techniques:
- High-quality reasoning chains with self-verification steps
- 817 handpicked problems that encourage deeper reasoning
- Enough inference-time computation to allow extended reasoning
Efficiency gains:
- Only 817 examples instead of 100k+
- 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data
This really challenges the notion that SFT leads to memorization rather than generalization! And it opens up reasoning to GPU-poor researchers.
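For readers who want to try the recipe, here's a minimal SFT sketch with TRL. It assumes the curated examples are published as a `GAIR/LIMO` dataset with `question`/`solution` columns (adjust to the actual release), and it glosses over the multi-GPU setup a 32B model needs in practice.

```python
# Minimal sketch of the LIMO-style SFT recipe using TRL.
# Dataset name and column names are assumptions; adapt them to the actual release.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("GAIR/LIMO", split="train")  # ~817 curated reasoning examples
dataset = dataset.map(
    lambda ex: {"text": f"Question: {ex['question']}\n\nSolution: {ex['solution']}"}
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(output_dir="qwen32b-limo-sft", max_seq_length=16384, num_train_epochs=3),
)
trainer.train()
```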
Great feature alert: you can now share agents to the Hub!
And any agent pushed to the Hub gets a cool Space interface where you can directly chat with it.
This was a real technical challenge: for instance, serializing tools to export them meant getting all the source code for a tool, verifying that it was standalone (not relying on external variables), and gathering all the packages required to make it run.
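Here's roughly what sharing looks like, assuming the `push_to_hub` / `from_hub` methods from the smolagents docs (the repo id below is a placeholder):

```python
# Minimal sketch of pushing an agent to the Hub and loading it back.
# "your-username/web-search-agent" is a placeholder repo id.
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())
agent.push_to_hub("your-username/web-search-agent")  # serializes tools, prompts, and config

# Anyone can then pull it back (or chat with it through the generated Space UI):
remote_agent = CodeAgent.from_hub("your-username/web-search-agent", trust_remote_code=True)
remote_agent.run("Who won the 2024 Nobel Prize in Physics?")
```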
"๐ฎ๐ฌ๐ฎ๐ฑ ๐๐ถ๐น๐น ๐ฏ๐ฒ ๐๐ต๐ฒ ๐๐ฒ๐ฎ๐ฟ ๐ผ๐ณ ๐๐ ๐ฎ๐ด๐ฒ๐ป๐๐": this statement has often been made, here are numbers to support it.
I've plotted the progress of AI agents on the GAIA test set, and it seems they're headed to catch up with the human baseline in early 2026.
And that progress is still driven mostly by the improvement of base LLMs: progress would be even faster with fine-tuned agentic models.