I was chatting with @peakji, one of the cofounders of Manus AI, who told me he was on Hugging Face (very cool!).
He shared an interesting insight: agentic capabilities might be more of an alignment problem than a foundational capability issue. Similar to the difference between GPT-3 and InstructGPT, some open-source foundation models are simply trained to 'answer everything in one response, regardless of the complexity of the question'; after all, that's the user preference in chatbot use cases. Just a bit of post-training on agentic trajectories can make an immediate and dramatic difference.
As a thank-you to the community, he shared 100 invite codes on a first-come, first-served basis: just use “HUGGINGFACE” to get access!
Chain-of-Thought (CoT) prompting enhances reasoning in AI models by breaking complex problems into step-by-step logical sequences. It continues to prove effective, especially in top-performing reasoning models. But there are other related methods that expand on CoT and serve different purposes. Here are 9 of them (a minimal CoT prompting sketch follows the list):
4. Chain-of-RAG -> https://huggingface.co/papers/2501.14342 Builds retrieval chains instead of retrieving all information at once, and can dynamically adjust its search process and parameters such as the number of steps
9. Chain(s)-of-Knowledge -> https://www.turingpost.com/p/cok Enhances LLMs by dynamically pulling in external knowledge to improve accuracy and reduce errors
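If you haven't tried CoT prompting yourself, here is a minimal zero-shot sketch of the basic idea: the same question is asked once directly and once with a step-by-step instruction, so you can compare the answers. The model id, client choice, and example question are my own assumptions, not from the post.

```python
# Minimal zero-shot CoT sketch. Any chat-capable LLM works;
# the model id below is a hypothetical choice.
from huggingface_hub import InferenceClient

client = InferenceClient(model="Qwen/Qwen2.5-7B-Instruct")

question = (
    "A bat and a ball cost $1.10 together. The bat costs $1.00 "
    "more than the ball. How much does the ball cost?"
)

def ask(prompt: str) -> str:
    # Send a single-turn chat request and return the model's reply.
    resp = client.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return resp.choices[0].message.content

direct = ask(question)                               # one-shot answer
cot = ask(question + "\nLet's think step by step.")  # CoT-elicited answer
print(direct, cot, sep="\n---\n")
```

The CoT variants above replace that single step-by-step pass with more structured loops (e.g. interleaving retrieval or external knowledge between reasoning steps).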
✨ Apache 2.0
✨ 8.19GB VRAM, runs on most GPUs
✨ Multi-Tasking: T2V, I2V, Video Editing, T2I, V2A
✨ Text Generation: Supports Chinese & English
✨ Powerful Video VAE: Encode/decode 1080P with temporal precision
This week we are releasing the first framework unit in the course and it’s on smolagents. This is what the unit covers:
- why should you use smolagents vs another library?
- how to build agents that use code (a minimal sketch follows this list)
- how to build multi-agent systems
- how to use vision language models for browser use
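As a taste of what the unit covers, here is a minimal code-agent sketch in the spirit of the smolagents quickstart; which model HfApiModel resolves to by default is an assumption on my part.

```python
# Minimal smolagents code agent: the agent writes and executes
# Python snippets to solve the task, calling its tools along the way.
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# HfApiModel defaults to a hosted model on the HF Inference API
# (exact default model is an assumption here).
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())

agent.run(
    "How many seconds would it take for a leopard at full speed "
    "to run through Pont des Arts?"
)
```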
The team has been working flat out on this for a few weeks, led by @sergiopaniego and supported by smolagents author @m-ric.
It started with us evaluating them on our own university-math benchmarks: U-MATH for problem-solving and μ-MATH for judging solution correctness (see the HF leaderboard: toloka/u-math-leaderboard)
tl;dr: R1 sure is amazing, but we find that it lags behind in novelty adaptation and reliability:
* performance drops when benchmarks are updated with fresh, unseen tasks (e.g. AIME 2024 -> 2025)
* the R1-o1 gap widens when evaluating niche subdomains (e.g. university-specific math instead of the more common Olympiad-style contests)
* the same holds when going into altogether unconventional domains (e.g. chess) or skills (e.g. judgment instead of problem-solving)
* R1 also runs into failure modes far more often (e.g. making illegal chess moves or falling into endless generation loops)
Our point here is not to bash DeepSeek: they've done exceptional work, and R1 is a game-changer that we have no intention of downplaying. R1's release is a perfect opportunity to study where all these models differ and to understand how to move forward from here.