🌁#90: Why AI’s Reasoning Tests Keep Failing Us
We discuss the problems with AI benchmarks – saturation above all – and explore potential solutions. And as always, we offer a curated list of relevant news and important papers to keep you informed.
--
This Week in Turing Post:
- Wednesday, AI 101, Techniques: Everything you need to know about Knowledge Distillation
- Friday, Agentic Workflow: Action and Tools
🔳 Turing Post is on 🤗 Hugging Face as a resident -> click to follow!
The race to build ever-smarter AI has led to a paradox: the benchmarks we use to measure progress are breaking down almost as fast as the models improve. Just a few years ago, the BIG-Bench Hard (BBH) dataset was a gold standard for evaluating reasoning in large language models (LLMs). Today, it’s essentially obsolete. The latest AI models – GPT-4o, Gemini, DeepSeek – have aced it, reducing what was once a rigorous test into a mere formality. In response, researchers have introduced BIG-Bench Extra Hard (BBEH), a new benchmark designed to push AI reasoning to its limits. But if history is any guide, BBEH too will be “solved” sooner than we expect. And then what?
This cycle of benchmark saturation is one of the biggest hurdles in AI evaluation. Every time researchers devise a new test, models quickly adapt, often through methods that have little to do with true reasoning. AI labs optimize their models to dominate the leaderboard, fine-tuning responses to fit benchmark formats rather than improving genuine cognitive abilities. This is a classic case of Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure.
Beyond saturation, there’s an even bigger problem: we’re measuring the wrong things. Most reasoning benchmarks heavily favor math and coding tasks because they have clear right and wrong answers. But being able to solve an algebra problem doesn’t mean an AI can navigate real-world ambiguity, make causal inferences, or understand human motivations. A model that can write perfect Python scripts might still fail at answering a nuanced ethical dilemma or interpreting sarcasm in a conversation. Yet, because math and programming are easy to score, they continue to dominate AI evaluations, giving us a skewed sense of progress.
Even when benchmarks try to cover broader reasoning skills, they face a different issue: models exploit superficial shortcuts instead of truly reasoning through problems. AI is great at pattern recognition, often identifying statistical cues in datasets rather than solving tasks in a human-like way. For example, if a benchmark always frames logical deduction problems in a similar format, the model can memorize patterns instead of actually performing reasoning. This illusion of competence is one reason LLMs still stumble when presented with unfamiliar real-world challenges.
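One cheap probe for this failure mode: shuffle the answer options of a multiple-choice item and see whether accuracy survives the reordering. The sketch below is our own illustration, not part of any specific benchmark suite – the `model_answer` callable is a hypothetical stand-in for whatever returns a model's chosen option index:

```python
import random

def shuffle_choices(question, choices, answer_idx, seed):
    """Return the same item with options re-ordered, plus the new gold index."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    new_answer = order.index(answer_idx)  # where the gold option landed
    prompt = question + "\n" + "\n".join(
        f"{chr(65 + i)}. {c}" for i, c in enumerate(shuffled)
    )
    return prompt, new_answer

def positional_gap(model_answer, items, n_perms=5):
    """Accuracy spread across option orderings for (question, choices, gold) items."""
    accs = []
    for seed in range(n_perms):
        correct = 0
        for q, choices, gold in items:
            prompt, new_gold = shuffle_choices(q, choices, gold, seed)
            correct += int(model_answer(prompt) == new_gold)
        accs.append(correct / len(items))
    return max(accs) - min(accs)  # a large gap suggests a positional shortcut
```

If accuracy swings widely as the options move around, the model was keying on position or format, not on the underlying logic.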
The implications of weak evaluation methods extend beyond research labs. AI models are already being integrated into critical applications – healthcare, legal analysis, customer service – where reasoning skills matter. If our benchmarks don’t accurately reflect real-world reasoning demands, we risk deploying models that appear highly capable but fail in unpredictable and costly ways. Worse, businesses and policymakers may overestimate AI’s cognitive abilities based on misleading benchmark scores, leading to misplaced trust in automated decision-making.
So how do we build better benchmarks? The answer lies in diversity, adaptability, and real-world testing. Instead of relying on fixed datasets that quickly become outdated, AI evaluations should incorporate dynamic and adversarial testing, where new, unseen problems continuously challenge models. Benchmarks must also expand beyond math and coding to cover commonsense reasoning, causal inference, and ethical decision-making. Finally, real-world performance needs to be the ultimate metric – how well does an AI assist doctors, guide autonomous systems, or navigate complex social interactions?
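What might "dynamic" evaluation look like in practice? A minimal sketch, assuming nothing more than a templated item generator – the puzzle family and names below are invented for illustration and have nothing to do with BBEH:

```python
import random

NAMES = ["Ada", "Bo", "Cruz", "Dara", "Eli"]

def fresh_item(rng):
    """Sample a transitive-ordering puzzle with a randomized cast each time."""
    a, b, c = rng.sample(NAMES, 3)
    question = (f"{a} is taller than {b}. {b} is taller than {c}. "
                "Who is the shortest?")
    return question, c

def dynamic_eval(model_answer, n_items=100, seed=None):
    """Score a model on items generated on the fly, so there is no fixed
    dataset to memorize; `model_answer` is a hypothetical callable."""
    rng = random.Random(seed)
    correct = sum(
        model_answer(q).strip() == gold
        for q, gold in (fresh_item(rng) for _ in range(n_items))
    )
    return correct / n_items
```

Because every run draws fresh surface forms, a leaderboard built this way degrades much more slowly than one built on a frozen test set.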
BBEH is a step in the right direction, but it’s just the latest chapter in a long story. The challenge is to make benchmarks not only harder, but also smarter. If AI is to truly reason, we need to rethink how we test it. Otherwise, we’ll keep mistaking test-taking ability for intelligence – and that’s a dangerous illusion to fall for.
Curated Collections
We are reading/watching:
- A very insightful read from Nathan Lambert – “Character Training: Understanding and Crafting a Language Model’s Personality.” In our series, we initially referred to this as Profiling. In hindsight, that was not the best term! While Character Training doesn’t fully capture the complexity of profiling, it is the more commonly used phrase at the moment.
Recommendation from an AI practitioner
- Imagine having a precision scalpel for dissecting LLMs – open-sourced LLM-Microscope reveals token nonlinearity, memory depth, layer insights, and representation complexity.
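For orientation, here is roughly the kind of per-layer probing such tools automate, written against the standard Hugging Face `transformers` API rather than LLM-Microscope's own interface – take it as a minimal sketch, not the tool's usage:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # any small causal LM works for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

inputs = tok("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states is a tuple of (n_layers + 1) tensors, each [1, seq, dim]
states = torch.stack(out.hidden_states)           # [L+1, 1, seq, dim]
drift = (states[1:] - states[:-1]).norm(dim=-1)   # how far each token moves per layer
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for i, t in enumerate(tokens):
    print(f"{t:>8}  layer-to-layer drift: {drift[:, 0, i].round(decimals=2).tolist()}")
```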
News from The Usual Suspects ©
DeepSeek’s Six Extraordinary Deliveries during #OpenSourceWeek
- DeepSeek delivered six major open-source AI optimizations this week, showcasing efficiency and scalability in LLM development. FlashMLA (already over 11k stars on GitHub!) optimized Multi-head Latent Attention (MLA) for Hopper GPUs, achieving 3000 GB/s memory bandwidth and 580 TFLOPS of compute. DeepEP introduced a new MoE communication library to improve expert-model efficiency. DeepGEMM, an FP8 GEMM library, hit 1350+ TFLOPS, outperforming expert-tuned kernels. Optimized parallelism strategies enhanced workload distribution in large-scale AI training, while the Fire-Flyer File System (3FS) streamlined high-performance AI data management. On the sixth day, they published a deep dive into the DeepSeek-V3/R1 Inference System. Worth reading!
- DeepSeek’s advancements in AI have also brought media attention to distillation. Funny to see it making headlines – guess even optimization tricks are getting their 15 minutes of fame. We will be covering knowledge distillation (KD) on Wednesday!
Anthropic Levels Up: Smarter AI, Bigger Deals, and Total Transparency
- Anthropic is making big moves. With Claude 3.7 Sonnet, users can now control how deeply it thinks – whether solving complex problems or even playing Pokémon.
- Meanwhile, the new Transparency Hub lays out safety measures and governance policies as AI regulations tighten. And in science? Anthropic is teaming up with the U.S. Department of Energy to test AI’s role in national security and research.
- All this momentum, plus a fresh $3.5B Series E at a staggering $61.5B valuation. Dario and Daniela Amodei just appeared in The Times, prophesying that “By next year, AI could be smarter than all humans”.
Google’s AI Playbook: Harder Work, Smarter Code, and an AI Co-Scientist
- Silicon Valley’s AI arms race is pushing limits – both human and machine. Sergey Brin wants 60-hour workweeks for Google's Gemini AI team, calling it the "sweet spot of productivity."
- To get more developers onboard, Google is making AI-powered coding assistance free for all with Gemini Code Assist, offering up to 180,000 completions per month – a massive leap over existing tools. Now available in VS Code, JetBrains, and GitHub, it doesn’t just write code but also reviews pull requests and adapts to custom style guides.
- And in the lab? Google’s AI co-scientist, built on Gemini 2.0, is generating and refining scientific hypotheses – already making breakthroughs in biomedical research by uncovering new drug candidates and gene transfer mechanisms. Maybe it can also figure out how to make humans work without rest, just like AI.
Another day, another quantum achievement
- Researchers from AWS Center for Quantum Computing developed a hardware-efficient quantum error correction (QEC) scheme using concatenated bosonic qubits. Their system integrates bosonic cat qubits with a distance-5 repetition code, reducing the overhead required for fault-tolerant quantum computing. The approach suppresses bit-flip errors passively while an outer repetition code corrects phase-flip errors.
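To make the outer code concrete: a distance-5 repetition code is just five noisy copies of one logical bit decoded by majority vote. The toy sketch below (ours, and purely classical) shows why that helps once the dominant error type has been made rare:

```python
import random

def encode(bit, distance=5):
    """Five redundant copies of one logical bit."""
    return [bit] * distance

def noisy(code, p, rng):
    """Flip each copy independently with probability p (the residual error channel)."""
    return [b ^ (rng.random() < p) for b in code]

def decode(code):
    """Majority vote over the copies."""
    return int(sum(code) > len(code) // 2)

rng = random.Random(0)
p, trials = 0.05, 100_000
failures = sum(decode(noisy(encode(1), p, rng)) != 1 for _ in range(trials))
print(f"physical error rate {p} -> logical error rate ~ {failures / trials:.5f}")
# A distance-5 code only fails when 3+ of 5 copies flip: roughly C(5,3) * p^3 ~ 1.2e-3
```

The cat qubits’ job in the AWS scheme is exactly to make the other error axis (bit flips) rare enough that this single-axis outer code suffices.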
Models to pay attention to:
- NeoBERT: A Next-Generation BERT – Modernizes bidirectional encoders with architectural upgrades (RoPE, SwiGLU, RMSNorm) and extended context length, surpassing BERT-large and RoBERTa-large while improving inference speed (two of those upgrades are sketched right after this list)
- IBM Granite 3.2: Reasoning, Vision, Forecasting, and More – Introduces open-source models with enhanced reasoning, vision-language, and forecasting capabilities, outperforming larger proprietary models in multiple domains
- Kanana: Compute-Efficient Bilingual Language Models – Optimizes Korean-English bilingual models for lower computational cost while outperforming LLaMA 3.1 70B on Korean benchmarks
- SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers – Demonstrates that structured reasoning boosts LLM accuracy on polynomial nonnegativity problems, outperforming much larger models with minimal computation
- Conversational Speech Model (CSM) – an end-to-end multimodal approach leveraging transformers to generate expressive, context-aware speech by integrating text and audio representations, optimizing latency, and advancing conversational AI beyond traditional text-to-speech methods →read their blog
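Since two of NeoBERT’s upgrades now show up in nearly every modern architecture, here is a minimal PyTorch sketch of RMSNorm and SwiGLU – our illustration, not NeoBERT’s actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """LayerNorm without mean-centering or bias: rescale by the RMS only."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward: silu(x W_gate) * (x W_up), then project back down."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 768)
print(SwiGLU(768, 2048)(RMSNorm(768)(x)).shape)  # torch.Size([2, 16, 768])
```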
The freshest research papers, categorized for your convenience
There were quite a few top research papers this week; we mark them with 🌟 in each section.
- 🌟 Beyond Release: Access Considerations for Generative AI Systems – Analyzes the practical challenges of open-access AI models, including API pricing, hosting costs, and accessibility barriers
LLM Optimization and Training Stability
- 🌟 SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution – Introduces SWE-RL, a reinforcement learning framework improving LLM reasoning for software engineering tasks, surpassing supervised fine-tuning methods
- Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models – Develops a technique to stabilize LLM training by separating weight matrix scale from distribution, improving gradient stability and convergence speed
- Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment – Enhances LoRA efficiency using an optimized Mixture-of-Experts framework with adaptive SVD priors, outperforming traditional fine-tuning methods
- The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve? – Suggests LLM compression should focus on preserving reasoning and retrieval capabilities rather than mere token efficiency
- 🌟 LongRoPE2: Near-Lossless LLM Context Window Scaling – Proposes an advanced RoPE rescaling method that extends LLM context windows to 128K tokens while maintaining short-context performance
Efficiency and Optimization
- Thus Spake Long-Context Large Language Model – Explores advancements in long-context LLMs, detailing improvements in KV cache optimization, memory management, and inference efficiency
- 🌟 Chain of Draft: Thinking Faster by Writing Less – Introduces a concise reasoning method that reduces token usage and latency without sacrificing accuracy
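The Chain-of-Draft trick is small enough to show in full. The snippet below paraphrases the paper’s instruction (the exact wording is ours, and the sample model output is hypothetical):

```python
# Chain-of-Draft prompting in one constant: ask for terse per-step drafts
# instead of verbose chain-of-thought. Wording paraphrases the paper's prompt.
COD_SYSTEM = (
    "Think step by step, but keep only a minimum draft for each thinking step, "
    "five words at most. Return the final answer after ####."
)

question = "A jug holds 4 cups. Jack fills 7 jugs. How many cups in total?"
# A compliant model might reply:
#   4 cups per jug
#   7 jugs filled
#   4 * 7 = 28
#   #### 28
# Same answer as a paragraph-long chain of thought, at a fraction of the tokens.
```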
Reasoning and Multi-Step Problem Solving
- Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? – Introduces DeltaBench, a dataset revealing LLMs' struggles with detecting errors in multi-step reasoning processes
- Self-rewarding Correction for Mathematical Reasoning – Develops a reinforcement learning-based correction framework that enhances LLM accuracy in solving mathematical problems
- 🌟 TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding – Introduces an agent that generates animated multimodal reasoning content for STEM topics
- Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation – Evaluates LLMs' ability to generate counterexamples, revealing weaknesses in self-correction and verification
RAG and Information Processing
- Rank1: Test-Time Compute for Reranking in Information Retrieval – Introduces a reranking method that enhances retrieval relevance by leveraging test-time compute
- TeleRAG: Efficient Retrieval-Augmented Generation Inference With Lookahead Retrieval – Reduces RAG inference latency by prefetching relevant data during LLM generation, optimizing retrieval efficiency
- 🌟 LettuceDetect: A Hallucination Detection Framework for RAG Applications – Develops a lightweight hallucination detection system that outperforms larger models while maintaining high processing speed
AI Agents and Automated Scientific Experimentation
- Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents – Introduces an AI agent enforcing rigor in scientific experimentation through automated hypothesis testing and result validation
- 🌟 Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs – Proposes Knowledge Units (KUs) as a structured method for AI-driven scientific knowledge extraction while avoiding copyright issues
Reinforcement Learning and Policy Optimization
- FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users – Introduces a few-shot learning method to personalize LLMs based on synthetic user preference data
- Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance – Proposes a reinforcement learning optimization method that improves efficiency while reducing computational overhead
Security and AI Alignment
- Guardians of the Agentic System: Preventing Many-Shots Jailbreak with Agentic System – Develops a security framework to prevent AI jailbreak attempts through multi-agent alignment techniques
- On Relation-Specific Neurons in Large Language Models – Investigates relation-specific neurons in LLMs, identifying their role in structured knowledge recall and potential interference effects
Compression, Inference, and Cost Optimization
- 🌟 Optimal Brain Apoptosis – Proposes a novel neural pruning technique that significantly accelerates inference while preserving accuracy
- Towards Optimal Multi-Draft Speculative Decoding – Optimizes speculative decoding in LLMs by improving draft verification efficiency, reducing inference costs
That’s all for today. Thank you for reading!
Please share this article with colleagues if it can help them deepen their understanding of AI and stay ahead of the curve.