AutoBench Leaderboard
Multi-run AutoBench leaderboard with historical navigation
More details for this benchmark, including score, cost, and latency data, are available in the AutoBench Leaderboard.
The era of generic "chatbot" evaluation is ending. Professionals don't just need a model that can chat; they need one that can think like an expert in their specific field.
Today, in collaboration with **EVJA**, a global leader in agronomic intelligence, we are thrilled to announce the release of a groundbreaking new AutoBench run focused entirely on Agronomy.
Released simultaneously with EVJA's new Agronomic Intelligence Platform, this benchmark is designed to guide the use of LLMs by professionals in the agricultural sector, testing models on everything from crop disease pathology and soil chemistry to farm logistics and carbon footprint analysis.
This is the first vertical, corporate, domain-specific LLM benchmark that we have made public, a testament to the exceptional flexibility of AutoBench in scoring LLM performance in any domain.
General-purpose AI developers cannot effectively judge specialized agricultural knowledge alone. To ensure this benchmark reflects the real-world needs of the industry, we partnered with EVJA.
Based in Naples and Wageningen (the "Food Valley" of the Netherlands), EVJA has been a pioneer in using sensors and predictive models to optimize crop management since 2015. Their team provided the technical and agronomic support necessary to design the test cases, ensuring that our "judge" models were evaluating data that matters to actual farmers and agronomists.
"Many agritech companies boast about AI solutions, but applications have been limited until now. Our new platform puts AI at the center of agronomic operations," says Davide Parisi, CEO of EVJA. "This benchmark is the natural extension of that vision, helping the industry navigate the complex landscape of LLMs."
Explore the benchmark details in the AutoBench Leaderboard.
To tackle the complexity of modern agriculture, we pushed AutoBench to new limits. This Agronomy run represents our most comprehensive evaluation to date:
Agriculture isn't a monolith. With EVJA's guidance, we structured the benchmark around four distinct professional personas to mirror the diverse users of their new Agronomic Intelligence Platform:
The results are in, and they paint a fascinating picture for the sector.
When it comes to pure reasoning and domain knowledge in the proprietary space, OpenAI remains the one to beat.
The biggest shock of this run comes from the open-weight world. Mistral Large 2512 has emerged as the top open-source model, performing on par with SOTA (State of the Art) proprietary models from Anthropic and Google (Gemini).
For developers building on-premise ag-tech solutions or privacy-focused applications, Mistral Large 2512 is now the gold standard.
For high-volume agentic workflows, where cost is as critical as accuracy, we found two exceptional performers that break the price-performance barrier:
These models posted remarkable results, delivering SOTA-level performance at costs two orders of magnitude (roughly 100 times) lower than the proprietary giants.
LLM adoption now reaches every sector. More and more companies and professionals in agriculture are using AI for a range of activities, from drafting documents to data analysis and agronomic consulting.
By combining AutoBench's scientifically validated "Collective-LLM-as-a-Judge" methodology (shown to correlate at ~92% with leading benchmarks such as LMArena and the Artificial Analysis Intelligence Index) with EVJA's deep domain authority, we are providing the transparency needed to make those adoptions successful.
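To make the idea of a collective judge and a benchmark correlation more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not AutoBench's actual implementation: the plain averaging of judge scores, the function names (`collective_scores`, `pearson`), the judge and model names, and every number below are hypothetical placeholders, not results from this run.

```python
from math import sqrt
from statistics import mean


def collective_scores(judgments):
    """Average each model's score across all judges.

    `judgments` maps judge name -> {model name -> score}.
    Simple unweighted averaging is an assumption made for this sketch.
    """
    per_model = {}
    for scores in judgments.values():
        for model, score in scores.items():
            per_model.setdefault(model, []).append(score)
    return {model: mean(values) for model, values in per_model.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical judge scores (placeholders, not benchmark data).
judgments = {
    "judge_a": {"model_x": 4.0, "model_y": 3.0, "model_z": 2.0},
    "judge_b": {"model_x": 5.0, "model_y": 3.0, "model_z": 1.0},
}
scores = collective_scores(judgments)

# Hypothetical scores for the same models on some external reference benchmark.
reference = {"model_x": 60.0, "model_y": 52.0, "model_z": 40.0}

# Align both score lists on the same model order before correlating.
models = sorted(scores)
r = pearson([scores[m] for m in models], [reference[m] for m in models])
print(f"collective scores: {scores}")
print(f"correlation with reference benchmark: {r:.3f}")
```

A correlation close to 1 would mean the collective judges rank models much as the reference benchmark does, which is the kind of agreement the ~92% figure above refers to.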
Dive into the interactive leaderboard to filter by model, check the price-performance graphs, and see how your favorite models rank on specific agricultural tasks.
View the AutoBench Agronomy Leaderboard
Learn more about AutoBench
Contribute to AutoBench open-source Hugging Face resources
Learn more about EVJA's new Agronomic Intelligence Platform