As a Frenchman and a European, I find the outcome quite sad.
The top 10 is made up exclusively of US 🇺🇸 and Chinese 🇨🇳 companies (after the recent wave of great Chinese LLM releases, like the Qwen2.5 series), with the notable exception of Mistral AI 🇫🇷.
American companies are making fast progress, Chinese ones even faster. Europe is at risk of being left behind. And the EU AI Act hasn't even come into force yet to slow down the EU market. We need to wake up 😬
⚠️ Caution: this Chatbot Arena Elo ranking is not the most accurate, especially at high scores like these, because LLM makers can game it to some extent.
Evaluating systems is critical during prototyping and in production, and LLM-as-a-judge has become a standard technique to do it.
First, what is "LLM-as-a-judge"? 👉 It's a very useful technique for evaluating LLM outputs. Whenever an output cannot be scored with deterministic criteria, like the "politeness" of an LLM output, or how faithful it is to an original source, you can use an LLM judge instead: prompt another LLM with "Here's an LLM output, please rate it on criterion {criterion} on a scale of 1 to 5", then parse the number from its output, and voilà, you get your score.
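In practice this fits in a few lines of code. Here's a minimal sketch; the judge model, client and prompt wording are illustrative choices, not a prescribed setup:

```python
import re
from openai import OpenAI  # any chat-completion client would do

client = OpenAI()  # assumes OPENAI_API_KEY is set


def judge(output: str, criterion: str) -> int | None:
    """Ask a judge LLM to rate `output` on `criterion`, from 1 to 5."""
    prompt = (
        f"Here's an LLM output, please rate it on the criterion "
        f"'{criterion}' on a scale of 1 to 5. Answer with the number only.\n\n"
        f"Output to evaluate:\n{output}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice, any capable model works
        messages=[{"role": "user", "content": prompt}],
    )
    # Parse the first 1-5 digit found in the judge's reply.
    match = re.search(r"[1-5]", response.choices[0].message.content)
    return int(match.group()) if match else None


score = judge("Dear customer, we deeply regret the delay...", "politeness")
```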
🧐 But who judges the judge? How can you make sure your LLM-judge is reliable? You can have a specific dataset annotated with scores provided by human judges, and compare how LLM-judge scores correlate with human judge scores.
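For example, a quick sanity check can compute the rank correlation between the two sets of scores (the numbers below are made up for illustration):

```python
# Check agreement between the LLM judge and human annotators
# on a small annotated set.
from scipy.stats import spearmanr

human_scores = [5, 4, 2, 3, 1, 4, 5, 2]
llm_judge_scores = [5, 4, 3, 3, 1, 3, 5, 2]

rho, p_value = spearmanr(human_scores, llm_judge_scores)
print(f"Spearman correlation: {rho:.2f} (p={p_value:.3f})")
# A high rank correlation suggests the judge can stand in for human raters
# on this criterion; a low one means its scores shouldn't be trusted yet.
```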
📊 Before even running that benchmark yourself, there's a new option to get you started: a leaderboard that measures how well different models perform as judges!
And the outcome is surprising: models rank in quite a different order from what we're used to in general leaderboards. Probably some have much better bias mitigation than others!
The cleaning process consists of:
- Joining the separate splits together / adding a split column
- Converting string messages into lists of structs
- Removing empty system prompts
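As a rough illustration, here's what such a cleaning pass could look like with the Hugging Face datasets library; the split handling and column names are assumptions, not the exact script used:

```python
import json
from datasets import load_dataset, concatenate_datasets

raw = load_dataset("microsoft/orca-agentinstruct-1M-v1")

# 1) Join the separate splits together, keeping track of the origin split.
parts = [split.add_column("split", [name] * len(split))
         for name, split in raw.items()]
dataset = concatenate_datasets(parts)

# 2) Convert string-encoded messages into lists of structs (dicts).
dataset = dataset.map(lambda row: {"messages": json.loads(row["messages"])})

# 3) Remove empty system prompts from each conversation.
def drop_empty_system(row):
    row["messages"] = [
        m for m in row["messages"]
        if not (m["role"] == "system" and not m["content"].strip())
    ]
    return row

dataset = dataset.map(drop_empty_system)
```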
🔍 Meta teams use a fine-tuned Llama model to fix production issues in seconds
One of Meta's engineering teams shared how they use a fine-tuned small Llama (Llama-2-7B, so not even a very recent model) to identify the root cause of production issues with 42% accuracy.
🤔 Isn't 42% too low?
➡️ Usually, whenever there's an issue in production, engineers dive into recent code changes to find the offending commit. At Meta's scale (thousands of daily changes), this is like finding a needle in a haystack.
💡 So when the LLM-based suggestion is right, it cuts incident resolution time from hours to seconds!
How did they do it?
🔄 Two-step approach:
‣ Heuristics (code ownership, directory structure, runtime graphs) reduce thousands of potential changes to a manageable set
‣ Fine-tuned Llama 2 7B ranks the most likely culprits
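Meta hasn't published the exact heuristics or prompts, but the shape of the pipeline could look something like this hypothetical sketch:

```python
# Hypothetical sketch of the two-step shape described above; the heuristics,
# data model and prompt are illustrative, not Meta's actual implementation.
from dataclasses import dataclass


@dataclass
class CodeChange:
    id: str
    author: str
    directory: str
    summary: str


def heuristic_filter(changes: list[CodeChange],
                     incident_dirs: set[str],
                     owning_team: set[str]) -> list[CodeChange]:
    """Step 1: cheap signals (code ownership, directory overlap, ...)
    shrink thousands of changes down to a handful of candidates."""
    return [c for c in changes
            if c.directory in incident_dirs or c.author in owning_team]


def build_ranking_prompt(incident_report: str,
                         candidates: list[CodeChange]) -> str:
    """Step 2: format the shortlist for the fine-tuned model, which is
    asked to rank the most likely root-cause candidates."""
    listing = "\n".join(f"{i + 1}. [{c.id}] {c.summary}"
                        for i, c in enumerate(candidates))
    return (f"Incident report:\n{incident_report}\n\n"
            f"Candidate code changes:\n{listing}\n\n"
            "Rank the changes from most to least likely root cause.")
```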
🎓 Training pipeline:
‣ Continued pre-training on Meta's internal docs and wikis
‣ Supervised fine-tuning on past incident investigations
‣ Training data mimicked real-world constraints (2-20 potential changes per incident)
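As a purely illustrative sketch of those two stages with Hugging Face transformers (model choice, file names and hyperparameters are all placeholders):

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)


def causal_lm_train(dataset, output_dir):
    """Plain next-token-prediction training, reused for both stages."""
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                               per_device_train_batch_size=1),
        train_dataset=dataset.map(tokenize, batched=True,
                                  remove_columns=dataset.column_names),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()


# Stage 1: continued pre-training on internal docs/wikis (raw text).
docs = load_dataset("json", data_files="internal_docs.jsonl")["train"]
causal_lm_train(docs, "continued_pretraining")

# Stage 2: supervised fine-tuning on past investigations, each example
# formatted as "incident + 2-20 candidate changes -> culprit change".
incidents = load_dataset("json", data_files="incident_sft.jsonl")["train"]
causal_lm_train(incidents, "sft")
```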
🔮 Now future developments await:
‣ Language models could handle more of the incident response workflow (runbooks, mitigation, post-mortems)
‣ Improvements in model reasoning should boost accuracy further
Athene v2 Chat & Agent by NexusFlow - SoTA general LLMs fine-tuned from Qwen 2.5 72B, excelling at chat + function calling / JSON / agents - Nexusflow/athene-v2-6735b85e505981a794fb02cc
Orca Agent Instruct by Microsoft - 1 million instruction pairs covering text editing, creative writing, coding, reading comprehension, etc. - permissively licensed - microsoft/orca-agentinstruct-1M-v1