We aim to identify the right metrics for evaluating Judge LLMs and to understand their sensitivity to prompt guidelines, engineering, and specificity. With this paper, we want to raise caution ⚠️ against blindly using LLMs as a human proxy.
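One illustration of why the metric choice matters (a minimal sketch with made-up labels, not the paper's data): on a skewed label distribution, raw percent agreement can look strong while a chance-corrected statistic such as Cohen's kappa exposes a degenerate judge.

```python
# Sketch with hypothetical labels: raw agreement vs. chance-corrected
# agreement when comparing a Judge LLM's verdicts to human annotations.
from sklearn.metrics import cohen_kappa_score

human = ["good"] * 8 + ["bad"] * 2   # skewed human labels (made up)
judge = ["good"] * 10                # a degenerate judge that always says "good"

accuracy = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa_score(human, judge)

print(f"raw agreement: {accuracy:.0%}")  # 80% -- looks strong
print(f"Cohen's kappa: {kappa:.2f}")     # 0.00 -- no better than chance
```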
Together MoA is a really interesting approach built on open-source models!
"We introduce Mixture of Agents (MoA), an approach to harness the collective strengths of multiple LLMs to improve state-of-the-art quality. And we provide a reference implementation, Together MoA, which leverages several open-source LLM agents to achieve a score of 65.1% on AlpacaEval 2.0, surpassing prior leader GPT-4o (57.5%)."