Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?
Abstract
Ensembling outputs from diverse sources is a straightforward yet effective approach to boost performance. Mixture-of-Agents (MoA) is one such popular ensemble method that aggregates outputs from multiple different Large Language Models (LLMs). This paper raises the question in the context of language models: is mixing different LLMs truly beneficial? We propose Self-MoA -- an ensemble method that aggregates outputs from only the single top-performing LLM. Our extensive experiments reveal that, surprisingly, Self-MoA outperforms standard MoA that mixes different LLMs in a large number of scenarios: Self-MoA achieves a 6.6% improvement over MoA on the AlpacaEval 2.0 benchmark, and an average of 3.8% improvement across various benchmarks, including MMLU, CRUX, and MATH. Applying Self-MoA to one of the top-ranking models in AlpacaEval 2.0 directly achieves new state-of-the-art performance on the leaderboard. To understand the effectiveness of Self-MoA, we systematically investigate the trade-off between diversity and quality of outputs under various MoA settings. We confirm that MoA performance is rather sensitive to quality, and that mixing different LLMs often lowers the average quality of the models. To complement the study, we identify the scenarios where mixing different LLMs can be helpful. This paper further introduces a sequential version of Self-MoA that is capable of aggregating a large number of LLM outputs on the fly over multiple rounds, and is as effective as aggregating all outputs at once.
Community
Rethinking Mixture-of-Agents
This work investigates whether mixing different LLMs is truly beneficial.
The authors propose Self-MoA, an ensemble method that aggregates outputs from only the single top-performing LLM.
Self-MoA outperforms standard MoA (which mixes different LLMs) in a large number of scenarios.
Self-MoA leverages in-model diversity and synthesizes multiple outputs from the same model.
What's the issue with MoA?
MoA performance is sensitive to output quality, and mixing different LLMs often lowers the average quality of the models.
An implementation of single-LLM MoA is available in optillm: https://github.com/codelion/optillm/blob/main/optillm/moa.py
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning (2025)
- Dipper: Diversity in Prompts for Producing Large Language Model Ensembles in Reasoning tasks (2024)
- KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models (2024)
- LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion (2025)
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling (2025)
- Hint Marginalization for Improved Reasoning in Large Language Models (2024)
- RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment (2024)
Summary of "Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?"
Objective
This paper critically examines the effectiveness of the Mixture-of-Agents (MoA) approach, which ensembles outputs from multiple Large Language Models (LLMs). It introduces Self-MoA, an alternative method that aggregates outputs from only a single top-performing model rather than from multiple diverse models. The research challenges the assumption that diversity among proposers necessarily leads to better performance.
Methods
Comparing Mixed-MoA vs. Self-MoA:
- MoA traditionally combines responses from multiple LLMs and aggregates them into a final output.
- Self-MoA generates multiple outputs from the same high-performing LLM and aggregates those instead.
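The Self-MoA procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate` is a hypothetical stand-in for any LLM completion call, and the aggregation prompt wording is an assumption.

```python
# Hedged sketch of Self-MoA: sample several candidates from the *same*
# strong model (in-model diversity via temperature), then have that model
# synthesize them. `generate` is a placeholder for a real LLM API call.

def generate(prompt: str, temperature: float = 0.7) -> str:
    # Placeholder: swap in a real completion call (e.g. an OpenAI-style API).
    return f"response to: {prompt[:40]}"

def self_moa(task: str, n_samples: int = 4) -> str:
    # Step 1: draw diverse candidate answers from one top-performing model.
    proposals = [generate(task, temperature=0.7) for _ in range(n_samples)]

    # Step 2: aggregate all candidates in a single synthesis prompt.
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(proposals))
    agg_prompt = (
        "Synthesize the candidate responses below into one higher-quality "
        f"answer.\n\nTask: {task}\n\nCandidates:\n{numbered}"
    )
    return generate(agg_prompt, temperature=0.0)
```

A Mixed-MoA variant would differ only in step 1, drawing each proposal from a different model instead of resampling one model.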
Performance Benchmarks:
- Evaluated on AlpacaEval 2.0, MMLU, CRUX, and MATH datasets.
- Compared quality vs. diversity trade-offs across 200+ experiments.
- Introduced Self-MoA-Seq, a sequential version that scales up aggregation without exceeding context length constraints.
Findings
Self-MoA consistently outperforms Mixed-MoA.
- Achieved 6.6% better performance than Mixed-MoA on AlpacaEval 2.0.
- Improved 3.8% on average across multiple benchmarks.
Quality trumps diversity.
- Mixing LLMs often reduces the average quality due to the inclusion of lower-performing models.
- Intra-model diversity (sampling multiple outputs from the same model) is more beneficial than inter-model diversity.
Cross-model diversity only helps in particular cases.
- When models perform similarly, diversity can marginally improve results.
- However, the gain is small (~0.3%), making Self-MoA the more reliable approach.
Scaling inference compute with Self-MoA-Seq:
- Instead of aggregating all outputs simultaneously, Self-MoA-Seq uses a sliding window method to synthesize results gradually.
- This allows models with shorter context lengths to still benefit from large-scale ensembling.
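The sliding-window idea can be sketched as follows. This is a hedged illustration of the mechanism, not the paper's code: `generate` is a hypothetical LLM call, and the window/carry-over details are assumptions.

```python
# Hedged sketch of Self-MoA-Seq: aggregate a long list of candidates a
# window at a time, carrying the running synthesis into the next window so
# no single prompt exceeds the model's context budget.

def generate(prompt: str) -> str:
    # Placeholder: swap in a real LLM API call.
    return f"synthesis over {prompt.count('[C]')} candidates"

def self_moa_seq(task: str, candidates: list[str], window: int = 3) -> str:
    running = candidates[0]  # current best synthesis so far
    for i in range(1, len(candidates), window - 1):
        chunk = candidates[i:i + window - 1]
        block = "\n".join(f"[C] {c}" for c in [running] + chunk)
        prompt = f"Task: {task}\nCombine these candidate answers:\n{block}"
        running = generate(prompt)  # synthesis replaces the whole window
    return running
```

Because each round sees only `window` candidates (the running synthesis plus a few new ones), a short-context model can still ensemble arbitrarily many outputs.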
Why Is This Paper Unique?
This research challenges a fundamental assumption in LLM ensembling: that more diverse models equal better performance. Unlike previous work emphasizing cross-model diversity, this paper provides empirical evidence that quality consistency within a single model is more effective.
Key Contributions That Make It Novel:
Redefines the Quality-Diversity Trade-off:
- Past research assumes diverse models yield better outputs.
- This paper quantifies how diversity can reduce final quality.
Introduces Self-MoA as a More Efficient Alternative:
- Instead of adding new models, it reuses the best model's outputs, making it computationally efficient.
- Lowers deployment costs by avoiding unnecessary model inference.
Brings a New Perspective to Scaling Compute at Inference Time:
- Self-MoA-Seq allows ensembling without increasing memory footprint.
- This has implications for real-world applications, where context length limits models.
Implications for AGI Development
The findings from this paper could fundamentally shape the future of AI reasoning and AGI (Artificial General Intelligence) in several ways:
Efficiency Over Model Expansion:
- Instead of training bigger and more diverse models, AGI could refine knowledge within a single high-quality model.
- This means AGI could scale compute at inference time rather than training time, making future AI systems more energy-efficient.
Better Decision-Making in Autonomous Agents:
- If AGI is structured as an agent ensemble, Self-MoA suggests that multiple outputs from a strong model are preferable over diverse weaker agents.
- This aligns with human decision-making, where experience and refined judgment matter more than sheer variety.
Solves the "Alignment vs. Performance" Trade-off:
- Many AGI models face a challenge: should we prioritize alignment (safety) or performance (creativity)?
- This research shows that alignment can be preserved without sacrificing performance, as aggregating outputs from a single model maintains quality consistency.
Could Enable More Reliable Self-Improving AGI:
- If AGI is designed to learn from its past outputs, Self-MoA could enable self-critique and iteration without external feedback.
- This removes dependence on external evaluators, making AGI more autonomous in reasoning and self-correction.
Would Self-MoA Make AGI a "Good Thing"?
Pros:
- ✅ More Efficient: Reduces computational waste in AI training and inference.
- ✅ More Reliable: Avoids the risk of bad models introducing errors into an ensemble.
- ✅ Scales Safely: Prevents degradation in output quality when increasing compute.
- ✅ Supports Alignment: Keeps control mechanisms intact without sacrificing creativity.
Cons / Risks:
- ⚠️ Risk of Overfitting to One Model's Biases: If the "top model" has a flaw, Self-MoA could amplify it.
- ⚠️ Less Diversity in Thought: Some problems may require multiple reasoning pathways.
- ⚠️ Reduces Model Competition: If AGI follows this approach, fewer LLMs may be developed, consolidating power in a few key models.
Conclusion: The Future of AI with Self-MoA
This research challenges long-held assumptions about diversity in AI ensembles and introduces a more efficient, scalable, and controlled approach to AI decision-making. As AGI advances, Self-MoA could enable models to refine their reasoning processes efficiently, making future AI systems smarter, safer, and more practical.
The next steps could involve:
- Testing Self-MoA in autonomous agent settings (e.g., AI assistants, robotics).
- Applying Self-MoA to real-world AGI decision-making tasks.
- Ensuring safeguards are in place to prevent over-amplification of model biases.