arxiv:2403.03870

Learning to Decode Collaboratively with Multiple Language Models

Published on Mar 6 · Featured in Daily Papers on Mar 7

Abstract

We propose a method to teach multiple large language models (LLMs) to collaborate by interleaving their generations at the token level. We model the decision of which LLM generates the next token as a latent variable. By optimizing the marginal likelihood of a training set under our latent variable model, the base LLM automatically learns when to generate itself and when to call on one of the "assistant" language models to generate, all without direct supervision. Token-level collaboration during decoding allows for a fusion of each model's expertise in a manner tailored to the specific task at hand. Our collaborative decoding is especially useful in cross-domain settings where a generalist base LLM learns to invoke domain expert models. On instruction-following, domain-specific QA, and reasoning tasks, we show that the performance of the joint system exceeds that of the individual models. Through qualitative analysis of the learned latent decisions, we show models trained with our method exhibit several interesting collaboration patterns, e.g., template-filling. Our code is available at https://github.com/clinicalml/co-llm.
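The latent-variable idea in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation (see their repository for that). It assumes a single assistant, hand-written toy distributions in place of real LLMs, and a hand-written deferral probability in place of a learned one. All names (`toy_base`, `toy_assistant`, `toy_defer`, `marginal_token_prob`, `collaborative_greedy_decode`) and the thresholded decoding rule are hypothetical choices made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Dist = Dict[str, float]                   # next-token distribution
Model = Callable[[Sequence[str]], Dist]   # prefix -> distribution
Defer = Callable[[Sequence[str]], float]  # prefix -> P(assistant generates)


def marginal_token_prob(token: str, prefix: Sequence[str],
                        base: Model, assistant: Model, defer: Defer) -> float:
    """Sum over the latent choice of which model emits the next token."""
    p_defer = defer(prefix)
    return ((1.0 - p_defer) * base(prefix).get(token, 0.0)
            + p_defer * assistant(prefix).get(token, 0.0))


def marginal_log_likelihood(tokens: Sequence[str], base: Model,
                            assistant: Model, defer: Defer) -> float:
    """Log-likelihood of a sequence with the latent choices marginalized out."""
    return sum(
        math.log(marginal_token_prob(tok, tokens[:t], base, assistant, defer))
        for t, tok in enumerate(tokens)
    )


def collaborative_greedy_decode(base: Model, assistant: Model, defer: Defer,
                                threshold: float = 0.5, max_tokens: int = 10,
                                eos: str = "<eos>") -> Tuple[List[str], List[str]]:
    """Greedy decoding; defer to the assistant when P(defer) exceeds threshold.

    The thresholding rule is an assumption of this sketch.
    """
    out: List[str] = []
    who: List[str] = []
    for _ in range(max_tokens):
        use_assistant = defer(out) > threshold
        dist = (assistant if use_assistant else base)(out)
        token = max(dist, key=dist.get)
        out.append(token)
        who.append("assistant" if use_assistant else "base")
        if token == eos:
            break
    return out, who


# Toy stand-ins (assumptions): the base model writes a template, the
# assistant supplies the "expert" answer at the slot the base defers on.
def toy_base(prefix: Sequence[str]) -> Dist:
    template = ["The", "answer", "is"]
    if len(prefix) < len(template):
        return {template[len(prefix)]: 0.9, "<eos>": 0.1}
    if len(prefix) == len(template):
        return {"7": 0.5, "42": 0.5}      # base is unsure about the answer
    return {"<eos>": 1.0}


def toy_assistant(prefix: Sequence[str]) -> Dist:
    return {"42": 0.9, "7": 0.1}


def toy_defer(prefix: Sequence[str]) -> float:
    return 0.9 if len(prefix) == 3 else 0.1


tokens, who = collaborative_greedy_decode(toy_base, toy_assistant, toy_defer)
print(list(zip(tokens, who)))
```

In this toy setup the base model writes the template tokens and hands off only the answer slot, a small-scale analogue of the template-filling pattern the abstract mentions. In a trained system the deferral probability would be learned by maximizing the marginal likelihood rather than written by hand.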

Community

I've been thinking of ways to implement a "rewind" function, but your deferral function is a lot more elegant than what I've been contemplating. Really nice work, this is the real MoE. This could be incredibly useful, especially as models become smaller and we can afford to put many of them on the GPU.

Do you think this can or could work with a single model that "hot swaps" lora type modules on the fly? That would be interesting...

Paper author

Thanks a lot for your interest!

> a single model that "hot swaps" lora type modules on the fly

You might be interested in PHATGOOSE, which is essentially this idea!

I also wonder whether this could use a single model, with different hyperparameters, in such a way as to reduce hallucinations?

How is this different from mixture of experts?

Paper author

That depends on what you mean by mixture of experts! It is a mixture of experts in the broad sense that it combines the generations of multiple models. But there are several differences from other MoE work, which we tried to outline in Section 6. The key is that this approach works with off-the-shelf experts and makes no assumptions about how the assistant and the base model should interact, allowing the interaction pattern to be learned from data.
