
Mixture of Attentions for Speculative Decoding

This checkpoint accompanies the paper "Mixture of Attentions For Speculative Decoding" by Matthieu Zimmer*, Milan Gritta*, Gerasimos Lampouras, Haitham Bou Ammar, and Jun Wang. The paper introduces a novel architecture for speculative decoding that accelerates large language model (LLM) inference.

It is supported in vLLM; see our GitHub repository.
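
Below is a minimal sketch of serving the base model with this draft checkpoint through vLLM's speculative decoding. The argument names for speculative decoding differ across vLLM versions, and this draft model may require the MOA-enabled build from our GitHub repository, so treat the snippet as illustrative rather than a definitive recipe.

```python
# Illustrative only: exact kwargs depend on the vLLM version / MOA-enabled build.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",                  # target (base) model
    speculative_model="huawei-noah/MOASpec-Llama-3-8B-Instruct",  # MOA Spec draft model
    num_speculative_tokens=5,                                     # draft tokens proposed per step
)

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```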

Checkpoints

| Base Model | MOA Spec on Hugging Face | Base Model Parameters | MOA Spec Parameters |
|---|---|---|---|
| meta-llama/Meta-Llama-3-8B-Instruct | huawei-noah/MOASpec-Llama-3-8B-Instruct | 8B | 0.25B |
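
To fetch the draft checkpoint locally, a standard snapshot download with huggingface_hub works; the repo id below matches the table above:

```python
from huggingface_hub import snapshot_download

# Downloads all checkpoint files and returns the local directory path.
local_dir = snapshot_download(repo_id="huawei-noah/MOASpec-Llama-3-8B-Instruct")
print(local_dir)
```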

Citation

If you use this code or this checkpoint in your research, please cite our paper:

@misc{zimmer2024mixtureattentionsspeculativedecoding,
      title={Mixture of Attentions For Speculative Decoding}, 
      author={Matthieu Zimmer and Milan Gritta and Gerasimos Lampouras and Haitham Bou Ammar and Jun Wang},
      year={2024},
      eprint={2410.03804},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.03804}, 
}

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Disclaimer: This open source project is not an official Huawei product; Huawei is not expected to provide support for this project.
