arxiv:2401.15077

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Published on Jan 26 · Featured in Daily Papers on Jan 29

Abstract

Auto-regressive decoding makes the inference of Large Language Models (LLMs) time-consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), for lossless acceleration. Unlike traditional speculative sampling methods, EAGLE operates the drafting process auto-regressively at the more regular (second-top-layer) feature level and addresses the sampling uncertainty issues in the next-feature prediction problems by integrating tokens from one time step ahead. The acceleration provided by EAGLE is lossless: it involves no fine-tuning of the target LLM, and the generated text maintains the same distribution as that of vanilla auto-regressive decoding. As of the submission of this paper, EAGLE is the fastest known framework within the speculative sampling family. On MT-bench, EAGLE is 3x faster than vanilla decoding, 2x faster than Lookahead, and 1.6x faster than Medusa. Using gpt-fast, EAGLE attains on average 160 tokens/s with LLaMA2-Chat 13B on a single RTX 3090 GPU, compared to 24 tokens/s of Huggingface's implementations.
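As a minimal sketch of the feature-level drafting loop the abstract describes (not the authors' implementation): the draft network consumes the target LLM's second-top-layer feature together with the embedding of the token sampled one step ahead, predicts the next feature, and reuses the target's LM head to turn drafted features into draft tokens. The `draft_net` architecture, the sizes, and the greedy drafting below are illustrative assumptions; EAGLE's actual draft head and tree-structured drafting are described in the paper.

```python
# A minimal sketch of EAGLE-style feature-level drafting (illustrative, untrained).
import torch
import torch.nn as nn

hidden_size, vocab_size = 4096, 32000   # illustrative sizes, roughly LLaMA2-13B-like

# In EAGLE these are taken frozen from the target LLM; random weights stand in here.
embed = nn.Embedding(vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# Stand-in for the trained draft head. The real EAGLE head is a small
# transformer-style module; a two-layer MLP is used here only for illustration.
draft_net = nn.Sequential(
    nn.Linear(2 * hidden_size, hidden_size),
    nn.SiLU(),
    nn.Linear(hidden_size, hidden_size),
)

@torch.no_grad()
def draft_tokens(feature, next_token, num_draft=4):
    """Auto-regressively draft `num_draft` tokens at the feature level.

    feature    : (1, hidden_size) second-top-layer feature of the last verified token.
    next_token : (1,) token the target LLM sampled from that feature.
    """
    drafts = []
    for _ in range(num_draft):
        # Key idea from the abstract: condition the next-feature prediction on the
        # token from one time step ahead to resolve the sampling uncertainty.
        tok_emb = embed(next_token)                                # (1, hidden_size)
        feature = draft_net(torch.cat([feature, tok_emb], dim=-1))
        logits = lm_head(feature)                                  # reuse the target's LM head
        next_token = logits.argmax(dim=-1)                         # greedy draft; the paper drafts a tree
        drafts.append(next_token.item())
    return drafts  # these draft tokens are then verified by the target LLM

print(draft_tokens(torch.zeros(1, hidden_size), torch.tensor([1])))
```

The drafted tokens are only proposals; the target LLM verifies them in a single parallel pass, which is what keeps the output distribution identical to vanilla decoding.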

Community

> Instead of employing text generated by the target LLM, we utilize a fixed dataset, substantially reducing the overhead.

Why not aim to be more faithful to the target LLM? And how would you feed the second-top-layer feature activation (Figure 7) to the draft model without using the target LLM during training?
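A hedged sketch of how a fixed dataset could still supply those features: the frozen target LLM is run once over the dataset text (a plain forward pass, no generation), and the hidden state that feeds its LM head is recorded as the training input for the draft model. The model name and hidden-state indexing below are illustrative assumptions, not the authors' exact pipeline.

```python
# Hedged sketch: harvest second-top-layer features from a fixed text dataset
# with a single forward pass of the frozen target LLM (no generation needed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-13b-chat-hf"   # illustrative target LLM
tok = AutoTokenizer.from_pretrained(name)
target = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
target.eval()

@torch.no_grad()
def collect_features(text):
    ids = tok(text, return_tensors="pt").input_ids
    out = target(ids, output_hidden_states=True)
    # The last entry of hidden_states is the feature fed to the LM head, i.e. the
    # level at which EAGLE drafts; paired with the shifted token ids it yields
    # (feature_t, token_{t+1}) -> feature_{t+1} training examples for the drafter.
    return out.hidden_states[-1], ids       # (1, seq_len, hidden_size), (1, seq_len)
```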

Can this be combined with the idea from the paper "Accelerating LLM Inference with Staged Speculative Decoding" where a smaller draft model is used to accelerate the process of obtaining predictions from the main draft model?
Alternatively, could it be combined with the "Prompt Lookup Decoding" method?

> Can this be combined with the idea from the paper "Accelerating LLM Inference with Staged Speculative Decoding"

+1, as in the following two ideas from that paper:

  1. Use a shallow but wide tree to increase parallelism during verification. My reading of EAGLE is that this is already a big part of its speedup when parallel batch verification is enabled (e.g. see Figure 7).
  2. Stage speculation within the smaller draft model itself. I think this is a great idea: the EAGLE draft model could be combined with a second, even lighter drafter (e.g. the n-gram approach described in your Prompt Lookup Decoding); a minimal sketch of such an n-gram drafter follows this list. The thinking is that the deeper the speculation goes, the more tentative the drafted tokens become, so fewer resources should be spent computing them.
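Here is the sketch of the n-gram "prompt lookup" drafter mentioned in idea 2, under the assumption that it simply copies the continuation of the most recent matching n-gram from the existing context; the function and parameter names are illustrative, not taken from either paper.

```python
# Minimal prompt-lookup drafter: match the trailing n-gram of the context against
# earlier occurrences and, on a hit, copy the continuation as cheap draft tokens.
# Works on any sequence; in practice it would run over token ids.
def prompt_lookup_draft(token_ids, ngram_size=3, num_draft=5):
    """Return up to `num_draft` speculative tokens copied from the context."""
    if len(token_ids) < ngram_size:
        return []
    tail = token_ids[-ngram_size:]
    # Scan backwards so the most recent earlier match is used first.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == tail:
            continuation = token_ids[start + ngram_size:start + ngram_size + num_draft]
            if continuation:
                return continuation
    return []

# Example: "the draft model" recurs, so its earlier continuation is proposed.
ctx = "the draft model predicts features and the draft model".split()
print(prompt_lookup_draft(ctx, ngram_size=2, num_draft=3))
```

In a combined setup, such a lookup could act as the cheap second-stage drafter, with EAGLE's feature-level head as the first stage and the target LLM as the final verifier.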

Perhaps it would also be worth using a fast feedforward network in the draft model to make it even faster?
