EAGLE3 acceptance rate varies significantly across prompt distributions: seeking guidance on data/domain alignment

#3
by matoujin - opened

[The following content was translated with the help of AI.]

Hi, we are evaluating EAGLE3 speculative decoding with GLM-5.1 and noticed that the acceptance rate appears to be highly sensitive to the prompt distribution.

In our TP8 setup with num_speculative_tokens=1, code-generation-style prompts show relatively good position-0 acceptance. For example, on HumanEval-style prompts, we observed acceptance rates of 20% or higher in many cases.

However, on our production-like business prompts, the acceptance rate is much lower:

  • Real business requests, max_tokens=2: 0 / 30 position-0 accepted tokens
  • Real business requests, max_tokens=64: 5 / 624, around 0.8%
  • General chat prompts also showed lower acceptance than code-generation prompts
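
For reference, the percentages above follow directly from the raw counts. A minimal sketch of that arithmetic, assuming the denominator is the number of position-0 draft tokens proposed in each run (the `Bucket` class and its field names are illustrative, not part of any serving framework):

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """Acceptance counts for one prompt category (hypothetical helper)."""
    name: str
    accepted: int  # position-0 draft tokens accepted by the target model
    drafted: int   # position-0 draft tokens proposed (assumed denominator)

    @property
    def rate(self) -> float:
        # Guard against empty buckets so the report never divides by zero.
        return self.accepted / self.drafted if self.drafted else 0.0


# Counts taken from the two business-request runs listed above.
buckets = [
    Bucket("business, max_tokens=2", accepted=0, drafted=30),
    Bucket("business, max_tokens=64", accepted=5, drafted=624),
]

for b in buckets:
    print(f"{b.name}: {b.accepted}/{b.drafted} = {b.rate:.1%}")
```

Adding a bucket for each prompt category (code generation, general chat, business) makes it easy to compare categories under the same serving configuration.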

The implementation itself seems to be working, since speculative decoding is active and acceptance is visible on code-generation prompts. So we are trying to understand whether this is mainly caused by a domain/data-distribution mismatch between the EAGLE3 draft model and our real traffic.

Could you share some guidance on the following?

  1. What kind of data distribution was used to train or distill this EAGLE3 draft model?
  2. Is it expected that code-generation benchmarks have much higher acceptance than long multi-turn business/chat prompts?
  3. Are long multi-turn prompts or production-style chat data included in the draft model training distribution?
  4. Would continued training or distillation on our own traffic distribution be the recommended way to improve acceptance?
  5. Are there suggested diagnostics to distinguish data-distribution mismatch from chat-template/token-alignment issues?

We are not sure yet whether this is expected behavior or a sign that our serving distribution is not well matched to the current draft model.
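
One simple check for the chat-template/token-alignment side of question 5 is to compare the token IDs the server actually receives with the token IDs produced by the reference chat template for the same conversation. The sketch below only shows the comparison step. How the two ID lists are obtained depends on the setup, and `first_divergence` is a hypothetical helper, not an API of any library:

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_divergence(reference_ids: Sequence[int],
                     served_ids: Sequence[int]) -> Optional[int]:
    """Return the first index where two token-ID sequences differ.

    Returns None when the sequences are identical. A length mismatch
    counts as a divergence at the end of the shorter sequence.
    """
    for i, (ref, got) in enumerate(zip_longest(reference_ids, served_ids)):
        if ref != got:
            return i
    return None


# Illustrative IDs only: the two sequences differ at index 2.
print(first_divergence([1, 2, 3, 4], [1, 2, 9, 4]))
```

If the sequences match for business prompts and acceptance is still low, a template or tokenization problem becomes less likely and a distribution mismatch more likely.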

Any suggestions would be appreciated.

Yes, it is mainly caused by a domain/data-distribution mismatch between the EAGLE3 draft model and your real traffic.
We recommend regenerating the dataset with GLM-5.1 using your own traffic distribution, and then performing continued training and fine-tuning on it. After continued training on our own Chinese business dataset, the accepted length improved: with 4 speculative tokens, the average accepted length was around 2.7–3.0 tokens.
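
To make the accepted-length metric concrete, here is a minimal sketch of how it can be computed from per-position acceptance counts. The counts are hypothetical and are not the measurements quoted above. Some tools also add 1 for the bonus token that the target model emits on every step, and the figure above does not say which convention it uses, so the sketch exposes both:

```python
from typing import Sequence


def mean_accepted_length(accepted_per_position: Sequence[int],
                         num_draft_steps: int,
                         include_bonus_token: bool = False) -> float:
    """Average number of tokens accepted per speculative step.

    accepted_per_position[i] is how many steps accepted the draft token
    at position i. Summing over positions gives the total number of
    accepted draft tokens.
    """
    if num_draft_steps <= 0:
        raise ValueError("num_draft_steps must be positive")
    mean = sum(accepted_per_position) / num_draft_steps
    # Optionally count the token the target model always contributes.
    return mean + 1.0 if include_bonus_token else mean


# Hypothetical counts for 1000 steps with 4 speculative tokens each.
counts = [800, 600, 450, 350]
print(mean_accepted_length(counts, 1000))
print(mean_accepted_length(counts, 1000, include_bonus_token=True))
```

Tracking this number separately for each prompt category before and after continued training shows whether the draft model has actually moved closer to the serving distribution.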
