EAGLE3 acceptance rate varies significantly across prompt distributions: seeking guidance on data/domain alignment
[The following content was translated with the help of AI.]
Hi, we are evaluating EAGLE3 speculative decoding with GLM-5.1 and noticed that the acceptance rate appears to be highly sensitive to the prompt distribution.
In our TP8 setup with num_speculative_tokens=1, code-generation style prompts show relatively good position-0 acceptance. For example, on HumanEval-style prompts, we observed acceptance rates around 20%+ in many cases.
However, on our production-like business prompts, the acceptance rate is much lower:
- Real business requests, max_tokens=2: 0 / 30 position-0 accepted tokens
- Real business requests, max_tokens=64: 5 / 624, around 0.8%
- General chat prompts also showed lower acceptance than code-generation prompts
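For reference, a minimal sketch of how the percentages above can be reproduced and sanity-checked. It assumes each count is position-0 accepted tokens over position-0 drafted tokens. The helper names `acceptance_rate` and `wilson_interval` are illustrative and do not come from any serving framework. The Wilson interval is added only to show how much uncertainty a sample as small as 30 drafts carries.

```python
import math

def acceptance_rate(accepted: int, drafted: int) -> float:
    """Fraction of drafted tokens that the target model accepted."""
    if drafted <= 0:
        raise ValueError("drafted must be positive")
    return accepted / drafted

def wilson_interval(accepted: int, drafted: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for the acceptance rate."""
    p = acceptance_rate(accepted, drafted)
    denom = 1.0 + z * z / drafted
    center = (p + z * z / (2 * drafted)) / denom
    half = z * math.sqrt(p * (1 - p) / drafted + z * z / (4 * drafted ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Counts taken from the two business-traffic runs listed above.
runs = {
    "business, max_tokens=2": (0, 30),
    "business, max_tokens=64": (5, 624),
}

for name, (accepted, drafted) in runs.items():
    low, high = wilson_interval(accepted, drafted)
    rate = acceptance_rate(accepted, drafted)
    print(f"{name}: {rate:.1%} (95% interval {low:.1%} to {high:.1%})")
```

Even with only 30 drafted tokens, the upper end of the interval stays well below the 20% seen on code-generation prompts, so the gap is unlikely to be sampling noise alone.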
The implementation itself seems to be working, since speculative decoding is active and acceptance is visible on code-generation prompts. So we are trying to understand whether this is mainly caused by a domain/data-distribution mismatch between the EAGLE3 draft model and our real traffic.
Could you share some guidance on the following?
- What kind of data distribution was used to train or distill this EAGLE3 draft model?
- Is it expected that code-generation benchmarks have much higher acceptance than long multi-turn business/chat prompts?
- Are long multi-turn prompts or production-style chat data included in the draft model training distribution?
- Would continued training or distillation on our own traffic distribution be the recommended way to improve acceptance?
- Are there suggested diagnostics to distinguish data-distribution mismatch from chat-template/token-alignment issues?
We are not sure yet whether this is expected behavior or a sign that our serving distribution is not well matched to the current draft model.
Any suggestions would be appreciated.
Yes, it is mainly caused by a domain/data-distribution mismatch between the EAGLE3 draft model and our real traffic.
It is recommended to regenerate the dataset with GLM-5.1 using your own traffic distribution, and then perform continued training and fine-tuning. After continued training on our own Chinese business dataset, the accepted length improved: with 4 speculative tokens, the average accepted length was around 2.7 to 3.0 tokens.
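A minimal sketch of how an average accepted length and per-position acceptance rates can be derived from per-step logs, so that numbers before and after continued training are compared on the same definition. It assumes "accepted length" counts only accepted draft tokens per verification step and excludes the extra token the target model emits itself; some tools include that token, which shifts the figure by one. The function name `summarize_acceptance` and the sample list are hypothetical and are not measurements from this thread.

```python
from typing import List, Tuple

def summarize_acceptance(
    accepted_per_step: List[int], num_speculative_tokens: int
) -> Tuple[float, List[float]]:
    """Return (mean accepted draft tokens per step, acceptance rate per position)."""
    if not accepted_per_step:
        raise ValueError("need at least one verification step")
    if any(a < 0 or a > num_speculative_tokens for a in accepted_per_step):
        raise ValueError("accepted count out of range")
    steps = len(accepted_per_step)
    mean_accepted = sum(accepted_per_step) / steps
    # Draft position i counts as accepted in a step only if at least
    # i + 1 draft tokens were accepted (acceptance stops at the first reject).
    per_position = [
        sum(1 for a in accepted_per_step if a > i) / steps
        for i in range(num_speculative_tokens)
    ]
    return mean_accepted, per_position

# Made-up per-step counts for illustration only (4 speculative tokens).
sample = [4, 3, 2, 4, 1, 3, 0, 4]
mean_accepted, per_position = summarize_acceptance(sample, num_speculative_tokens=4)
print(f"mean accepted length: {mean_accepted:.3f}")
for i, rate in enumerate(per_position):
    print(f"position {i}: {rate:.1%}")
```

The per-position rates sum to the mean accepted length, which is a quick consistency check when comparing logs from different runs.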