amd/Kimi-K2.5-Eagle3-FP8



Add vLLM reproduction guide and Eagle3 speculative-decoding results

#3

by larryli2 - opened 9 days ago

base: refs/heads/main ← from: refs/pr/3


Add vLLM reproduction guide and Eagle3 speculative-decoding results (a2cbf1fa)

Add two sections to the model card so others can reproduce the numbers:

- 'Reproduction': a vLLM EAGLE3 speculative-decoding recipe for AMD Instinct MI355X, covering Docker images, ROCm/AITER environment variables, vllm serve with --speculative-config, and a vllm bench serve throughput sweep. Setup: target model amd/Kimi-K2.5-MXFP4, FP8 draft model amd/kimi-k2.5-eagle3-fp8, tensor parallelism TP=4, input/output sequence length (ISL/OSL) of 1K/1K.
- 'Results': throughput in tok/s/GPU for no speculative decoding vs. BF16 Eagle3 vs. FP8 Eagle3, at concurrency 4/8/16/32/64.

This PR only adds content; existing sections are unchanged.
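For readers who want a feel for what the recipe involves before opening the model card, the sketch below assembles the two kinds of commands the description mentions: one vllm serve launch with a speculative config, and one vllm bench serve run per concurrency level. It is not the PR's actual recipe. The Docker images and ROCm/AITER environment variables are omitted, and several values are assumptions made for illustration: num_speculative_tokens=3, reading "1K" as 1024 tokens, and ten prompts per concurrency slot. The model names, TP=4, the concurrency levels, and the 0.8 range ratio come from this PR.

```python
import json
import shlex

TARGET = "amd/Kimi-K2.5-MXFP4"       # target model named in the PR
DRAFT = "amd/kimi-k2.5-eagle3-fp8"   # FP8 Eagle3 draft model named in the PR
TP = 4                               # tensor-parallel size from the PR
CONCURRENCIES = [4, 8, 16, 32, 64]   # sweep points from the PR


def serve_command(num_speculative_tokens=3):
    """Build the server launch command.

    num_speculative_tokens=3 is an assumed value, not taken from the PR.
    """
    spec = {
        "method": "eagle3",
        "model": DRAFT,
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", TARGET,
        "--tensor-parallel-size", str(TP),
        "--speculative-config", json.dumps(spec),
    ]


def bench_command(concurrency, prompts_per_slot=10):
    """Build one benchmark run of the sweep.

    1024-token lengths and prompts_per_slot=10 are assumptions;
    the PR only states ISL/OSL=1K/1K.
    """
    return [
        "vllm", "bench", "serve",
        "--model", TARGET,
        "--dataset-name", "random",
        "--random-input-len", "1024",
        "--random-output-len", "1024",
        "--random-range-ratio", "0.8",
        "--max-concurrency", str(concurrency),
        "--num-prompts", str(concurrency * prompts_per_slot),
    ]


def tokens_per_second_per_gpu(total_tokens_per_second, tp=TP):
    """Normalize aggregate throughput by the number of GPUs in use."""
    return total_tokens_per_second / tp


if __name__ == "__main__":
    # Print shell-quoted commands instead of running them.
    print(shlex.join(serve_command()))
    for c in CONCURRENCIES:
        print(shlex.join(bench_command(c)))
```

Printing the commands rather than executing them keeps the sketch runnable on a machine without vLLM or GPUs; the no-spec baseline would presumably use the same launch without --speculative-config.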

Set benchmark --random-range-ratio to 0.8 (d37d24c9)

chaoli-amd changed pull request status to merged 9 days ago
