Transformers
Safetensors
llama
speculative-decoding
eagle3
draft-model
kimi-k2.5
fp8
amd-quark
quantized
no-lm-head-quantization
text-generation-inference
quark
Add vLLM reproduction guide and Eagle3 speculative-decoding results
#3
by larryli2 - opened
Add two sections to the model card so others can reproduce the numbers:
- 'Reproduction': vLLM EAGLE3 speculative-decoding recipe on AMD Instinct MI355X (Docker images, ROCm/AITER env vars, `vllm serve` with `--speculative-config`, and a `vllm bench serve` throughput sweep). Target amd/Kimi-K2.5-MXFP4, FP8 draft amd/kimi-k2.5-eagle3-fp8, TP=4, ISL/OSL=1K/1K.
- 'Results': no-spec vs BF16 Eagle3 vs FP8 Eagle3 tok/s/GPU at concurrency 4/8/16/32/64.
Addition only; existing sections are unchanged.
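For readers who want a feel for the shape of the recipe before opening the model card, the serve and sweep steps described above can be sketched roughly as follows. This is not the exact recipe from the PR: the Docker images and ROCm/AITER environment variables are omitted, `num_speculative_tokens=3` and the number of prompts per concurrency level are placeholder assumptions, "1K" is assumed to mean 1024 tokens, and tok/s/GPU is assumed to be aggregate throughput divided by the TP size. The script only builds and prints the command lines; it does not launch anything.

```python
import json
import shlex

# Values taken from the PR description.
TARGET = "amd/Kimi-K2.5-MXFP4"        # target model
DRAFT = "amd/kimi-k2.5-eagle3-fp8"    # FP8 Eagle3 draft model
TP = 4                                # tensor-parallel size
CONCURRENCIES = [4, 8, 16, 32, 64]    # sweep points from the 'Results' section
ISL = OSL = 1024                      # ASSUMPTION: "1K/1K" means 1024 tokens each


def serve_cmd(num_speculative_tokens=3):
    """Build a `vllm serve` argv with an Eagle3 speculative config.

    ASSUMPTION: num_speculative_tokens=3 is a placeholder, not the PR's value.
    """
    spec = {
        "method": "eagle3",
        "model": DRAFT,
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", TARGET,
        "--tensor-parallel-size", str(TP),
        "--speculative-config", json.dumps(spec),
    ]


def bench_cmd(concurrency, prompts_per_slot=10):
    """Build a `vllm bench serve` argv for one concurrency level.

    ASSUMPTION: prompts_per_slot is a hypothetical knob chosen for illustration.
    """
    return [
        "vllm", "bench", "serve",
        "--model", TARGET,
        "--dataset-name", "random",
        "--random-input-len", str(ISL),
        "--random-output-len", str(OSL),
        "--max-concurrency", str(concurrency),
        "--num-prompts", str(concurrency * prompts_per_slot),
    ]


def tok_per_s_per_gpu(total_tok_per_s, gpus=TP):
    """ASSUMPTION: the per-GPU metric is aggregate throughput / GPU count."""
    return total_tok_per_s / gpus


if __name__ == "__main__":
    print(shlex.join(serve_cmd()))
    for c in CONCURRENCIES:
        print(shlex.join(bench_cmd(c)))
```

Dropping the `--speculative-config` argument from the serve command would give the no-spec baseline, and pointing `DRAFT` at a BF16 Eagle3 checkpoint would give the third column of the comparison; the model card itself remains the authoritative source for the exact flags and environment.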
chaoli-amd changed pull request status to merged