Transformers
Safetensors
llama
speculative-decoding
eagle3
draft-model
kimi-k2.5
fp8
amd-quark
quantized
no-lm-head-quantization
text-generation-inference
quark
Instructions to use amd/Kimi-K2.5-Eagle3-FP8 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use amd/Kimi-K2.5-Eagle3-FP8 with Transformers:
    # Load model directly
    from transformers import AutoTokenizer, LlamaForCausalLMEagle3

    tokenizer = AutoTokenizer.from_pretrained("amd/Kimi-K2.5-Eagle3-FP8")
    model = LlamaForCausalLMEagle3.from_pretrained("amd/Kimi-K2.5-Eagle3-FP8")
- Notebooks
- Google Colab
- Kaggle
Uppercase model name, set Quark version to v0.12, add tokenizer files
#5
by larryli2 - opened
Three combined changes:
- Model card: capitalize the model name to Kimi-K2.5-Eagle3-FP8 (all occurrences).
- Model card: shorten the AMD Quark version to v0.12 wherever it appeared (Model Optimizer line, quantization details, environment table).
- Add the tokenizer bundle so that the documented AutoTokenizer.from_pretrained(..., trust_remote_code=True) call works and matches the moonshotai/Kimi-K2.5 target tokenizer used for Eagle3 speculative decoding. Files added: tokenizer_config.json, tiktoken.model, tokenization_kimi.py, tool_declaration_ts.py (imported by tokenization_kimi.py), and chat_template.jinja. The token IDs bos=[BOS] 163584 and eos=[EOS] 163585 match this model's config.json. Verified that the tokenizer loads as TikTokenTokenizer, encodes text, and applies the chat template correctly. Multimodal/MoE/vision modeling files from the target were intentionally not copied, since this draft is a text-only LlamaForCausalLMEagle3. The target's generation_config.json was also skipped because its eos_token_id (163586) conflicts with this model's config (163585).
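The eos_token_id mismatch described above can be checked mechanically. The following is a minimal sketch, not the check the PR author ran: the helper name find_token_id_conflicts is hypothetical, the dictionaries hold only the values quoted in this PR rather than the full files, and bos_token_id / eos_token_id are assumed to be the key names used in both JSON files.

```python
import json
from pathlib import Path


def load_json(path):
    """Read a JSON file such as config.json or generation_config.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def find_token_id_conflicts(config, other, keys=("bos_token_id", "eos_token_id")):
    """Return {key: (config_value, other_value)} for each key that is
    present in both mappings but holds different values."""
    conflicts = {}
    for key in keys:
        if key in config and key in other and config[key] != other[key]:
            conflicts[key] = (config[key], other[key])
    return conflicts


# Values quoted in the PR description (assumed excerpts, not the full files).
draft_config = {"bos_token_id": 163584, "eos_token_id": 163585}
target_generation_config = {"eos_token_id": 163586}

conflicts = find_token_id_conflicts(draft_config, target_generation_config)
print(conflicts)  # {'eos_token_id': (163585, 163586)}
```

With local copies of the files, the same comparison could be run as find_token_id_conflicts(load_json("config.json"), load_json("generation_config.json")). Note that some generation_config.json files store eos_token_id as a list; this sketch compares values as-is and would report a list as differing from a single integer.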
larryli2 changed pull request status to closed