Kimi-K2.5-P48-NVFP4-W4A4-Preview
A preview W4A4 compressed checkpoint of moonshotai/Kimi-K2.5, combining paired-4:8 structured sparsity with NVFP4 weight + activation quantization on MoE expert weights — targeting NVIDIA Blackwell sparse tensor cores.
Preview status. This is an early checkpoint release for community evaluation. Benchmark numbers below may move before the final release.
Why this release? Most production LLM compression today picks one axis: either pure quantization (NVFP4, FP8, INT4) or pure sparsity (2:4), and rarely both at once on frontier-scale MoE models. This release goes one step further and demonstrates that sparsity + W4A4 quantization is viable on a trillion-parameter MoE, with end-to-end NVFP4×NVFP4 grouped GEMMs via FlashInfer's MoE FP4 kernels. We hope it motivates more open work (i.e., kernels, recipes, and tooling) on combined sparse–quantized compression.
- Base model: moonshotai/Kimi-K2.5 (MoE, 384 experts)
- Compression scheme: NVFP4 W4A4 (W = NVFP4, A = NVFP4 dynamic-local) + paired-4:8 sparsity on non-shared experts
- Effective precision: ~2 bits/weight on non-shared expert linears (paired-4:8 × NVFP4)
- Inference path: FlashInfer NVFP4×NVFP4 grouped GEMMs (Blackwell B200/B300, SM100/SM120)
- Checkpoint size: 595 GB (storage is still dense-NVFP4; sparse storage is future work — see Future Work)
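The "~2 bits/weight" figure can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the helper name `effective_bits` is hypothetical, it ignores the FP32 global scales and any sparse metadata, and the last line assumes group scales would still be stored once per 16 original elements in a sparse layout.

```python
def effective_bits(value_bits: float, density: float,
                   scale_bits: float = 0.0, group_size: int = 16) -> float:
    """Average stored bits per original weight element.

    value_bits : bits per kept value (4 for NVFP4 / E2M1)
    density    : fraction of elements kept (0.5 for paired-4:8)
    scale_bits : bits per group scale (8 for FP8 E4M3), 0 to ignore scales
    group_size : number of elements sharing one group scale
    """
    return value_bits * density + scale_bits / group_size


# FP4 payload only, half of the elements kept: the "~2 bits/weight" figure.
payload_only = effective_bits(4, 0.5)

# Dense NVFP4 as shipped in this preview: zeros are stored, plus FP8 scales.
dense_with_scales = effective_bits(4, 1.0, scale_bits=8)

# ASSUMPTION: a sparse layout that keeps one FP8 scale per 16 original elements.
sparse_with_scales = effective_bits(4, 0.5, scale_bits=8)

print(payload_only, dense_with_scales, sparse_with_scales)
```

Under these assumptions the payload-only number is 2.0 bits, while the dense storage used by this checkpoint costs 4.5 bits per expert weight once the group scales are counted.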
Compression Details
| Field | Value |
|---|---|
| Weight dtype | NVFP4 (E2M1) |
| Weight group size | 16 |
| Weight scale dtype | FP8 E4M3, per-group |
| Weight global scale | FP32, per-tensor |
| Activation dtype | NVFP4 (E2M1), dynamic: "local" |
| Activation group size | 16 |
| Activation group scale dtype | FP8 E4M3, per-group (computed per batch at runtime) |
| Activation global scale | FP32, per-tensor (stored in checkpoint, per expert linear) |
| Sparsity | Paired 4:8 (NVIDIA Blackwell) |
| Quantized + sparsified layers | Non-shared MoE experts (gate_proj, up_proj, down_proj) |
| Uncompressed | lm_head, self_attn.*, shared_experts.*, router, embeddings |
| Format | compressed-tensors (NVFP4PackedCompressor) |
Paired-4:8 sparsity. Every 8 contiguous elements form 4 pairs of 2; exactly 2 of the 4 pairs are nonzero.
The zeroed positions are stored as FP4 zero codes inside weight_packed, so the sparsity structure is implicit — there is no separate bitmask tensor in the file.
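To make the pattern concrete, here is a minimal NumPy sketch of a paired-4:8 check and a simple mask builder. Both function names are hypothetical, and the magnitude rule used to pick the kept pairs is only a stand-in for illustration; it is not the mask-selection method used for this release. The check accepts "at most 2" nonzero pairs because a kept pair can still quantize to zero.

```python
import numpy as np


def is_paired_48(weights) -> bool:
    """True if every 8-element block (last axis) has at most 2 nonzero pairs."""
    w = np.asarray(weights)
    if w.shape[-1] % 8 != 0:
        raise ValueError("last dimension must be a multiple of 8")
    # Shape (..., blocks, 4 pairs, 2 elements per pair).
    pairs = w.reshape(*w.shape[:-1], -1, 4, 2)
    pair_is_nonzero = np.any(pairs != 0, axis=-1)
    return bool(np.all(pair_is_nonzero.sum(axis=-1) <= 2))


def magnitude_paired_48(weights) -> np.ndarray:
    """Zero the 2 lowest-energy pairs in each 8-element block.

    ASSUMPTION: plain magnitude selection, used here only to produce a
    valid paired-4:8 pattern for demonstration.
    """
    w = np.asarray(weights, dtype=np.float32)
    if w.shape[-1] % 8 != 0:
        raise ValueError("last dimension must be a multiple of 8")
    pairs = w.reshape(*w.shape[:-1], -1, 4, 2)
    energy = np.square(pairs).sum(axis=-1)            # (..., blocks, 4)
    keep = np.argsort(-energy, axis=-1, kind="stable")[..., :2]
    mask = np.zeros(energy.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return (pairs * mask[..., None]).reshape(w.shape)


dense = np.arange(1, 17, dtype=np.float32).reshape(2, 8)
sparse = magnitude_paired_48(dense)
print(is_paired_48(dense), is_paired_48(sparse))
print(sparse[0])
```

Zeros are written in place, so the output keeps the dense shape, which mirrors how this checkpoint stores the pattern without a separate bitmask.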
Per-linear keys:
- weight_packed: FP4 values, full K dimension
- weight_scale: FP8 E4M3 per-16 group weight scales
- weight_global_scale: FP32 per-tensor weight global scale
- input_global_scale: FP32 per-tensor activation global scale
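As a rough illustration of how these tensors fit together, the sketch below decodes 4-bit E2M1 codes and applies the group and global scales in NumPy. The E2M1 value table is standard; everything else is an assumption made for the example: the function names are hypothetical, the nibble order (low nibble first) is not confirmed by this card, the group scales are taken as already cast from FP8 to float32, and the dequantization rule `value * group_scale / global_scale` should be verified against the compressed-tensors implementation before relying on it.

```python
import numpy as np

# E2M1 magnitudes indexed by the low 3 bits of a code; bit 3 is the sign.
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
                           dtype=np.float32)


def decode_e2m1(codes) -> np.ndarray:
    """Map 4-bit E2M1 codes (values 0..15) to float32 values."""
    codes = np.asarray(codes, dtype=np.uint8)
    magnitude = E2M1_MAGNITUDES[codes & 0x7]
    sign = np.where((codes & 0x8) != 0, -1.0, 1.0).astype(np.float32)
    return sign * magnitude


def unpack_nibbles(packed) -> np.ndarray:
    """Split each byte into two 4-bit codes along the last axis.

    ASSUMPTION: the low nibble holds the earlier element.
    """
    packed = np.asarray(packed, dtype=np.uint8)
    pairs = np.stack([packed & 0x0F, packed >> 4], axis=-1)
    return pairs.reshape(*packed.shape[:-1], -1)


def dequantize(codes, group_scale, global_scale, group_size: int = 16):
    """Dequantize codes of shape (..., K) with scales of shape (..., K // group_size).

    ASSUMPTION: value * group_scale / global_scale.
    """
    values = decode_e2m1(codes)
    scales = np.repeat(np.asarray(group_scale, dtype=np.float32),
                       group_size, axis=-1)
    return values * scales / np.float32(global_scale)


codes = unpack_nibbles(np.array([[0x21] * 8], dtype=np.uint8))  # 16 codes
print(dequantize(codes, group_scale=[[4.0]], global_scale=2.0))
```

Because the pruned positions are plain FP4 zero codes, they decode to zero through the same path and need no special handling.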
How to Use
The weight format is a standard NVFP4 checkpoint — any inference stack with compressed-tensors NVFP4 support loads it directly. The vLLM flags below cover Kimi-K2.5-specific runtime needs (custom model code, chat-template parsers).
vLLM (with FlashInfer NVFP4 MoE kernels)
The recipe below follows the upstream vLLM guide for Kimi-K2.5: https://recipes.vllm.ai/moonshotai/Kimi-K2.5. Refer to that page for advanced options (long context, prefix caching, structured output) and version-specific notes. Tested on 4× B200.
uv pip install -U vllm --torch-backend=auto
VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview \
--tensor-parallel-size 4 \
--mm-encoder-tp-mode data \
--trust-remote-code \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2
Flag notes:
- VLLM_USE_FLASHINFER_MOE_FP4=1: enables FlashInfer's NVFP4×NVFP4 grouped GEMM path for MoE experts.
Then query the OpenAI-compatible endpoint at http://localhost:8000/v1.
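A minimal client sketch using only the Python standard library is shown below. It assumes the server started above is listening on the default port 8000; the helper names `build_chat_request` and `chat`, the `max_tokens` value, and the timeout are illustrative choices, not part of the upstream recipe.

```python
import json
import urllib.request

MODEL_ID = "ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview"
BASE_URL = "http://localhost:8000/v1"  # assumes the default vLLM port


def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload for a text prompt."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = BASE_URL, timeout: float = 600.0) -> str:
    """POST the prompt to /chat/completions and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Summarize NVFP4 in one sentence."))` prints the model's answer. Any OpenAI-compatible client pointed at the same base URL should work equally well.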
Hardware
- Blackwell (SM100 / SM120, e.g. B200): native NVFP4×NVFP4 compute support (e.g., via FlashInfer or CUTLASS kernels)
- Tested on: 4× B200.
Evaluation — OpenLLM Leaderboard v1
All evaluations run with lm-evaluation-harness v0.4.11 against a vLLM 0.21.0 server on 4× B200 with VLLM_USE_FLASHINFER_MOE_FP4=1.
| Benchmark | Setup | Base (BF16) | SparseGPT + GPTQ one-shot | Ours | Δ vs base |
|---|---|---|---|---|---|
| ARC-Challenge | acc_norm, 25-shot | 74.23 | 62.54 | 68.43 | −5.80 |
| HellaSwag | acc_norm, 10-shot | 91.86 | 84.90 | 88.70 | −3.16 |
| MMLU | acc, 5-shot | 89.57 | 81.83 | 85.45 | −4.12 |
| TruthfulQA | mc2, 0-shot | 62.54 | 55.83 | 60.33 | −2.21 |
| Winogrande | acc, 5-shot | 82.48 | 79.95 | 83.35 | +0.87 |
| GSM8K | exact_match, 5-shot | 94.39 | 79.98 | 87.79 | −6.60 |
| Average | | 82.51 | 74.17 | 79.01 | −3.50 |
Recovery: 79.01 / 82.51 = 95.76% of base-model average accuracy.
SparseGPT + GPTQ one-shot baseline. Reference point at the same compression target: SparseGPT picks the paired-4:8 mask, GPTQ quantizes the masked weights to NVFP4 (89.89% recovery, no activation quant).
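The averages and recovery figures can be reproduced from the per-benchmark scores in the table. The short script below is a convenience sketch (the variable and function names are ours); like the card, it divides the averages after rounding them to two decimals.

```python
# (base BF16, SparseGPT + GPTQ one-shot, ours), copied from the table above.
SCORES = {
    "ARC-Challenge": (74.23, 62.54, 68.43),
    "HellaSwag": (91.86, 84.90, 88.70),
    "MMLU": (89.57, 81.83, 85.45),
    "TruthfulQA": (62.54, 55.83, 60.33),
    "Winogrande": (82.48, 79.95, 83.35),
    "GSM8K": (94.39, 79.98, 87.79),
}


def column_average(index: int) -> float:
    """Mean of one score column, rounded to two decimals as in the table."""
    return round(sum(row[index] for row in SCORES.values()) / len(SCORES), 2)


def recovery(compressed_avg: float, base_avg: float) -> float:
    """Compressed average as a percentage of the base average."""
    return round(100.0 * compressed_avg / base_avg, 2)


base_avg = column_average(0)
baseline_avg = column_average(1)
ours_avg = column_average(2)

print(f"averages: base={base_avg} baseline={baseline_avg} ours={ours_avg}")
print(f"recovery: ours={recovery(ours_avg, base_avg)}% "
      f"baseline={recovery(baseline_avg, base_avg)}%")
```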
Future Work
This preview ships the dense NVFP4 storage format with paired-4:8 zeros embedded as FP4 zero codes. That keeps the checkpoint compatible with current compressed-tensors and vLLM loaders out of the box, but leaves two opportunities on the table:
- Sparse NVFP4 storage: emit only the 2 nonzero pairs (4 values) per 8-element block plus the ordered-metadata tensor (ElementE) that CUTLASS / cuSPARSELt sparse-NVFP4 kernels expect. This cuts the on-disk and HBM footprint of the expert weights roughly in half. The paired-4:8 mask is structurally preserved in the current dense FP4 codes, so the conversion can run as an offline post-processing step on top of the released checkpoint.
- CUTLASS sparse NVFP4 kernels: wire up sparse GEMM kernels (SM100/SM120) for sparse tensor-core throughput at inference. FlashInfer's MoE FP4 path is the current default dense kernel, and we expect further throughput gains from sparse GEMM.
Both are tracked for the next release, not this preview.
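To illustrate why the offline conversion is possible, the sketch below compacts a dense paired-4:8 array into its kept values plus per-block pair indices, and expands it back. This is a toy layout only: the function names and the index format are hypothetical, and it does not reproduce the ElementE metadata layout that CUTLASS or cuSPARSELt kernels expect.

```python
import numpy as np


def compact_paired_48(dense):
    """Return (values, pair_idx) for a paired-4:8 array.

    values   : (..., blocks, 4) the two kept pairs of each 8-element block
    pair_idx : (..., blocks, 2) which of the 4 pairs were kept, ascending
    """
    w = np.asarray(dense)
    pairs = w.reshape(*w.shape[:-1], -1, 4, 2)
    nonzero = np.any(pairs != 0, axis=-1)
    if np.any(nonzero.sum(axis=-1) > 2):
        raise ValueError("input is not paired-4:8 sparse")
    # Stable sort lists nonzero pairs first; all-zero pairs pad blocks
    # that have fewer than 2 nonzero pairs.
    order = np.argsort(~nonzero, axis=-1, kind="stable")
    pair_idx = np.sort(order[..., :2], axis=-1)
    gather = np.repeat(pair_idx[..., None], 2, axis=-1)
    kept = np.take_along_axis(pairs, gather, axis=-2)
    return kept.reshape(*kept.shape[:-2], 4), pair_idx.astype(np.uint8)


def expand_paired_48(values, pair_idx) -> np.ndarray:
    """Inverse of compact_paired_48: rebuild the dense array with zeros."""
    values = np.asarray(values)
    kept = values.reshape(*values.shape[:-1], 2, 2)
    pairs = np.zeros((*kept.shape[:-2], 4, 2), dtype=values.dtype)
    scatter = np.repeat(np.asarray(pair_idx, dtype=np.intp)[..., None], 2, axis=-1)
    np.put_along_axis(pairs, scatter, kept, axis=-2)
    return pairs.reshape(*pairs.shape[:-3], -1)


dense = np.array([[0, 0, 3, 4, 0, 0, 7, 8]], dtype=np.float32)
values, pair_idx = compact_paired_48(dense)
print(values, pair_idx)
print(np.array_equal(expand_paired_48(values, pair_idx), dense))
```

The value payload shrinks from 8 to 4 entries per block, which is the source of the roughly 2x footprint reduction mentioned above; the index metadata adds a small overhead on top.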
Contact
For questions, or to open a discussion about this preview, please feel free to reach out to kwanhee.lee@postech.ac.kr.