Instructions for using ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview with libraries and local apps.

Libraries

How to use ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

SGLang

How to use ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview with Docker Model Runner:
```
docker model run hf.co/ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview
```

Kimi-K2.5-P48-NVFP4-W4A4-Preview

A preview W4A4 compressed checkpoint of moonshotai/Kimi-K2.5, combining paired-4:8 structured sparsity with NVFP4 weight + activation quantization on MoE expert weights — targeting NVIDIA Blackwell sparse tensor cores.

Preview status. This is an early checkpoint release for community evaluation. Benchmark numbers below may move before the final release.

Why this release? Most production LLM compression today picks a single axis, either pure quantization (NVFP4, FP8, INT4) or pure sparsity (2:4), and rarely combines both on frontier-scale MoE models. This release demonstrates that combining sparsity with W4A4 quantization is viable on a trillion-parameter MoE, with end-to-end NVFP4×NVFP4 grouped GEMMs via FlashInfer's MoE FP4 kernels. We hope it motivates more open work (e.g., kernels, recipes, and tooling) on combined sparse–quantized compression.

Base model: moonshotai/Kimi-K2.5 (MoE, 384 experts)
Compression scheme: NVFP4 W4A4 (W = NVFP4, A = NVFP4 dynamic-local) + paired-4:8 sparsity on non-shared experts
Effective precision: ~2 bits/weight on non-shared expert linears (paired-4:8 × NVFP4)
Inference path: FlashInfer NVFP4×NVFP4 grouped GEMMs (Blackwell B200/B300, SM100/SM120)
Checkpoint size: 595 GB (storage is still dense-NVFP4; sparse storage is future work — see Future Work)
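The "~2 bits/weight" figure can be reproduced with a short back-of-the-envelope calculation. The sketch below is illustrative only: it counts FP4 codes and FP8 group scales, and it ignores the per-tensor FP32 global scales and any sparse metadata, whose size depends on a storage format this preview does not ship.

```python
# Back-of-the-envelope storage cost per weight position on the expert linears.
FP4_BITS = 4            # one E2M1 code per weight
GROUP_SIZE = 16         # weights that share one group scale
SCALE_BITS = 8          # FP8 E4M3 group scale
KEEP_FRACTION = 4 / 8   # paired 4:8 keeps 4 of every 8 weights


def bits_per_weight(keep_fraction: float, include_scales: bool) -> float:
    """Average number of stored bits per original weight position."""
    bits = FP4_BITS * keep_fraction
    if include_scales:
        bits += SCALE_BITS / GROUP_SIZE
    return bits


# Dense NVFP4 storage as shipped in this preview (zeros stored explicitly).
dense_as_shipped = bits_per_weight(1.0, include_scales=True)
# Nonzero FP4 values only, which is where the "~2 bits/weight" figure comes from.
sparse_values_only = bits_per_weight(KEEP_FRACTION, include_scales=False)

print(dense_as_shipped, sparse_values_only)
```

The gap between the two numbers is why the checkpoint size does not yet reflect the sparsity.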

Compression Details

| Field | Value |
| --- | --- |
| Weight dtype | NVFP4 (E2M1) |
| Weight group size | 16 |
| Weight scale dtype | FP8 E4M3, per-group |
| Weight global scale | FP32, per-tensor |
| Activation dtype | NVFP4 (E2M1), `dynamic: "local"` |
| Activation group size | 16 |
| Activation group scale dtype | FP8 E4M3, per-group (computed per batch at runtime) |
| Activation global scale | FP32, per-tensor (stored in checkpoint, per expert linear) |
| Sparsity | Paired 4:8 (NVIDIA Blackwell) |
| Quantized + sparsified layers | Non-shared MoE experts (`gate_proj`, `up_proj`, `down_proj`) |
| Uncompressed | `lm_head`, `self_attn.`, `shared_experts.`, router, embeddings |
| Format | `compressed-tensors` (`NVFP4PackedCompressor`) |
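To make the group-wise scheme in the table concrete, here is a minimal NumPy sketch of "fake" NVFP4 quantization: each group of 16 values gets its own scale and is rounded to the nearest E2M1 magnitude. It is a simplification, not the kernel's actual procedure. The scales stay in float32 (the real format stores FP8 E4M3 group scales alongside an FP32 global scale), ties are broken toward the smaller magnitude, and the function name `fake_quantize_nvfp4` is made up for this example.

```python
import numpy as np

# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
GROUP_SIZE = 16


def fake_quantize_nvfp4(x):
    """Round x to E2M1 values with one scale per group of 16.

    Returns (dequantized values, per-group scales). Simplified sketch:
    scales are float32 here instead of FP8 E4M3 plus a global scale.
    """
    groups = np.asarray(x, dtype=np.float32).reshape(-1, GROUP_SIZE)
    amax = np.abs(groups).max(axis=1, keepdims=True)
    # Map the largest magnitude in each group onto 6.0, the top of the grid.
    scale = np.where(amax > 0, amax / E2M1_GRID[-1], np.float32(1.0))
    scaled = groups / scale
    # Index of the nearest grid magnitude for every element.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    quantized = np.sign(scaled) * E2M1_GRID[idx]
    return (quantized * scale).reshape(np.shape(x)), scale[:, 0]


x = np.zeros(32, dtype=np.float32)
x[0], x[1], x[2], x[16] = 6.0, 0.4, -2.6, 3.0
dequantized, scales = fake_quantize_nvfp4(x)
```

For activations with `dynamic: "local"`, the group scales are derived like this at runtime, while the global scale comes from the checkpoint.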

Paired-4:8 sparsity. Every 8 contiguous elements form 4 pairs of 2, and exactly 2 of the 4 pairs are kept while the other 2 are zeroed. The zeroed positions are stored as FP4 zero codes inside `weight_packed`, so the sparsity structure is implicit: there is no separate bitmask tensor in the file.
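The pattern can be illustrated with a few lines of NumPy. The sketch below applies a paired-4:8 mask and checks for it. Note that the selection rule used here (keep the two pairs with the largest summed magnitude) is only a stand-in for illustration and is not the mask-selection method used to produce this checkpoint. The check accepts blocks with more than two zero pairs, since a kept pair can itself quantize to zero.

```python
import numpy as np


def apply_paired_4_8(w: np.ndarray) -> np.ndarray:
    """Zero 2 of the 4 pairs in every 8-element block.

    ASSUMPTION: pairs are ranked by summed magnitude, purely for illustration.
    """
    blocks = w.reshape(-1, 4, 2)                    # (n_blocks, pairs, 2)
    pair_score = np.abs(blocks).sum(axis=2)         # (n_blocks, 4)
    keep = np.argsort(-pair_score, axis=1, kind="stable")[:, :2]
    mask = np.zeros(pair_score.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (blocks * mask[:, :, None]).reshape(w.shape)


def is_paired_4_8(w: np.ndarray) -> bool:
    """True if every 8-element block has at least 2 all-zero pairs."""
    blocks = w.reshape(-1, 4, 2)
    zero_pairs = (blocks == 0).all(axis=2).sum(axis=1)
    return bool((zero_pairs >= 2).all())


w = np.array([1.0, -5.0, 0.1, 0.2, 3.0, 3.0, 0.5, 0.0])
sparse_w = apply_paired_4_8(w)
```

Because zeros are ordinary FP4 codes in the dense tensor, a check like `is_paired_4_8` on the decoded weights is enough to recover the mask.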

Per-linear keys:

`weight_packed`: FP4 values, full K dimension
`weight_scale`: FP8 E4M3 per-16 group weight scales
`weight_global_scale`: FP32 per-tensor weight global scale
`input_global_scale`: FP32 per-tensor activation global scale
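As a rough guide to how these tensors fit together, the sketch below decodes packed E2M1 codes and applies the two levels of scaling. Only the E2M1 value table is standard. The nibble order (low nibble first) and the formula `value * weight_scale / weight_global_scale` are assumptions made for this example and should be checked against the `compressed-tensors` implementation. The group scales are assumed to be already converted from FP8 to float32.

```python
import numpy as np

E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def decode_e2m1(codes: np.ndarray) -> np.ndarray:
    """Map 4-bit codes (0..15) to floats: bit 3 is the sign, bits 0-2 the magnitude."""
    codes = codes.astype(np.uint8)
    sign = np.where((codes & 0b1000) != 0, -1.0, 1.0).astype(np.float32)
    return sign * E2M1_MAGNITUDES[codes & 0b0111]


def unpack_nibbles(packed: np.ndarray) -> np.ndarray:
    """Split each byte into two 4-bit codes. ASSUMPTION: low nibble comes first."""
    low = packed & 0x0F
    high = packed >> 4
    return np.stack([low, high], axis=-1).reshape(*packed.shape[:-1], -1)


def dequantize(packed, group_scale, global_scale, group_size=16):
    """ASSUMPTION: weight = fp4_value * group_scale / global_scale."""
    values = decode_e2m1(unpack_nibbles(packed))
    scales = np.repeat(group_scale.astype(np.float32), group_size, axis=-1)
    return values * scales / np.float32(global_scale)


packed = np.full((1, 8), 0x22, dtype=np.uint8)      # 16 codes, all equal to 1.0
weights = dequantize(packed, np.array([[2.0]]), global_scale=4.0)
```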

How to Use

The weight format is a standard NVFP4 checkpoint — any inference stack with compressed-tensors NVFP4 support loads it directly. The vLLM flags below cover Kimi-K2.5-specific runtime needs (custom model code, chat-template parsers).

vLLM (with FlashInfer NVFP4 MoE kernels)

The recipe below follows the upstream vLLM guide for Kimi-K2.5: https://recipes.vllm.ai/moonshotai/Kimi-K2.5. Refer to that page for advanced options (long context, prefix caching, structured output) and version-specific notes. Tested on 4× B200.

uv pip install -U vllm --torch-backend=auto

VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview \
    --tensor-parallel-size 4 \
    --mm-encoder-tp-mode data \
    --trust-remote-code \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2

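Flag notes:

VLLM_USE_FLASHINFER_MOE_FP4=1 (environment variable): enables FlashInfer's NVFP4×NVFP4 grouped GEMM path for MoE experts.
--tensor-parallel-size 4: shards the model across 4 GPUs, matching the tested 4× B200 setup.
--mm-encoder-tp-mode data: runs the multimodal encoder data-parallel across the tensor-parallel ranks.
--trust-remote-code: allows loading Kimi-K2.5's custom model code.
--tool-call-parser kimi_k2, --reasoning-parser kimi_k2: select the Kimi-K2 parsers for tool calls and reasoning output.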

Then query the OpenAI-compatible endpoint at http://localhost:8000/v1.
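A minimal client for that endpoint, using only the Python standard library, might look like the sketch below. It assumes the server was started with the command above on the default port 8000. The helper names `build_chat_request` and `chat` and the `max_tokens` default are choices made for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # default vLLM address, as used above
MODEL = "ISTA-DASLab/Kimi-K2.5-P48-NVFP4-W4A4-Preview"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to a running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# With the server running:
#   print(chat("Give me one sentence about structured sparsity."))
```

Any OpenAI-compatible client library can be pointed at the same base URL instead.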

Hardware

Blackwell (SM100 / SM120, e.g. B200): native NVFP4×NVFP4 compute support, with kernels available in FlashInfer and CUTLASS.
Tested on: 4× B200.

Evaluation — OpenLLM Leaderboard v1

All evaluations run with lm-evaluation-harness v0.4.11 against a vLLM 0.21.0 server on 4× B200 with VLLM_USE_FLASHINFER_MOE_FP4=1.

| Benchmark | Setup | Base (BF16) | SparseGPT + GPTQ one-shot | Ours | Δ vs base |
| --- | --- | --- | --- | --- | --- |
| ARC-Challenge | acc_norm, 25-shot | 74.23 | 62.54 | 68.43 | −5.80 |
| HellaSwag | acc_norm, 10-shot | 91.86 | 84.90 | 88.70 | −3.16 |
| MMLU | acc, 5-shot | 89.57 | 81.83 | 85.45 | −4.12 |
| TruthfulQA | mc2, 0-shot | 62.54 | 55.83 | 60.33 | −2.21 |
| Winogrande | acc, 5-shot | 82.48 | 79.95 | 83.35 | +0.87 |
| GSM8K | exact_match, 5-shot | 94.39 | 79.98 | 87.79 | −6.60 |
| Average | | 82.51 | 74.17 | 79.01 | −3.50 |

Recovery: 79.01 / 82.51 = 95.76% of base-model average accuracy.

SparseGPT + GPTQ one-shot baseline. A reference point at the same weight compression target: SparseGPT selects the paired-4:8 mask and GPTQ quantizes the remaining weights to NVFP4, with no activation quantization. It recovers 74.17 / 82.51 = 89.89% of the base-model average.
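The averages and recovery figures follow directly from the per-benchmark scores. The short script below recomputes them, rounding the averages to two decimals before taking the ratio, as the numbers above do.

```python
# (base BF16, SparseGPT + GPTQ one-shot, ours), copied from the table above.
SCORES = {
    "ARC-Challenge": (74.23, 62.54, 68.43),
    "HellaSwag": (91.86, 84.90, 88.70),
    "MMLU": (89.57, 81.83, 85.45),
    "TruthfulQA": (62.54, 55.83, 60.33),
    "Winogrande": (82.48, 79.95, 83.35),
    "GSM8K": (94.39, 79.98, 87.79),
}


def column_average(column: int) -> float:
    """Mean over the six benchmarks, rounded to two decimals like the table."""
    return round(sum(row[column] for row in SCORES.values()) / len(SCORES), 2)


base_avg, one_shot_avg, ours_avg = (column_average(i) for i in range(3))
recovery_ours = round(100 * ours_avg / base_avg, 2)
recovery_one_shot = round(100 * one_shot_avg / base_avg, 2)

print(base_avg, one_shot_avg, ours_avg, recovery_ours, recovery_one_shot)
```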

Future Work

This preview ships the dense NVFP4 storage format with paired-4:8 zeros embedded as FP4 zero codes. That keeps the checkpoint compatible with current compressed-tensors and vLLM loaders out of the box, but leaves two opportunities on the table:

Sparse NVFP4 storage: emit only the 2 nonzero pairs (4 values) per 8-element block plus the ordered-metadata tensor (ElementE) that CUTLASS / cuSPARSELt sparse-NVFP4 kernels expect. This cuts the on-disk and HBM footprint of the expert weights roughly in half. The paired-4:8 mask is structurally preserved in the current dense FP4 codes, so the conversion can run as an offline post-processing step on top of the released checkpoint.
CUTLASS sparse NVFP4 kernels: wire up sparse GEMM kernels (SM100/SM120) for sparse tensor-core throughput at inference. FlashInfer's MoE FP4 path is the current default dense kernel, and we expect additional throughput from sparse GEMM.

Both are tracked for the next release, not this preview.
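To give a feel for the offline conversion described above, the sketch below compacts a dense paired-4:8 array into its kept pairs plus a small index tensor, and then restores it. The index layout here is hypothetical and chosen for readability. It is not the ElementE metadata format that CUTLASS or cuSPARSELt kernels expect, and it operates on decoded float values rather than packed FP4 codes.

```python
import numpy as np


def compress_paired_4_8(w: np.ndarray):
    """Return (kept pairs, pair indices) for an array that is already paired-4:8.

    HYPOTHETICAL layout: values has shape (n_blocks, 2, 2), indices (n_blocks, 2).
    """
    blocks = w.reshape(-1, 4, 2)
    nonzero_pair = (blocks != 0).any(axis=2)              # (n_blocks, 4)
    if (nonzero_pair.sum(axis=1) > 2).any():
        raise ValueError("input is not paired-4:8 sparse")
    # Stable sort moves nonzero pairs to the front; pad with zero pairs if needed.
    indices = np.argsort(~nonzero_pair, axis=1, kind="stable")[:, :2]
    indices = np.sort(indices, axis=1).astype(np.uint8)   # ascending pair order
    rows = np.arange(blocks.shape[0])[:, None]
    return blocks[rows, indices], indices


def decompress_paired_4_8(values: np.ndarray, indices: np.ndarray) -> np.ndarray:
    """Scatter the kept pairs back into a dense, flat array."""
    n_blocks = values.shape[0]
    blocks = np.zeros((n_blocks, 4, 2), dtype=values.dtype)
    rows = np.arange(n_blocks)[:, None]
    blocks[rows, indices] = values
    return blocks.reshape(-1)


dense = np.array([1.0, -5.0, 0.0, 0.0, 3.0, 3.0, 0.0, 0.0,
                  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0])
values, indices = compress_paired_4_8(dense)
restored = decompress_paired_4_8(values, indices)
```

The compacted `values` tensor holds half as many elements as the dense input, which is the source of the roughly 2× footprint reduction mentioned above.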