Mistral-Small-3.2-24B-Instruct-2506 — NVFP4A16 Quantization

This is an NVFP4A16 (NVIDIA FP4 weight-only) quantization of Mistral-Small-3.2-24B-Instruct-2506, a 24B-parameter multimodal model with Pixtral-style vision capabilities. It was produced with LLM Compressor v0.10.0.2 using the NVFP4A16 scheme in one-shot post-training quantization (PTQ).

NVFP4A16 stores weights in NVIDIA's native FP4 format (4-bit floating-point) with BF16 activations. On Blackwell GPUs (RTX 5090, RTX PRO 6000, B200), this format leverages dedicated FP4 tensor cores that should deliver ~1.3× faster inference compared to INT4 weight-only formats (GPTQ, AWQ, AutoRound) that must dequantize via Marlin/CUTLASS before matrix multiply.

Model Details

Property	Value
Base Model	mistralai/Mistral-Small-3.2-24B-Instruct-2506
Quantization Method	LLM Compressor oneshot (NVFP4A16)
Weight Precision	FP4 (group_size=16, symmetric, tensor_group strategy)
Activation Precision	BF16 (weight-only quantization)
Quantization Library	LLM Compressor 0.10.0.2
Packing Format	nvfp4-pack-quantized (compressed-tensors)
Architecture	Mistral3ForConditionalGeneration
LM Layers	40 MistralDecoder layers
Hidden Size	5120
Intermediate Size	32768
Attention Heads	32 (query), 8 (key/value, GQA)
Head Dimension	128
Vocabulary Size	131,072 (LM head dimension; Tekken tokenizer contains 150,000 regular + 1,000 special tokens, but only 131,072 are used by the model)
Context Window	131,072 tokens
Vision Encoder	Pixtral (24 layers, hidden_size=1024, patch_size=14)
Vision Projector	patch_merge (spatial_merge_size=2)
Quantized Components	Text decoder Linear layers only
Preserved in BF16	Vision tower (all 24 layers), multi-modal projector, lm_head, embed_tokens, layer norms

Quantization Configuration

{
  "config_groups": {
    "group_0": {
      "format": "nvfp4-pack-quantized",
      "input_activations": null,
      "output_activations": null,
      "targets": ["Linear"],
      "weights": {
        "num_bits": 4,
        "type": "float",
        "group_size": 16,
        "strategy": "tensor_group",
        "symmetric": true,
        "observer": "memoryless_minmax",
        "scale_dtype": "torch.float8_e4m3fn",
        "dynamic": false
      }
    }
  },
  "format": "nvfp4-pack-quantized",
  "quant_method": "compressed-tensors",
  "quantization_status": "compressed"
}

Key parameters:

scheme=NVFP4A16: Weight-only FP4 quantization — no activation quantization. Critical for VLMs: activation quantization destroys vision quality.
group_size=16: Per-16-element scaling groups (NVFP4 native granularity)
strategy=tensor_group: Two-level scaling, with one FP8 scale per 16-element group plus one global scale per tensor (NVIDIA's recommended strategy for FP4)
scale_dtype=float8_e4m3fn: Scales stored in FP8 E4M3 for hardware efficiency
symmetric=true: Symmetric quantization (no zero-point)
observer=memoryless_minmax: Min/max calibration without temporal smoothing
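
To make group_size=16 and symmetric=true concrete, here is a minimal pure-Python sketch of group-wise FP4 fake quantization. It is an illustration only, not LLM Compressor's implementation: it keeps each group scale in full precision instead of rounding it to FP8 E4M3, and it omits the per-tensor global scale. The function names are hypothetical.

```python
# E2M1 (FP4) representable magnitudes; the sign is stored in a separate bit.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_MAX = 6.0
GROUP_SIZE = 16


def quantize_group(values):
    """Quantize one group of weights to signed FP4 values plus one scale."""
    amax = max(abs(v) for v in values)
    # Symmetric scaling: map the largest magnitude onto the FP4 maximum.
    scale = amax / FP4_MAX if amax > 0 else 1.0
    codes = []
    for v in values:
        magnitude = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(magnitude if v >= 0 else -magnitude)
    return codes, scale


def fake_quantize(weights, group_size=GROUP_SIZE):
    """Quantize then dequantize a flat weight list, group by group."""
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be divisible by group_size")
    out = []
    for start in range(0, len(weights), group_size):
        codes, scale = quantize_group(weights[start:start + group_size])
        out.extend(c * scale for c in codes)
    return out


weights = [0.01 * (i - 16) for i in range(32)]  # two groups of 16
restored = fake_quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Because the widest gap on the FP4 grid is between 4 and 6, the round-trip error within a group is at most one sixth of that group's largest magnitude, which is why small groups help when a tensor contains outliers.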

Recipe

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head, 're:.*vision_tower.*', 're:.*multi_modal_projector.*']
      scheme: NVFP4A16
      bypass_divisibility_checks: false

Calibration Dataset

The model was calibrated using a domain-specific mix of text-only datasets (no images/video, to avoid torchvision import errors). Short samples were concatenated into chunks of ≥2,048 Tekken tokens each:

Source	Domain	HF ID
Magicoder-Evol-Instruct	Coding (instruction + response pairs)	`ise-uiuc/Magicoder-Evol-Instruct-110K`
xLAM Function Calling	Tool/function calling	`Salesforce/xlam-function-calling-60k`
Hermes Function Calling v1	Tool calling (ShareGPT format)	`NousResearch/hermes-function-calling-v1`
Pile-10k	General reasoning and knowledge	`NeelNanda/pile-10k`
Domain instructions	Coding + tool calling (local file, 5× duplicated for weight)	Local: `imatrix_mistral_domain_calib_5x.txt`
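
The chunking step described above can be sketched as follows. This is a hedged illustration rather than the script actually used: `pack_into_chunks` and `whitespace_token_count` are hypothetical names, the whitespace counter is only a stand-in for the Tekken tokenizer, and dropping the trailing partial chunk is an assumption.

```python
def pack_into_chunks(samples, count_tokens, min_tokens=2048):
    """Concatenate samples until each chunk holds at least min_tokens tokens."""
    chunks, current, current_tokens = [], [], 0
    for text in samples:
        current.append(text)
        current_tokens += count_tokens(text)
        if current_tokens >= min_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
    # A trailing partial chunk is dropped because it is below the minimum.
    return chunks


def whitespace_token_count(text):
    """Stand-in tokenizer: counts whitespace-separated words, not Tekken tokens."""
    return len(text.split())


samples = ["word " * 900, "word " * 900, "word " * 900, "word " * 100]
chunks = pack_into_chunks(samples, whitespace_token_count)
```

With a real tokenizer, summing per-sample counts only approximates the token count of the joined text, so a production script would likely re-tokenize each finished chunk to confirm its length.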

Quality Benchmarks

All benchmarks use wikitext-2-raw-v1 (test split) — the standard dataset for quantization quality evaluation, matching the methodology of llama.cpp, AutoAWQ, GPTQ, and the academic literature.

WikiText-2 Perplexity (ctx=512)

Measured via vLLM completions API with echo=True + logprobs. 642 non-overlapping chunks, ~301K tokens scored. Methodology matches llama.cpp ./perplexity -c 512.

Model	PPL	Δ vs Base
Base BF16	7.0332	—
NVFP4A16 (this model)	7.3302	+4.22%
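
For reference, perplexity is the exponential of the negative mean log-probability over the scored tokens. The sketch below shows that formula and reproduces the Δ column from the two reported values; the helper names are hypothetical, and collecting the log-probabilities from the vLLM API is left out.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one scored token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_change_percent(quantized, base):
    """Percentage change of the quantized score relative to the base score."""
    return (quantized - base) / base * 100.0


# Toy check: four tokens, each with probability 0.25, give perplexity 4.
toy_ppl = perplexity([math.log(0.25)] * 4)

# Reproduces the delta column from the two reported perplexities.
delta = relative_change_percent(7.3302, 7.0332)
```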

KL Divergence vs BF16 (Static / Prefill)

KL divergence measures how much the output probability distribution has shifted from the base model. Lower is better; 0 = identical.

Methodology matches llama.cpp --kl-divergence: wikitext-2-raw-v1, ctx=512, score only the second half of each chunk (positions [256–511]), which ensures every scored token has at least 256 tokens of left context. 32 chunks, ~8,150 tokens scored. KLD direction: KL(P_base ‖ P_quant) — "how well does the quantized model approximate the base?"

Metric	Value
Mean KLD	0.0935
Median KLD	0.0368
99th %ile KLD	0.790
95th %ile KLD	0.364
90th %ile KLD	—
Δp RMS	2.87%
Same top-p	91.1%

Same top-p = 91.1% means that the quantized and base models agree on the single most likely token at 91.1% of scored positions.
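
The two headline metrics can be sketched from raw logits as below. This is a simplified illustration that assumes full-vocabulary logits are available for both models, which is not how the API-based measurement here was taken; the function names and the toy 3-token vocabulary are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit list."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def kl_divergence(base_logits, quant_logits):
    """KL(P_base || P_quant) in nats for a single token position."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def score_chunk(base_rows, quant_rows, first_scored=256):
    """Return per-position KLD and top-1 agreement for positions >= first_scored."""
    klds, matches = [], []
    for base, quant in list(zip(base_rows, quant_rows))[first_scored:]:
        klds.append(kl_divergence(base, quant))
        matches.append(base.index(max(base)) == quant.index(max(quant)))
    return klds, matches


# Toy 3-token vocabulary, 4 positions, scoring only the last two.
base_rows = [[2.0, 1.0, 0.0]] * 4
quant_rows = [[2.0, 1.0, 0.0]] * 3 + [[0.5, 1.0, 0.0]]
klds, matches = score_chunk(base_rows, quant_rows, first_scored=2)
mean_kld = sum(klds) / len(klds)
same_top = sum(matches) / len(matches)
```

Scoring only positions from `first_scored` onward mirrors the second-half rule: with ctx=512 and the default of 256, every scored token has at least 256 tokens of left context.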

Note on API-based KLD: These measurements use vLLM's top-20 logprobs per token (API limit), not full-vocab logits. This underestimates absolute KLD by ~10–15% compared to llama.cpp's full-vocab computation (see mlx-kld analysis). The corrected full-vocab KLD is estimated at ~0.10–0.11. Relative comparisons between quantization methods remain valid regardless.

KL Divergence vs BF16 (Generation / Autoregressive)

KLD measured per generation step during autoregressive decoding. Step-0 = first generated token (comparable to static/prefill KLD). Later steps compound — small per-token divergences accumulate as the two models diverge onto different token trajectories. This is normal and expected for 4-bit quantization.

Prompt Len	Gen Len	Step-0 KLD
128	128	0.140
512	128	0.088
1024	128	0.118

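Step-0 KLD (0.088–0.140) is consistent with the static prefill KLD (0.0935). The 128-token prompt scores highest (0.140), which fits the expectation that less context makes the first generated token more sensitive to quantization noise. The trend is not monotonic, though: 1,024-token prompts (0.118) score worse than 512-token prompts (0.088), so part of the spread is probably sampling noise. The 512-token result of 0.088 is in the same range as the static KLD of an AutoRound W4A16 quantization (0.0746).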

vLLM Throughput (RTX 5090, 32 GB)

Single Request

Metric	Value
Aggregate Throughput	103.3 tok/s
Total tokens generated	12,851 (20 requests × up to 1,024 tokens)
Average Latency	6.22 s
Min/Max Latency	0.39 s / 9.70 s
Per-request Throughput	7.7–105.6 tok/s
Success Rate	20/20 (100%)
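
As a consistency check on the table above, aggregate throughput is total generated tokens divided by wall-clock time. The sketch assumes the 20 requests ran back to back, so wall-clock time is approximated as request count times average latency; that assumption is not stated in the benchmark itself.

```python
def aggregate_throughput(total_tokens, wall_seconds):
    """Tokens generated per second of wall-clock time."""
    return total_tokens / wall_seconds


# Assumption: the 20 requests ran back to back, so the wall-clock time is
# roughly the request count times the reported average latency.
requests = 20
average_latency_s = 6.22
total_tokens = 12_851

wall_seconds = requests * average_latency_s
throughput = aggregate_throughput(total_tokens, wall_seconds)
```

Under that assumption the result rounds to the reported 103.3 tok/s.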

32 Concurrent Requests

Metric	Value
Aggregate Throughput	2,612.7 tok/s
Total tokens generated	355,785 (640 requests × up to 1,024 tokens)
Average Latency	6.54 s (end-to-end per request, including queuing)
Min/Max Latency	0.31 s / 12.40 s
Per-request Throughput	4.8–92.8 tok/s
Success Rate	640/640 (100%)

Each request generated up to 1,024 tokens. Average latency includes queuing time under concurrent load.

Hardware Requirements

GPU VRAM	Recommended Config
80 GB or more (H100 80 GB, RTX PRO 6000 96 GB, H200 141 GB)	`gpu_memory_utilization: 0.95`, `max_model_len: 131072`, KV: fp8_e4m3 optional
32 GB (RTX 5090)	`gpu_memory_utilization: 0.96`, `max_model_len: 131072`, `kv_cache_dtype: fp8_e4m3`, `max_num_batched_tokens: 8192`

Minimum: 1× GPU with ≥24 GB VRAM (with fp8 KV cache and reduced context).
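
A rough way to see why the FP8 KV cache matters on 32 GB cards is to estimate the cache size for one full-length sequence from the decoder shape in Model Details (40 layers, 8 KV heads, head dimension 128). This is a back-of-the-envelope sketch with hypothetical helper names; it ignores vLLM's paging overhead, activation memory, and the vision encoder.

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Keys and values (factor 2) for every layer and KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(tokens, bytes_per_value):
    """KV cache size in GiB using the decoder shape from Model Details."""
    per_token = kv_cache_bytes_per_token(
        layers=40, kv_heads=8, head_dim=128, bytes_per_value=bytes_per_value
    )
    return tokens * per_token / 2**30


bf16_full_context = kv_cache_gib(131_072, bytes_per_value=2)
fp8_full_context = kv_cache_gib(131_072, bytes_per_value=1)
```

By this estimate a single 131,072-token sequence needs about 20 GiB of cache in BF16 and about 10 GiB in FP8. Next to roughly 15 GB of weights, only the FP8 figure leaves headroom on a 32 GB card.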

NVFP4A16 requires Blackwell or later (sm100+; B200 is sm100, RTX 5090 and RTX PRO 6000 are sm120), with driver 580+. On Ampere/Ada GPUs (sm80–sm89), vLLM will fall back to software dequantization, and NVFP4A16's speed advantage disappears.

Usage with vLLM

Tested with: vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 (vLLM v0.19.2rc1.dev134)

Docker Deployment

docker run -d --name vllm-mistral-nvfp4 \
  --runtime=nvidia --gpus '"device=0"' \
  -p 8000:8000 \
  -v /path/to/model:/workspace/model \
  --ipc=host --shm-size=16g \
  vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 \
  /workspace/model \
  --host 0.0.0.0 --port 8000 \
  --tokenizer-mode mistral \
  --config-format hf \
  --load-format auto \
  --quantization compressed-tensors \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.96 \
  --max-model-len 131072 \
  --disable-custom-all-reduce \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image":4}' \
  --tool-call-parser mistral \
  --enable-auto-tool-choice

Example vLLM Configuration (YAML)

model: /workspace/model
quantization: compressed-tensors
tokenizer_mode: mistral
config_format: hf
load_format: auto
dtype: bfloat16
gpu_memory_utilization: 0.85
max_model_len: 32768
max_num_batched_tokens: 8192
kv_cache_dtype: fp8_e4m3
enable_prefix_caching: true
enable_chunked_prefill: true
limit_mm_per_prompt:
  image: 4
enable_auto_tool_choice: true
tool_call_parser: mistral
disable_custom_all_reduce: true

Inference Test

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/workspace/model","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":10}'
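
The same request can be issued from Python using only the standard library. This is a minimal sketch that mirrors the curl command; the helper names are hypothetical, and the response parsing assumes the usual OpenAI-compatible chat completion shape.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="/workspace/model", max_tokens=10):
    """Build the same POST request as the curl command, without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is 2+2? One word.")
# print(send(request))  # uncomment once the vLLM server is running
```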

Notes

Tokenizer: Tekken (Mistral-specific)

This model uses the Tekken tokenizer (tekken.json), not a standard HF tokenizer. You must use --tokenizer-mode mistral with vLLM to ensure correct tokenization. Using auto or hf mode produces garbled output.

Vision: Image Size Limit

The tekken.json in this repository has max_image_size set to 1024 (down from the original 1540). Images with any dimension exceeding 1024px are proportionally downscaled before vision encoding.
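
Proportional downscaling to a 1024 px limit can be sketched as below. The function name is hypothetical and the nearest-pixel rounding is an assumption; the actual preprocessor may round differently, for example to multiples of the patch size.

```python
def fit_within(width, height, max_side=1024):
    """Scale (width, height) down proportionally so neither side exceeds max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the limit, left unchanged
    ratio = max_side / longest
    # Rounding to the nearest pixel is an assumption of this sketch.
    return max(1, round(width * ratio)), max(1, round(height * ratio))


resized = fit_within(2048, 1536)
unchanged = fit_within(800, 600)
```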

Single-Shard Structure

This model is stored as a single model.safetensors file (~15 GB), containing all 40 quantized LM decoder layers + vision tower (BF16) + projector (BF16) + lm_head (BF16) + embed_tokens (BF16). No sharding.

Files in This Repository

File	Size	Description
`model.safetensors`	~15.0 GB	All model weights (quantized LM + BF16 vision/projector/lm_head)
`model.safetensors.index.json`	—	Weight map (single-file index)
`config.json`	—	Model configuration with quantization_config
`params.json`	—	Mistral-native parameter specification
`quantization_config.json`	—	Compressed-tensors NVFP4A16 parameters
`recipe.yaml`	—	LLM Compressor quantization recipe
`tekken.json`	~15 MB	Tekken tokenizer (Mistral-specific)
`generation_config.json`	—	Generation parameters
`preprocessor_config.json`	—	Image preprocessor configuration
`processor_config.json`	—	Processor configuration

License

This quantization is released under the Apache 2.0 License, following the base model's license.

The base model mistralai/Mistral-Small-3.2-24B-Instruct-2506 is licensed under Apache 2.0.

Citation

If you use this model in your research, please cite:

@misc{mistral-small-3.2-24b-nvfp4a16,
  title = {Mistral-Small-3.2-24B-Instruct-2506 NVFP4A16 Quantization},
  author = {Gratex International},
  year = {2026},
  howpublished = {\url{https://huggingface.co/gratex/Mistral-Small-3.2-24B-Instruct-2506-NVFP4A16}},
  note = {Quantized with LLM Compressor 0.10.0.2}
}

Acknowledgments

This quantization was produced using hardware provided by Gratex International, a.s.

Original Model: mistralai/Mistral-Small-3.2-24B-Instruct-2506
Quantization Tool: LLM Compressor
Deployment Engine: vLLM
