Mistral-Small-3.2-24B-Instruct-2506 - NVFP4A16 Quantization
This is an NVFP4A16 (NVIDIA FP4 weight-only) quantization of Mistral-Small-3.2-24B-Instruct-2506, a 24B-parameter multimodal model with Pixtral-style vision capabilities. It was produced with LLM Compressor v0.10.0.2 using the NVFP4A16 scheme and one-shot post-training quantization (PTQ).
NVFP4A16 stores weights in NVIDIA's native FP4 format (4-bit floating-point) with BF16 activations. On Blackwell GPUs (RTX 5090, RTX PRO 6000, B200), this format leverages dedicated FP4 tensor cores that should deliver ~1.3× faster inference compared to INT4 weight-only formats (GPTQ, AWQ, AutoRound) that must dequantize via Marlin/CUTLASS before matrix multiply.
Model Details
| Property | Value |
|---|---|
| Base Model | mistralai/Mistral-Small-3.2-24B-Instruct-2506 |
| Quantization Method | LLM Compressor oneshot (NVFP4A16) |
| Weight Precision | FP4 (group_size=16, symmetric, tensor_group strategy) |
| Activation Precision | BF16 (weight-only quantization) |
| Quantization Library | LLM Compressor 0.10.0.2 |
| Packing Format | nvfp4-pack-quantized (compressed-tensors) |
| Architecture | Mistral3ForConditionalGeneration |
| LM Layers | 40 MistralDecoder layers |
| Hidden Size | 5120 |
| Intermediate Size | 32768 |
| Attention Heads | 32 (query), 8 (key/value, GQA) |
| Head Dimension | 128 |
| Vocabulary Size | 131,072 (LM head dimension; Tekken tokenizer contains 150,000 regular + 1,000 special tokens, but only 131,072 are used by the model) |
| Context Window | 131,072 tokens |
| Vision Encoder | Pixtral (24 layers, hidden_size=1024, patch_size=14) |
| Vision Projector | patch_merge (spatial_merge_size=2) |
| Quantized Components | Text decoder Linear layers only |
| Preserved in BF16 | Vision tower (all 24 layers), multi-modal projector, lm_head, embed_tokens, layer norms |
Quantization Configuration
{
  "config_groups": {
    "group_0": {
      "format": "nvfp4-pack-quantized",
      "input_activations": null,
      "output_activations": null,
      "targets": ["Linear"],
      "weights": {
        "num_bits": 4,
        "type": "float",
        "group_size": 16,
        "strategy": "tensor_group",
        "symmetric": true,
        "observer": "memoryless_minmax",
        "scale_dtype": "torch.float8_e4m3fn",
        "dynamic": false
      }
    }
  },
  "format": "nvfp4-pack-quantized",
  "quant_method": "compressed-tensors",
  "quantization_status": "compressed"
}
Key parameters:
- scheme=NVFP4A16: Weight-only FP4 quantization with no activation quantization. Critical for VLMs: activation quantization destroys vision quality.
- group_size=16: Per-16-element scaling groups (NVFP4 native granularity)
- strategy=tensor_group: One scale per tensor group (NVIDIA's recommended strategy for FP4)
- scale_dtype=float8_e4m3fn: Scales stored in FP8 E4M3 for hardware efficiency
- symmetric=true: Symmetric quantization (no zero-point)
- observer=memoryless_minmax: Min/max calibration without temporal smoothing
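The parameters above can be illustrated with a small quantize/dequantize round trip. This is a minimal sketch rather than the LLM Compressor implementation: it uses the eight representable FP4 (E2M1) magnitudes and one scale per 16-element group, but keeps each scale as a plain Python float. The real format stores scales in FP8 E4M3 together with a per-tensor global scale, which this sketch omits. The function names are hypothetical.

```python
# Simplified FP4 (E2M1) weight quantization with 16-element groups.
# Assumption: scales stay as Python floats; FP8 scale storage and the
# per-tensor global scale of the real format are not modeled here.

FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes
GROUP_SIZE = 16


def quantize_group(group):
    """Return (scale, codes) for one group; codes are signed FP4 grid values."""
    amax = max(abs(w) for w in group)
    scale = amax / 6.0 if amax > 0 else 1.0  # largest magnitude maps to 6.0
    codes = []
    for w in group:
        target = abs(w) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if w >= 0 else -nearest)
    return scale, codes


def fake_quantize(weights):
    """Quantize then dequantize a flat list whose length is a multiple of 16."""
    if len(weights) % GROUP_SIZE != 0:
        raise ValueError("length must be a multiple of the group size")
    restored = []
    for start in range(0, len(weights), GROUP_SIZE):
        scale, codes = quantize_group(weights[start:start + GROUP_SIZE])
        restored.extend(code * scale for code in codes)
    return restored


weights = [(i - 8) / 10.0 for i in range(16)]  # one group: -0.8 .. 0.7
restored = fake_quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Because the FP4 grid is coarsest between 4.0 and 6.0, the worst-case error within a group is one grid unit times the group scale, which is why small groups (16 elements) help contain outliers.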
Recipe
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head, 're:.*vision_tower.*', 're:.*multi_modal_projector.*']
      scheme: NVFP4A16
      bypass_divisibility_checks: false
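To see which modules the ignore list in the recipe above would exclude, the following sketch applies a simplified version of the matching rule: entries prefixed with re: are treated as regular expressions matched from the start of the module name, and all other entries must equal the module name exactly. This is an assumption about the matching behavior, not the library's code, and the module names are hypothetical examples.

```python
import re

IGNORE = ["lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"]


def is_ignored(module_name, ignore=IGNORE):
    """Simplified matcher: 're:' entries are regexes, others are exact names."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif module_name == entry:
            return True
    return False


# Hypothetical module names, for illustration only.
modules = [
    "language_model.model.layers.0.self_attn.q_proj",
    "vision_tower.transformer.layers.0.attention.q_proj",
    "multi_modal_projector.linear_1",
    "lm_head",
]
quantized = [name for name in modules if not is_ignored(name)]
```

With these example names only the text decoder projection remains in quantized, which matches the "Quantized Components" row in the table above.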
Calibration Dataset
The model was calibrated using a domain-specific mix of text-only datasets (no images/video, to avoid torchvision import errors). Short samples were concatenated into chunks of ≥2,048 Tekken tokens each:
| Source | Domain | HF ID |
|---|---|---|
| Magicoder-Evol-Instruct | Coding (instruction + response pairs) | ise-uiuc/Magicoder-Evol-Instruct-110K |
| xLAM Function Calling | Tool/function calling | Salesforce/xlam-function-calling-60k |
| Hermes Function Calling v1 | Tool calling (ShareGPT format) | NousResearch/hermes-function-calling-v1 |
| Pile-10k | General reasoning and knowledge | NeelNanda/pile-10k |
| Domain instructions | Coding + tool calling (local file, 5Γ duplicated for weight) | Local: imatrix_mistral_domain_calib_5x.txt |
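The concatenation step described above can be sketched as follows. This is an illustration under stated assumptions, not the script that was actually used: the token counter is a whitespace split standing in for the Tekken tokenizer, samples are joined with a blank line, and a trailing partial chunk is dropped. The function name and defaults are hypothetical.

```python
def concat_into_chunks(samples, min_tokens=2048, count_tokens=None):
    """Greedily join samples until each chunk holds at least min_tokens tokens."""
    if count_tokens is None:
        # Stand-in for a real tokenizer: count whitespace-separated words.
        count_tokens = lambda text: len(text.split())
    chunks, current, current_tokens = [], [], 0
    for text in samples:
        current.append(text)
        current_tokens += count_tokens(text)
        if current_tokens >= min_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
    # Assumption: a trailing chunk below the threshold is discarded.
    return chunks


demo_chunks = concat_into_chunks(["a b c"] * 5, min_tokens=6)
```

Passing a real tokenizer's length function as count_tokens would make the threshold apply to actual token counts instead of words.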
Quality Benchmarks
All benchmarks use wikitext-2-raw-v1 (test split), the standard dataset for quantization quality evaluation, matching the methodology of llama.cpp, AutoAWQ, GPTQ, and the academic literature.
WikiText-2 Perplexity (ctx=512)
Measured via vLLM completions API with echo=True + logprobs. 642 non-overlapping chunks, ~301K tokens scored. Methodology matches llama.cpp ./perplexity -c 512.
| Model | PPL | Δ vs Base |
|---|---|---|
| Base BF16 | 7.0332 | - |
| NVFP4A16 (this model) | 7.3302 | +4.22% |
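For reference, perplexity is the exponential of the negative mean token log-probability, and the Δ column is the relative change against the base model. The sketch below shows both calculations; the helper names are hypothetical, and the only data used are the two PPL values from the table.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_change_percent(quant_ppl, base_ppl):
    """Relative PPL increase of the quantized model, in percent."""
    return (quant_ppl / base_ppl - 1.0) * 100.0


# Values taken from the table above.
delta = relative_change_percent(7.3302, 7.0332)
```

Rounded to two decimals, delta reproduces the +4.22% reported in the table.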
KL Divergence vs BF16 (Static / Prefill)
KL divergence measures how much the output probability distribution has shifted from the base model. Lower is better; 0 = identical.
Methodology matches llama.cpp --kl-divergence: wikitext-2-raw-v1, ctx=512, score only the second half of each chunk (positions [256–511]), which ensures every scored token has at least 256 tokens of left context. 32 chunks, ~8,150 tokens scored. KLD direction: KL(P_base ‖ P_quant), i.e. "how well does the quantized model approximate the base?"
| Metric | Value |
|---|---|
| Mean KLD | 0.0935 |
| Median KLD | 0.0368 |
| 99th %ile KLD | 0.790 |
| 95th %ile KLD | 0.364 |
| 90th %ile KLD | - |
| Δp RMS | 2.87% |
| Same top-p | 91.1% |
Same top-p = 91.1% means both quantized and base models agree on the most likely token 91.1% of the time.
Note on API-based KLD: These measurements use vLLM's top-20 logprobs per token (API limit), not full-vocab logits. This underestimates absolute KLD by ~10–15% compared to llama.cpp's full-vocab computation (see mlx-kld analysis). The corrected full-vocab KLD is estimated at ~0.10–0.11. Relative comparisons between quantization methods remain valid regardless.
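A per-position KLD and top-token agreement check over truncated logprobs can be sketched as below. This is a minimal illustration, not the measurement script: inputs are dictionaries mapping tokens to natural-log probabilities (as a top-k logprobs API would return them), and tokens that appear in the base model's list but not in the quantized model's list are assigned a floor log-probability. That floor value is an assumption; the handling used for the reported numbers is not stated here.

```python
import math


def kl_divergence(base_logprobs, quant_logprobs, floor_logprob=-30.0):
    """KL(P_base || P_quant) summed over the tokens present in base_logprobs."""
    total = 0.0
    for token, base_lp in base_logprobs.items():
        # Assumption: tokens missing from the quantized top-k get a floor value.
        quant_lp = quant_logprobs.get(token, floor_logprob)
        total += math.exp(base_lp) * (base_lp - quant_lp)
    return total


def same_top(base_logprobs, quant_logprobs):
    """True when both models rank the same token first."""
    base_best = max(base_logprobs, key=base_logprobs.get)
    quant_best = max(quant_logprobs, key=quant_logprobs.get)
    return base_best == quant_best


# Toy two-token distributions, for illustration only.
base = {"a": math.log(0.5), "b": math.log(0.5)}
quant = {"a": math.log(0.25), "b": math.log(0.75)}
toy_kld = kl_divergence(base, quant)
```

Averaging kl_divergence over all scored positions gives the mean KLD, and the fraction of positions where same_top is true gives the agreement rate.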
KL Divergence vs BF16 (Generation / Autoregressive)
KLD measured per generation step during autoregressive decoding. Step-0 = first generated token (comparable to static/prefill KLD). Later steps compound: small per-token divergences accumulate as the two models diverge onto different token trajectories. This is normal and expected for 4-bit quantization.
| Prompt Len | Gen Len | Step-0 KLD |
|---|---|---|
| 128 | 128 | 0.140 |
| 512 | 128 | 0.088 |
| 1024 | 128 | 0.118 |
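Step-0 KLD (0.088–0.140) is in the same range as the static prefill KLD (0.0935) and varies with prompt length, though not monotonically: the 1,024-token prompt (0.118) scores higher than the 512-token prompt (0.088). The shortest prompt shows the highest value (0.140), which is consistent with less context making the prediction more sensitive to quantization noise. At 512-token prompts, step-0 KLD is 0.088, close to the static KLD of the AutoRound W4A16 quantization (0.0746).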
vLLM Throughput (RTX 5090, 32 GB)
Single Request
| Metric | Value |
|---|---|
| Aggregate Throughput | 103.3 tok/s |
| Total tokens generated | 12,851 (20 requests × up to 1,024 tokens) |
| Average Latency | 6.22 s |
| Min/Max Latency | 0.39 s / 9.70 s |
| Per-request Throughput | 7.7–105.6 tok/s |
| Success Rate | 20/20 (100%) |
32 Concurrent Requests
| Metric | Value |
|---|---|
| Aggregate Throughput | 2,612.7 tok/s |
| Total tokens generated | 355,785 (640 requests × up to 1,024 tokens) |
| Average Latency | 6.54 s (end-to-end per request, including queuing) |
| Min/Max Latency | 0.31 s / 12.40 s |
| Per-request Throughput | 4.8–92.8 tok/s |
| Success Rate | 640/640 (100%) |
Each request generated up to 1,024 tokens. Average latency includes queuing time under concurrent load.
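The relationship between the figures in the throughput tables can be sketched with two small helpers: aggregate throughput divides all generated tokens by wall-clock time, while per-request throughput divides each request's tokens by its own latency. The helper names are hypothetical, and the wall-clock time for the single-request run is an assumption (20 requests executed back to back at the reported average latency), not a measured value.

```python
def aggregate_throughput(total_tokens, wall_seconds):
    """Tokens per second across the whole benchmark run."""
    return total_tokens / wall_seconds


def per_request_throughput(records):
    """Tokens per second for each (tokens, latency_seconds) record."""
    return [tokens / seconds for tokens, seconds in records]


# Assumption: sequential execution, so wall time = requests x average latency.
sequential_wall = 20 * 6.22
single_stream = aggregate_throughput(12851, sequential_wall)
```

Under that assumption single_stream comes out at about 103.3 tok/s, in line with the single-request aggregate figure above.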
Hardware Requirements
| GPU VRAM | Recommended Config |
|---|---|
| 80 GB or more (RTX PRO 6000 96 GB, H100 80 GB, H200 141 GB) | gpu_memory_utilization: 0.95, max_model_len: 131072, KV: fp8_e4m3 optional |
| 32 GB (RTX 5090) | gpu_memory_utilization: 0.96, max_model_len: 131072, kv_cache_dtype: fp8_e4m3, max_num_batched_tokens: 8192 |
Minimum: 1× GPU with ≥24 GB VRAM (with fp8 KV cache and reduced context).
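The VRAM recommendations above can be sanity-checked with a rough KV cache estimate based on the Model Details table (40 layers, 8 key/value heads, head dimension 128). This is a back-of-the-envelope sketch that ignores activations, image tokens, FP8 scaling metadata, and allocator overhead; the function name is hypothetical.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers=40, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to hold keys and values for the given number of tokens."""
    # Factor 2 accounts for storing both the key and the value tensor.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


bf16_gib = kv_cache_bytes(131072) / GIB                    # 20.0
fp8_gib = kv_cache_bytes(131072, bytes_per_value=1) / GIB  # 10.0
```

A full 131,072-token context therefore needs roughly 20 GiB of KV cache in BF16 or 10 GiB in FP8, on top of the ~15 GB of weights, which is why the 32 GB configuration pairs the full context length with an FP8 KV cache.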
NVFP4A16 requires Blackwell or later (sm100/sm120 and newer): RTX 5090, RTX PRO 6000, B200 (with driver 580+). On Ampere/Ada GPUs (sm80–sm89), vLLM will fall back to software dequantization, so NVFP4A16's speed advantage disappears.
Usage with vLLM
Tested with: vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 (vLLM v0.19.2rc1.dev134)
Docker Deployment
docker run -d --name vllm-mistral-nvfp4 \
--runtime=nvidia --gpus '"device=0"' \
-p 8000:8000 \
-v /path/to/model:/workspace/model \
--ipc=host --shm-size=16g \
vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 \
/workspace/model \
--host 0.0.0.0 --port 8000 \
--tokenizer-mode mistral \
--config-format hf \
--load-format auto \
--quantization compressed-tensors \
--dtype bfloat16 \
--gpu-memory-utilization 0.96 \
--max-model-len 131072 \
--disable-custom-all-reduce \
--trust-remote-code \
--limit-mm-per-prompt '{"image":4}' \
--tool-call-parser mistral \
--enable-auto-tool-choice
Example vLLM Configuration (YAML)
model: /workspace/model
quantization: compressed-tensors
tokenizer_mode: mistral
config_format: hf
load_format: auto
dtype: bfloat16
gpu_memory_utilization: 0.85
max_model_len: 32768
max_num_batched_tokens: 8192
kv_cache_dtype: fp8_e4m3
enable_prefix_caching: true
enable_chunked_prefill: true
limit_mm_per_prompt:
  image: 4
enable_auto_tool_choice: true
tool_call_parser: mistral
disable_custom_all_reduce: true
Inference Test
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"/workspace/model","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":10}'
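The same request can be issued from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names are hypothetical, and the response parsing assumes the usual OpenAI-compatible chat completion shape (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(base_url="http://localhost:8000", model="/workspace/model"):
    """Build the POST request equivalent to the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "What is 2+2? One word."}],
        "max_tokens": 10,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling send(build_chat_request()) against a running server should return a short answer; building the request separately makes it easy to inspect the payload without a server.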
Notes
Tokenizer: Tekken (Mistral-specific)
This model uses the Tekken tokenizer (tekken.json), not a standard HF tokenizer. You must use --tokenizer-mode mistral with vLLM to ensure correct tokenization. Using auto or hf mode produces garbled output.
Vision: Image Size Limit
The tekken.json in this repository has max_image_size set to 1024 (down from the original 1540). Images with any dimension exceeding 1024px are proportionally downscaled before vision encoding.
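The effect of the 1024px limit on an input image can be approximated as below. This is a sketch of proportional downscaling only; the rounding rule is an assumption, and the actual preprocessor may additionally align sizes to the 14px patch grid. The function name is hypothetical.

```python
def downscale_to_limit(width, height, max_size=1024):
    """Scale (width, height) so the longest side is at most max_size."""
    longest = max(width, height)
    if longest <= max_size:
        return width, height  # already within the limit, left unchanged
    ratio = max_size / longest
    # Assumption: round to the nearest pixel and never go below 1.
    return max(1, round(width * ratio)), max(1, round(height * ratio))


resized = downscale_to_limit(2048, 1024)
```

For example, a 2048x1024 image would be reduced to 1024x512 before vision encoding, while an 800x600 image would pass through unchanged.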
Single-Shard Structure
This model is stored as a single model.safetensors file (~15 GB), containing all 40 quantized LM decoder layers + vision tower (BF16) + projector (BF16) + lm_head (BF16) + embed_tokens (BF16). No sharding.
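To check which tensors in the file are stored in which dtype, the safetensors header can be read without loading any weights: the file starts with an 8-byte little-endian length, followed by a JSON header that maps tensor names to dtype, shape, and data offsets. The sketch below uses only the standard library; the helper names are hypothetical.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the tensor table from a .safetensors file without reading weights."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def dtype_counts(header):
    """Count tensors per dtype string, e.g. to separate BF16 from packed weights."""
    counts = {}
    for info in header.values():
        counts[info["dtype"]] = counts.get(info["dtype"], 0) + 1
    return counts
```

Running dtype_counts(read_safetensors_header("model.safetensors")) gives a quick overview of how many tensors were kept in BF16 versus stored in packed form.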
Files in This Repository
| File | Size | Description |
|---|---|---|
| model.safetensors | ~15.0 GB | All model weights (quantized LM + BF16 vision/projector/lm_head) |
| model.safetensors.index.json | - | Weight map (single-file index) |
| config.json | - | Model configuration with quantization_config |
| params.json | - | Mistral-native parameter specification |
| quantization_config.json | - | Compressed-tensors NVFP4A16 parameters |
| recipe.yaml | - | LLM Compressor quantization recipe |
| tekken.json | ~15 MB | Tekken tokenizer (Mistral-specific) |
| generation_config.json | - | Generation parameters |
| preprocessor_config.json | - | Image preprocessor configuration |
| processor_config.json | - | Processor configuration |
License
This quantization is released under the Apache 2.0 License, following the base model's license.
The base model mistralai/Mistral-Small-3.2-24B-Instruct-2506 is licensed under Apache 2.0.
Citation
If you use this model in your research, please cite:
@misc{mistral-small-3.2-24b-nvfp4a16,
title = {Mistral-Small-3.2-24B-Instruct-2506 NVFP4A16 Quantization},
author = {Gratex International},
year = {2026},
howpublished = {\url{https://huggingface.co/gratex/Mistral-Small-3.2-24B-Instruct-2506-NVFP4A16}},
note = {Quantized with LLM Compressor 0.10.0.2}
}
Acknowledgments
This quantization was produced using hardware provided by Gratex International, a.s.
- Original Model: mistralai/Mistral-Small-3.2-24B-Instruct-2506
- Quantization Tool: LLM Compressor
- Deployment Engine: vLLM