Mistral-Small-3.2-24B-Instruct-2506: NVFP4A16 Quantization

This is an NVFP4A16 (NVIDIA FP4 weight-only) quantization of Mistral-Small-3.2-24B-Instruct-2506, a 24B-parameter multimodal model with Pixtral-style vision capabilities. It was produced with LLM Compressor v0.10.0.2 using one-shot post-training quantization (PTQ) and the NVFP4A16 scheme.

NVFP4A16 stores weights in NVIDIA's native FP4 format (4-bit floating point) and keeps activations in BF16. On Blackwell GPUs (RTX 5090, RTX PRO 6000, B200), this format leverages dedicated FP4 tensor cores that should deliver ~1.3× faster inference compared to INT4 weight-only formats (GPTQ, AWQ, AutoRound), which must dequantize via Marlin/CUTLASS before the matrix multiply.

Model Details

Property Value
Base Model mistralai/Mistral-Small-3.2-24B-Instruct-2506
Quantization Method LLM Compressor oneshot (NVFP4A16)
Weight Precision FP4 (group_size=16, symmetric, tensor_group strategy)
Activation Precision BF16 (weight-only quantization)
Quantization Library LLM Compressor 0.10.0.2
Packing Format nvfp4-pack-quantized (compressed-tensors)
Architecture Mistral3ForConditionalGeneration
LM Layers 40 MistralDecoder layers
Hidden Size 5120
Intermediate Size 32768
Attention Heads 32 (query), 8 (key/value, GQA)
Head Dimension 128
Vocabulary Size 131,072 (LM head dimension; Tekken tokenizer contains 150,000 regular + 1,000 special tokens, but only 131,072 are used by the model)
Context Window 131,072 tokens
Vision Encoder Pixtral (24 layers, hidden_size=1024, patch_size=14)
Vision Projector patch_merge (spatial_merge_size=2)
Quantized Components Text decoder Linear layers only
Preserved in BF16 Vision tower (all 24 layers), multi-modal projector, lm_head, embed_tokens, layer norms

Quantization Configuration

{
  "config_groups": {
    "group_0": {
      "format": "nvfp4-pack-quantized",
      "input_activations": null,
      "output_activations": null,
      "targets": ["Linear"],
      "weights": {
        "num_bits": 4,
        "type": "float",
        "group_size": 16,
        "strategy": "tensor_group",
        "symmetric": true,
        "observer": "memoryless_minmax",
        "scale_dtype": "torch.float8_e4m3fn",
        "dynamic": false
      }
    }
  },
  "format": "nvfp4-pack-quantized",
  "quant_method": "compressed-tensors",
  "quantization_status": "compressed"
}

Key parameters:

  • scheme=NVFP4A16: Weight-only FP4 quantization with no activation quantization. Critical for VLMs: activation quantization destroys vision quality.
  • group_size=16: Per-16-element scaling groups (NVFP4 native granularity)
  • strategy=tensor_group: One scale per 16-element group, combined with a per-tensor global scale (NVIDIA's recommended strategy for FP4)
  • scale_dtype=float8_e4m3fn: Scales stored in FP8 E4M3 for hardware efficiency
  • symmetric=true: Symmetric quantization (no zero-point)
  • observer=memoryless_minmax: Min/max calibration without temporal smoothing
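
To make the parameters above concrete, here is a minimal sketch of symmetric FP4 (E2M1) group quantization with group_size=16. It is a simplification and not LLM Compressor's implementation: the scale is kept in full precision rather than FP8 E4M3, the per-tensor global scale is omitted, and the function names are hypothetical.

```python
# Representable magnitudes of FP4 E2M1 (the sign is handled separately).
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16  # NVFP4 scaling granularity


def quantize_group(weights):
    """Fake-quantize one group: symmetric, one scale, no zero-point."""
    absmax = max(abs(w) for w in weights)
    # Map the largest magnitude in the group onto the largest FP4 value (6.0).
    # Assumption: the scale stays in full precision (real NVFP4 stores FP8 E4M3).
    scale = absmax / 6.0 if absmax > 0 else 1.0
    dequantized = []
    for w in weights:
        target = abs(w) / scale
        nearest = min(FP4_E2M1_VALUES, key=lambda v: abs(v - target))
        dequantized.append((nearest if w >= 0 else -nearest) * scale)
    return scale, dequantized


def quantize_row(row):
    """Split a weight row into groups of 16 and quantize each independently."""
    assert len(row) % GROUP_SIZE == 0, "row length must be divisible by 16"
    out = []
    for start in range(0, len(row), GROUP_SIZE):
        _, deq = quantize_group(row[start:start + GROUP_SIZE])
        out.extend(deq)
    return out
```

Because each group of 16 weights gets its own scale, a single outlier only degrades the resolution of its own group rather than the whole tensor.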

Recipe

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head, 're:.*vision_tower.*', 're:.*multi_modal_projector.*']
      scheme: NVFP4A16
      bypass_divisibility_checks: false
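
The ignore list in the recipe decides which Linear modules stay in BF16. The sketch below approximates that matching in plain Python, assuming entries prefixed with re: are regular expressions matched from the start of the module name and other entries must equal the name exactly. The real matcher in compressed-tensors has additional rules (for example class-name matching), and the module names used in the usage note are hypothetical.

```python
import re

# Ignore list copied from the recipe above.
IGNORE = ["lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if a module name matches any ignore entry (simplified)."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Regex entry: strip the 're:' prefix and match from the start.
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            # Plain entry: exact name match only in this sketch.
            return True
    return False
```

Under these assumptions, is_ignored("model.vision_tower.transformer.layers.0.attention.q_proj") is True, while is_ignored("model.language_model.layers.0.self_attn.q_proj") is False, so only the text decoder projections are quantized.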

Calibration Dataset

The model was calibrated using a domain-specific mix of text-only datasets (no images/video, to avoid torchvision import errors). Short samples were concatenated into chunks of ≥2,048 Tekken tokens each:

Source Domain HF ID
Magicoder-Evol-Instruct Coding (instruction + response pairs) ise-uiuc/Magicoder-Evol-Instruct-110K
xLAM Function Calling Tool/function calling Salesforce/xlam-function-calling-60k
Hermes Function Calling v1 Tool calling (ShareGPT format) NousResearch/hermes-function-calling-v1
Pile-10k General reasoning and knowledge NeelNanda/pile-10k
Domain instructions Coding + tool calling (local file, duplicated 5× for weight) Local: imatrix_mistral_domain_calib_5x.txt
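
The concatenation step described above can be sketched as follows. This is an illustrative assumption about how short samples were packed, not the script actually used; pack_into_chunks is a hypothetical helper, and the handling of a trailing remainder is a choice made for this sketch.

```python
def pack_into_chunks(tokenized_samples, min_tokens=2048):
    """Concatenate consecutive samples until each chunk has >= min_tokens."""
    chunks, current = [], []
    for tokens in tokenized_samples:
        current.extend(tokens)
        if len(current) >= min_tokens:
            chunks.append(current)
            current = []
    # A trailing remainder shorter than min_tokens is dropped in this sketch.
    return chunks
```

Each element of tokenized_samples is a list of token IDs, for example the output of the Tekken tokenizer for one dataset row.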

Quality Benchmarks

All benchmarks use wikitext-2-raw-v1 (test split), the standard dataset for quantization quality evaluation, matching the methodology of llama.cpp, AutoAWQ, GPTQ, and the academic literature.

WikiText-2 Perplexity (ctx=512)

Measured via vLLM completions API with echo=True + logprobs. 642 non-overlapping chunks, ~301K tokens scored. Methodology matches llama.cpp ./perplexity -c 512.

Model PPL Δ vs Base
Base BF16 7.0332 -
NVFP4A16 (this model) 7.3302 +4.22%
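
For reference, perplexity and the relative delta in the table can be computed as below. This is a minimal sketch that assumes per-token natural-log probabilities have already been collected (for example from the logprobs returned with echo=True); the function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over all scored tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_delta_percent(base_ppl, quant_ppl):
    """Relative perplexity increase of the quantized model, in percent."""
    return (quant_ppl - base_ppl) / base_ppl * 100.0


# Reproduces the table entry: 7.0332 -> 7.3302 is about +4.22%.
delta = relative_delta_percent(7.0332, 7.3302)
```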

KL Divergence vs BF16 (Static / Prefill)

KL divergence measures how much the output probability distribution has shifted from the base model. Lower is better; 0 = identical.

Methodology matches llama.cpp --kl-divergence: wikitext-2-raw-v1, ctx=512, score only the second half of each chunk (positions [256–511]), which ensures every scored token has at least 256 tokens of left context. 32 chunks, ~8,150 tokens scored. KLD direction: KL(P_base ‖ P_quant), i.e. "how well does the quantized model approximate the base?"

Metric Value
Mean KLD 0.0935
Median KLD 0.0368
99th %ile KLD 0.790
95th %ile KLD 0.364
90th %ile KLD -
Δp RMS 2.87%
Same top-p 91.1%

Same top-p = 91.1% means both quantized and base models agree on the most likely token 91.1% of the time.

Note on API-based KLD: These measurements use vLLM's top-20 logprobs per token (API limit), not full-vocab logits. This underestimates absolute KLD by ~10–15% compared to llama.cpp's full-vocab computation (see mlx-kld analysis). The corrected full-vocab KLD is estimated at ~0.10–0.11. Relative comparisons between quantization methods remain valid regardless.
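
A per-position KLD from truncated top-k logprobs can be sketched as follows. This is not the evaluation script used for the numbers above; in particular, the floor value assigned to tokens missing from the quantized model's top-k is an arbitrary assumption of this sketch, and the function names are hypothetical.

```python
import math


def topk_kld(base_logprobs, quant_logprobs, floor_logprob=-20.0):
    """Approximate KL(P_base || P_quant) at one position from top-k logprobs.

    Both arguments map token -> natural-log probability. The sum runs only
    over the base model's top-k tokens, so the tail of the distribution is
    not scored.
    """
    kld = 0.0
    for token, lp_base in base_logprobs.items():
        # Assumption: tokens absent from the quantized top-k get a floor value.
        lp_quant = quant_logprobs.get(token, floor_logprob)
        kld += math.exp(lp_base) * (lp_base - lp_quant)
    return kld


def same_top_token(base_logprobs, quant_logprobs):
    """True if both models rank the same token first."""
    return (max(base_logprobs, key=base_logprobs.get)
            == max(quant_logprobs, key=quant_logprobs.get))
```

Averaging topk_kld over all scored positions gives a mean KLD, and averaging same_top_token gives an agreement rate comparable to the "Same top-p" row.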

KL Divergence vs BF16 (Generation / Autoregressive)

KLD measured per generation step during autoregressive decoding. Step-0 = first generated token (comparable to static/prefill KLD). Later steps compound: small per-token divergences accumulate as the two models diverge onto different token trajectories. This is normal and expected for 4-bit quantization.

Prompt Len Gen Len Step-0 KLD
128 128 0.140
512 128 0.088
1024 128 0.118

Step-0 KLD (0.088–0.140) is consistent with the static prefill KLD (0.0935), with variance driven by prompt length. Shorter prompts provide less context and are therefore more sensitive to quantization noise. At 512-token prompts, KLD drops to 0.088, close to the static KLD of the AutoRound W4A16 quantization (0.0746).

vLLM Throughput (RTX 5090, 32 GB)

Single Request

Metric Value
Aggregate Throughput 103.3 tok/s
Total tokens generated 12,851 (20 requests × up to 1,024 tokens)
Average Latency 6.22 s
Min/Max Latency 0.39 s / 9.70 s
Per-request Throughput 7.7–105.6 tok/s
Success Rate 20/20 (100%)

32 Concurrent Requests

Metric Value
Aggregate Throughput 2,612.7 tok/s
Total tokens generated 355,785 (640 requests × up to 1,024 tokens)
Average Latency 6.54 s (end-to-end per request, including queuing)
Min/Max Latency 0.31 s / 12.40 s
Per-request Throughput 4.8–92.8 tok/s
Success Rate 640/640 (100%)

Each request generated up to 1,024 tokens. Average latency includes queuing time under concurrent load.
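
The reported figures can be cross-checked with simple arithmetic. The sketch below assumes the single-request throughput is the average tokens per request divided by the average latency, and that the concurrent aggregate throughput is total tokens divided by wall-clock time; both are interpretations of the tables, not a description of the benchmark script.

```python
def average_request_throughput(total_tokens, num_requests, avg_latency_s):
    """Average tokens per request divided by average latency (tok/s)."""
    return (total_tokens / num_requests) / avg_latency_s


def implied_wall_time_s(total_tokens, aggregate_tok_per_s):
    """Wall-clock duration implied by an aggregate throughput figure."""
    return total_tokens / aggregate_tok_per_s


# Single request: 12,851 tokens over 20 requests at 6.22 s average latency.
single = average_request_throughput(12_851, 20, 6.22)  # about 103.3 tok/s

# 32 concurrent: 355,785 tokens at 2,612.7 tok/s aggregate.
wall = implied_wall_time_s(355_785, 2_612.7)  # about 136.2 s
```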

Hardware Requirements

GPU VRAM Recommended Config
96 GB (RTX PRO 6000, H100, H200) gpu_memory_utilization: 0.95, max_model_len: 131072, kv_cache_dtype: fp8_e4m3 (optional)
32 GB (RTX 5090) gpu_memory_utilization: 0.96, max_model_len: 131072, kv_cache_dtype: fp8_e4m3, max_num_batched_tokens: 8192

Minimum: 1× GPU with ≥24 GB VRAM (with fp8 KV cache and reduced context).
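
A back-of-the-envelope KV cache estimate helps explain these recommendations. The sketch below uses the architecture values from the Model Details table (40 layers, 8 KV heads, head dimension 128) and ignores activation memory, the vision tower, FP8 scale overhead, and vLLM's own allocator behaviour, so treat the results as rough lower bounds.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_tokens, bytes_per_element,
                   layers=40, kv_heads=8, head_dim=128):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # The factor 2 covers one key vector and one value vector
    # per layer and per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_element * num_tokens


# Full 131,072-token context for a single sequence.
bf16_gib = kv_cache_bytes(131_072, 2) / GIB  # 20.0 GiB with a BF16 cache
fp8_gib = kv_cache_bytes(131_072, 1) / GIB   # 10.0 GiB with an FP8 cache
```

With roughly 15 GB of weights, a 20 GiB BF16 cache does not fit on a 32 GB card at full context, whereas a 10 GiB FP8 cache does, which is consistent with kv_cache_dtype: fp8_e4m3 in the RTX 5090 row.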

NVFP4A16 requires Blackwell or later: RTX 5090 and RTX PRO 6000 (sm120), B200 (sm100), with driver 580+. On Ampere/Ada GPUs (sm80–sm89), vLLM will fall back to software dequantization, and NVFP4A16's speed advantage disappears.

Usage with vLLM

Tested with: vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 (vLLM v0.19.2rc1.dev134)

Docker Deployment

docker run -d --name vllm-mistral-nvfp4 \
  --runtime=nvidia --gpus '"device=0"' \
  -p 8000:8000 \
  -v /path/to/model:/workspace/model \
  --ipc=host --shm-size=16g \
  vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 \
  /workspace/model \
  --host 0.0.0.0 --port 8000 \
  --tokenizer-mode mistral \
  --config-format hf \
  --load-format auto \
  --quantization compressed-tensors \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.96 \
  --max-model-len 131072 \
  --disable-custom-all-reduce \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image":4}' \
  --tool-call-parser mistral \
  --enable-auto-tool-choice

Example vLLM Configuration (YAML)

model: /workspace/model
quantization: compressed-tensors
tokenizer_mode: mistral
config_format: hf
load_format: auto
dtype: bfloat16
gpu_memory_utilization: 0.85
max_model_len: 32768
max_num_batched_tokens: 8192
kv_cache_dtype: fp8_e4m3
enable_prefix_caching: true
enable_chunked_prefill: true
limit_mm_per_prompt:
  image: 4
enable_auto_tool_choice: true
tool_call_parser: mistral
disable_custom_all_reduce: true

Inference Test

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/workspace/model","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":10}'
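
The same request can be issued from Python using only the standard library. This is a minimal sketch that assumes the server from the Docker example is listening on localhost:8000 and that the response follows the OpenAI-compatible chat completions shape; build_chat_request and ask are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="/workspace/model", max_tokens=10):
    """Build the POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling ask("What is 2+2? One word.") against a running server should return a short answer; nothing is sent until ask is called.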

Notes

Tokenizer: Tekken (Mistral-specific)

This model uses the Tekken tokenizer (tekken.json), not a standard HF tokenizer. You must use --tokenizer-mode mistral with vLLM to ensure correct tokenization. Using auto or hf mode produces garbled output.

Vision: Image Size Limit

The tekken.json in this repository has max_image_size set to 1024 (down from the original 1540). Images with any dimension exceeding 1024px are proportionally downscaled before vision encoding.
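
The proportional downscaling can be approximated as follows. This is a sketch of the arithmetic only: the exact rounding and any patch-size alignment applied by the actual preprocessor are not reproduced here, and downscale_to_limit is a hypothetical helper.

```python
def downscale_to_limit(width, height, max_image_size=1024):
    """Shrink an image so its longest side is at most max_image_size."""
    longest = max(width, height)
    if longest <= max_image_size:
        return width, height  # already within the limit, left unchanged
    ratio = max_image_size / longest
    # Assumption: round to the nearest pixel and never go below 1.
    return max(1, round(width * ratio)), max(1, round(height * ratio))
```

For example, a 2048×1024 image would be reduced to 1024×512, while an 800×600 image passes through unchanged.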

Single-Shard Structure

This model is stored as a single model.safetensors file (~15 GB), containing all 40 quantized LM decoder layers + vision tower (BF16) + projector (BF16) + lm_head (BF16) + embed_tokens (BF16). No sharding.

Files in This Repository

File Size Description
model.safetensors ~15.0 GB All model weights (quantized LM + BF16 vision/projector/lm_head)
model.safetensors.index.json - Weight map (single-file index)
config.json - Model configuration with quantization_config
params.json - Mistral-native parameter specification
quantization_config.json - Compressed-tensors NVFP4A16 parameters
recipe.yaml - LLM Compressor quantization recipe
tekken.json ~15 MB Tekken tokenizer (Mistral-specific)
generation_config.json - Generation parameters
preprocessor_config.json - Image preprocessor configuration
processor_config.json - Processor configuration

License

This quantization is released under the Apache 2.0 License, following the base model's license.

The base model mistralai/Mistral-Small-3.2-24B-Instruct-2506 is licensed under Apache 2.0.

Citation

If you use this model in your research, please cite:

@misc{mistral-small-3.2-24b-nvfp4a16,
  title = {Mistral-Small-3.2-24B-Instruct-2506 NVFP4A16 Quantization},
  author = {Gratex International},
  year = {2026},
  howpublished = {\url{https://huggingface.co/gratex/Mistral-Small-3.2-24B-Instruct-2506-NVFP4A16}},
  note = {Quantized with LLM Compressor 0.10.0.2}
}

Acknowledgments

This quantization was produced using hardware provided by Gratex International, a.s.


Original Model: mistralai/Mistral-Small-3.2-24B-Instruct-2506
Quantization Tool: LLM Compressor
Deployment Engine: vLLM
