Mistral-Small-3.2-24B-Instruct-2506: AutoRound W4A16 Quantization
This is a W4A16 (4-bit weight-only) quantization of Mistral-Small-3.2-24B-Instruct-2506, a 24B parameter multimodal model with Pixtral-style vision capabilities, quantized using AutoRound v0.10.2 with SignRound optimization (1000 iterations).
AutoRound W4A16 stores weights as INT4 and keeps activations in BF16. Weights are packed in GPTQ-style quantization groups (group_size=128) and dequantized to BF16 before the matrix multiply. This format is widely supported across GPU architectures (Ampere, Ada, Blackwell) and inference engines (vLLM, SGLang, TensorRT-LLM).
Model Details
| Property | Value |
|---|---|
| Base Model | mistralai/Mistral-Small-3.2-24B-Instruct-2506 |
| Quantization Method | AutoRound (SignRound, W4A16) |
| Weight Precision | INT4 (group_size=128, symmetric) |
| Activation Precision | BF16 (weight-only quantization) |
| Quantization Library | AutoRound 0.10.2 |
| Packing Format | auto_round:auto_gptq |
| Architecture | Mistral3ForConditionalGeneration |
| LM Layers | 40 MistralDecoder layers |
| Hidden Size | 5120 |
| Intermediate Size | 32768 |
| Attention Heads | 32 (query), 8 (key/value, GQA) |
| Head Dimension | 128 |
| Vocabulary Size | 131,072 (LM head dimension; Tekken tokenizer contains 150,000 regular + 1,000 special tokens, but only 131,072 are used by the model) |
| Context Window | 131,072 tokens |
| Vision Encoder | Pixtral (24 layers, hidden_size=1024, patch_size=14) |
| Vision Projector | patch_merge (spatial_merge_size=2) |
| Quantized Components | Text decoder Linear layers only |
| Preserved in BF16 | Vision tower (all 24 layers), multi-modal projector, lm_head, embed_tokens, layer norms |
Quantization Configuration
{
"bits": 4,
"data_type": "int",
"group_size": 128,
"sym": true,
"batch_size": 4,
"iters": 1000,
"low_gpu_mem_usage": true,
"nsamples": 512,
"block_name_to_quantize": "model.language_model.layers",
"quant_method": "auto-round",
"packing_format": "auto_round:auto_gptq"
}
Key parameters:
- iters=1000: Maximum SignRound optimization steps per block (~5× slower than the default of 200, best accuracy)
- nsamples=512: 512 calibration samples (4× the default of 128)
- sym=true: Symmetric quantization (no zero-point)
- group_size=128: Per-128-element scaling groups
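To make the `sym` and `group_size` parameters concrete, here is a minimal NumPy sketch of symmetric group-wise INT4 quantization. It uses plain round-to-nearest, not SignRound: AutoRound additionally optimizes the rounding decisions over `iters` steps, which this sketch does not attempt. The scale convention (`max_abs / 7`) and the function names are assumptions for illustration, not AutoRound's internals.

```python
import numpy as np

def quantize_sym_int4(w: np.ndarray, group_size: int = 128):
    """Round-to-nearest symmetric INT4 quantization, one scale per group."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0, "in_features must be a multiple of group_size"
    groups = w.reshape(out_features, in_features // group_size, group_size)
    # Symmetric: no zero-point; the scale maps the largest magnitude onto +/-7.
    max_abs = np.abs(groups).max(axis=-1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    """Recover an approximation of the original weights before the matmul."""
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)  # toy weight matrix
q, scale = quantize_sym_int4(w)
w_hat = dequantize(q, scale, w.shape)
max_err = float(np.abs(w - w_hat).max())
```

With 256 input features and `group_size=128`, each row gets two scales. The reconstruction error of any element is bounded by half of its group's scale, which is why smaller groups trade extra scale storage for accuracy.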
Calibration Dataset
The 512 calibration samples were built from a domain-specific mix of text-only datasets (no images/video, to avoid torchvision import errors in the llm-compressor environment). Short samples were concatenated into chunks of ≥2,048 Tekken tokens each:
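The concatenation step can be sketched as follows. This is an illustration under stated assumptions, not the script used for this model: `build_chunks` and `whitespace_count` are hypothetical names, the separator is a guess, and a whitespace split stands in for the Tekken tokenizer so the example runs without extra dependencies.

```python
from typing import Callable, Iterable, List

def build_chunks(samples: Iterable[str],
                 count_tokens: Callable[[str], int],
                 min_tokens: int = 2048,
                 separator: str = "\n\n") -> List[str]:
    """Greedily concatenate samples until each chunk reaches min_tokens."""
    chunks: List[str] = []
    buffer: List[str] = []
    buffered = 0
    for text in samples:
        buffer.append(text)
        buffered += count_tokens(text)
        if buffered >= min_tokens:
            chunks.append(separator.join(buffer))
            buffer, buffered = [], 0
    # A trailing buffer shorter than min_tokens is dropped.
    return chunks

def whitespace_count(text: str) -> int:
    """Stand-in token counter (assumption); a real run would count Tekken tokens."""
    return len(text.split())

samples = ["word " * 900] * 7          # seven short samples of 900 "tokens" each
chunks = build_chunks(samples, whitespace_count)
```

Here three 900-token samples are needed to cross the 2,048-token threshold, so seven samples yield two chunks and one leftover sample is discarded.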
| Source | Domain | HF ID |
|---|---|---|
| Magicoder-Evol-Instruct | Coding (instruction + response pairs) | ise-uiuc/Magicoder-Evol-Instruct-110K |
| xLAM Function Calling | Tool/function calling (query + tools + answers) | Salesforce/xlam-function-calling-60k |
| Hermes Function Calling v1 | Tool calling (ShareGPT format conversations) | NousResearch/hermes-function-calling-v1 |
| Pile-10k | General reasoning and knowledge | NeelNanda/pile-10k |
| Domain instructions | Coding + tool calling (local file, 5× duplicated for weight) | Local: imatrix_mistral_domain_calib_5x.txt |
Quality Benchmarks
All benchmarks use wikitext-2-raw-v1 (test split), the standard dataset for quantization quality evaluation, matching the methodology of llama.cpp, AutoAWQ, GPTQ, and the academic literature.
WikiText-2 Perplexity (ctx=512)
Measured via vLLM completions API with echo=True + logprobs. 642 non-overlapping chunks, ~301K tokens scored. Methodology matches llama.cpp ./perplexity -c 512.
| Model | PPL | Δ vs Base |
|---|---|---|
| Base BF16 | 7.0332 | - |
| AutoRound W4A16 (this model) | 7.2478 | +3.05% |
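For readers who want to reproduce the numbers, the sketch below shows how perplexity follows from per-token log-probabilities and how the relative change in the table is derived. The helper names are hypothetical, and which positions of each chunk are scored is left to the caller, since the measurement script is not part of this card.

```python
import math
from typing import List, Sequence

def split_chunks(token_ids: Sequence[int], ctx: int = 512) -> List[Sequence[int]]:
    """Non-overlapping windows of ctx tokens; a trailing partial window is dropped."""
    n_chunks = len(token_ids) // ctx
    return [token_ids[i * ctx:(i + 1) * ctx] for i in range(n_chunks)]

def perplexity(logprobs: Sequence[float]) -> float:
    """PPL = exp(-mean natural-log probability of the scored tokens)."""
    return math.exp(-sum(logprobs) / len(logprobs))

def relative_change(quant_ppl: float, base_ppl: float) -> float:
    """Percentage increase of the quantized PPL over the base PPL."""
    return (quant_ppl - base_ppl) / base_ppl * 100.0

# Sanity check: a model that assigns probability 1/4 to every token has PPL 4.
ppl_uniform = perplexity([math.log(0.25)] * 100)

# Relative change computed from the two values reported in the table.
delta = relative_change(7.2478, 7.0332)
```

Evaluating `relative_change` on the reported values gives about 3.05, matching the table.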
KL Divergence vs BF16 (Static / Prefill)
KL divergence measures how much the output probability distribution has shifted from the base model. Lower is better; 0 = identical.
Methodology matches llama.cpp --kl-divergence: wikitext-2-raw-v1, ctx=512, scoring only the second half of each chunk (positions [256–511]), which ensures every scored token has at least 256 tokens of left context. KLD direction: KL(P_base ‖ P_quant), that is, "how well does the quantized model approximate the base?"
| Metric | Value |
|---|---|
| Mean KLD | 0.0746 |
| Median KLD | 0.0279 |
| 95th %ile KLD | 0.286 |
| 99th %ile KLD | 0.592 |
| Δp RMS | 2.12% |
| Same top-p | 92.6% |
Same top-p = 92.6% means both quantized and base models agree on the most likely token 92.6% of the time.
Note on API-based KLD: These measurements use vLLM's top-20 logprobs per token (API limit), not full-vocab logits. This underestimates absolute KLD by ~10–15% compared to llama.cpp's full-vocab computation (see mlx-kld analysis). The corrected full-vocab KLD is estimated at ~0.09. Relative comparisons between quantization methods remain valid regardless.
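The effect described in the note can be illustrated with a small NumPy sketch. It compares full-vocabulary KL(P_base ‖ P_quant) with a truncated variant that keeps the base model's top-k tokens and merges everything else into a single tail bucket. The tail-bucket scheme is an assumption made for illustration; the card does not state how the top-20 logprobs were combined, and with a real API the two models' top-20 sets may differ. The logits here are synthetic.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def kl_full(base_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """KL(P_base || P_quant) over the whole vocabulary."""
    log_p, log_q = log_softmax(base_logits), log_softmax(quant_logits)
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))

def kl_topk_with_tail(base_logits: np.ndarray, quant_logits: np.ndarray, k: int = 20) -> float:
    """KL over the base model's top-k tokens plus one merged tail bucket."""
    log_p, log_q = log_softmax(base_logits), log_softmax(quant_logits)
    top = np.argsort(log_p)[-k:]
    p, q = np.exp(log_p[top]), np.exp(log_q[top])
    kl = float(np.sum(p * (log_p[top] - log_q[top])))
    p_tail, q_tail = 1.0 - p.sum(), 1.0 - q.sum()
    if p_tail > 0 and q_tail > 0:
        kl += float(p_tail * np.log(p_tail / q_tail))
    return kl

def same_top_token(base_logits: np.ndarray, quant_logits: np.ndarray) -> bool:
    """True when both models rank the same token first."""
    return int(np.argmax(base_logits)) == int(np.argmax(quant_logits))

rng = np.random.default_rng(0)
base = rng.standard_normal(1000) * 3.0          # synthetic "base" logits
quant = base + rng.standard_normal(1000) * 0.3  # perturbed "quantized" logits
full = kl_full(base, quant)
approx = kl_topk_with_tail(base, quant)
```

Merging tokens into a bucket can only lose information, so the truncated value never exceeds the full-vocabulary value. That is the direction of the bias the note describes; the size of the gap depends on how much probability mass sits outside the top k.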
KL Divergence vs BF16 (Generation / Autoregressive)
KLD measured per generation step during autoregressive decoding. Step-0 = first generated token (comparable to static/prefill KLD). Later steps compound: small per-token divergences accumulate as the two models diverge onto different token trajectories. This is normal and expected for 4-bit quantization.
| Prompt Len | Gen Len | Step-0 KLD |
|---|---|---|
| 128 | 128 | 0.072 |
| 512 | 128 | 0.080 |
| 1024 | 128 | 0.084 |
Step-0 KLD (0.072–0.084) is consistent with the static prefill KLD (0.0746). In these measurements it rises slightly with prompt length, from 0.072 at 128 tokens to 0.084 at 1,024 tokens.
vLLM Throughput (RTX 5090, 32 GB)
Single Request
| Metric | Value |
|---|---|
| Aggregate Throughput | 107.6 tok/s |
| Total tokens generated | 11,251 (20 requests × up to 1,024 tokens) |
| Average Latency | 5.23 s |
| Min/Max Latency | 0.29 s / 9.35 s |
| Per-request Throughput | 9.2–110.8 tok/s |
| Success Rate | 20/20 (100%) |
32 Concurrent Requests
| Metric | Value |
|---|---|
| Aggregate Throughput | 2,604.0 tok/s |
| Total tokens generated | 354,430 (640 requests × up to 1,024 tokens) |
| Average Latency | 6.51 s (end-to-end per request, including queuing) |
| Min/Max Latency | 0.23 s / 13.32 s |
| Per-request Throughput | 2.3–99.6 tok/s |
| Success Rate | 640/640 (100%) |
Each request generated up to 1,024 tokens. Average latency includes queuing time under concurrent load.
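The relationship between the reported metrics can be sketched as below. The benchmark script itself is not included in this card, so the record type, the function name, and the definition of aggregate throughput as total generated tokens divided by wall-clock time are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RequestResult:
    completion_tokens: int   # tokens generated for this request
    latency_s: float         # end-to-end latency, including any queuing

def summarize(results: List[RequestResult], wall_time_s: float) -> Dict[str, object]:
    """Aggregate per-request records into the metrics used in the tables."""
    total = sum(r.completion_tokens for r in results)
    latencies = [r.latency_s for r in results]
    return {
        "total_tokens": total,
        "aggregate_tok_s": total / wall_time_s,
        "avg_latency_s": sum(latencies) / len(latencies),
        "min_latency_s": min(latencies),
        "max_latency_s": max(latencies),
        "per_request_tok_s": [r.completion_tokens / r.latency_s for r in results],
    }

# Consistency check on the single-request table: if the 20 requests ran
# back to back, wall time is roughly 20 x the average latency.
sequential_wall_s = 20 * 5.23
sequential_tok_s = 11251 / sequential_wall_s

demo = summarize([RequestResult(100, 2.0), RequestResult(300, 4.0)], wall_time_s=6.0)
```

Under that back-to-back assumption, the single-request figures reproduce the reported aggregate of about 107.6 tok/s. With 32 concurrent requests, wall time is shared across requests, so aggregate throughput is far higher than any individual request's rate.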
Hardware Requirements
| GPU VRAM | Recommended Config |
|---|---|
| 96 GB (RTX PRO 6000, H100, H200) | gpu_memory_utilization: 0.95, max_model_len: 131072, KV: fp8_e4m3 optional |
| 32 GB (RTX 5090) | gpu_memory_utilization: 0.96, max_model_len: 131072, kv_cache_dtype: fp8_e4m3, max_num_batched_tokens: 8192 |
Minimum: 1× GPU with ≥24 GB VRAM (with fp8 KV cache and reduced context).
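A rough back-of-the-envelope estimate shows where these limits come from. The sketch below derives KV cache size from the architecture values in the Model Details table (40 layers, 8 key/value heads, head dimension 128) and approximates weight memory from the shard sizes listed under Files in This Repository. It ignores activations, the vision tower's working memory, fp8 scaling metadata, and vLLM's own overhead, so treat the results as lower bounds.

```python
GIB = 2 ** 30

def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 40,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_elem: int = 1) -> int:
    """Bytes of KV cache for num_tokens cached positions.

    The factor 2 accounts for one key and one value vector per layer.
    bytes_per_elem is 1 for fp8 and 2 for BF16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

per_token_fp8 = kv_cache_bytes(1)                              # bytes per cached token
fp8_full_gib = kv_cache_bytes(131072, bytes_per_elem=1) / GIB  # one full-length sequence
bf16_full_gib = kv_cache_bytes(131072, bytes_per_elem=2) / GIB

# Approximate weight footprint from the listed shard sizes (GB, rounded).
weights_gb = 4 * 2.9 + 2.7
```

One full 131,072-token sequence needs about 10 GiB of KV cache in fp8 and about 20 GiB in BF16, on top of roughly 14.3 GB of weights. This is consistent with the table: a 32 GB card needs the fp8 KV cache to hold the full context, and a 24 GB card needs a shorter `max_model_len` as well.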
AutoRound W4A16 runs on GPUs with compute capability 7.5 or higher (sm75+), including Ampere (A100, RTX 3090), Ada (RTX 4090), and Blackwell (RTX 5090, RTX PRO 6000). No architecture-specific tensor cores are required, because dequantization is handled by Marlin/CUTLASS kernels.
Usage with vLLM
Tested with: vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 (vLLM v0.19.2rc1.dev134)
Docker Deployment
docker run -d --name vllm-mistral-autoround \
--runtime=nvidia --gpus '"device=0"' \
-p 8000:8000 \
-v /path/to/model:/workspace/model \
-v /path/to/vllm_config.yaml:/vllm_config.yaml \
--ipc=host --shm-size=16g \
--restart unless-stopped \
vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 \
--config /vllm_config.yaml
Example vLLM Configuration (YAML)
This configuration is deployed and verified on an RTX 5090 (32 GB):
# -- Model & Server ----------------------------------------------------------
model: /workspace/model
host: "0.0.0.0"
port: 8000
served_model_name: "mistral-small-3.2-24b-Instruct-AWQ-W4A16"
trust_remote_code: true
tensor_parallel_size: 1
# -- Quantization ------------------------------------------------------------
quantization: auto_round
# -- Tokenizer & Config Format -----------------------------------------------
tokenizer_mode: mistral
config_format: mistral
# -- Data Type ---------------------------------------------------------------
dtype: bfloat16
# -- Load Format -------------------------------------------------------------
load_format: auto
# -- Context & Batching ------------------------------------------------------
max_model_len: 131072
max_num_batched_tokens: 8192
max_num_seqs: 32
# -- Memory ------------------------------------------------------------------
gpu_memory_utilization: 0.96
enable_prefix_caching: true
enable_chunked_prefill: true
kv_cache_dtype: fp8_e4m3
# -- Multi-Modal -------------------------------------------------------------
limit_mm_per_prompt:
  image: 4
# -- Tool Calling ------------------------------------------------------------
# Mistral native tool call format.
enable_auto_tool_choice: true
tool_call_parser: mistral
# -- Default Generation ------------------------------------------------------
generation_config: auto
override_generation_config:
  temperature: 0.15
# -- Misc --------------------------------------------------------------------
disable_custom_all_reduce: true
Inference Test
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"mistral-small-3.2-24b-Instruct-AWQ-W4A16","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":10}'
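The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the server from the Docker example is reachable at `http://localhost:8000` and uses the `served_model_name` from the YAML configuration. The helper names `build_payload` and `chat` are hypothetical.

```python
import json
import urllib.request

def build_payload(prompt: str,
                  model: str = "mistral-small-3.2-24b-Instruct-AWQ-W4A16",
                  max_tokens: int = 10) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the prompt to the chat completions endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Usage (requires a running server):
#   print(chat("What is 2+2? One word."))
```

The `model` field must match `served_model_name` exactly, otherwise the server rejects the request.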
Notes
Tokenizer: Tekken (Mistral-specific)
This model uses the Tekken tokenizer (tekken.json), not a standard HF tokenizer. You must use --tokenizer-mode mistral with vLLM to ensure correct tokenization. Using auto or hf mode produces garbled output.
Vision: Image Size Limit
The tekken.json in this repository has max_image_size set to 1024 (down from the original 1540). Images with any dimension exceeding 1024px are proportionally downscaled before vision encoding.
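As an illustration of what proportional downscaling means for input dimensions, the sketch below scales an image so that its longer side fits within `max_image_size`. The function name and the rounding are assumptions; the actual preprocessor may round differently, for example to multiples of the patch size.

```python
from typing import Tuple

def fit_within(width: int, height: int, max_size: int = 1024) -> Tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_size, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_size:
        return width, height          # already within the limit, left unchanged
    ratio = max_size / longest
    return max(1, round(width * ratio)), max(1, round(height * ratio))

landscape = fit_within(2048, 1024)    # longer side halved, aspect ratio kept
small = fit_within(800, 600)          # below the limit
```

A 2048×1024 image would be reduced to 1024×512 before vision encoding, while an 800×600 image passes through unchanged.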
Files in This Repository
| File | Size | Description |
|---|---|---|
| model-00001-of-00005.safetensors | ~2.9 GB | Quantized LM layers (shard 1) |
| model-00002-of-00005.safetensors | ~2.9 GB | Quantized LM layers (shard 2) |
| model-00003-of-00005.safetensors | ~2.9 GB | Quantized LM layers (shard 3) |
| model-00004-of-00005.safetensors | ~2.9 GB | Quantized LM layers (shard 4) |
| model-00005-of-00005.safetensors | ~2.7 GB | LM layers + vision tower + projector + lm_head (BF16) |
| model.safetensors.index.json | - | Shard index with weight map |
| config.json | - | Model configuration with quantization_config |
| params.json | - | Mistral-native parameter specification |
| quantization_config.json | - | AutoRound quantization parameters |
| tekken.json | ~15 MB | Tekken tokenizer (Mistral-specific) |
| tokenizer.json | ~20 MB | HF-compatible tokenizer fallback |
| tokenizer_config.json | ~22 MB | Tokenizer configuration |
| generation_config.json | - | Generation parameters |
| preprocessor_config.json | - | Image preprocessor configuration |
| processor_config.json | - | Processor configuration |
License
This quantization is released under the Apache 2.0 License, following the base model's license.
The base model mistralai/Mistral-Small-3.2-24B-Instruct-2506 is licensed under Apache 2.0.
Citation
If you use this model in your research, please cite:
@misc{mistral-small-3.2-24b-autoround-w4a16,
title = {Mistral-Small-3.2-24B-Instruct-2506 AutoRound W4A16 Quantization},
author = {Gratex International},
year = {2026},
howpublished = {\url{https://huggingface.co/gratex/Mistral-Small-3.2-24B-Instruct-2506-W4A16-AutoRound}},
note = {Quantized with AutoRound 0.10.2}
}
Acknowledgments
This quantization was produced using hardware provided by Gratex International, a.s.
- Original Model: mistralai/Mistral-Small-3.2-24B-Instruct-2506
- Quantization Tool: AutoRound
- Deployment Engine: vLLM