
Mistral-Small-3.2-24B-Instruct-2506: AutoRound W4A16 Quantization

This is a W4A16 (4-bit weight-only) quantization of Mistral-Small-3.2-24B-Instruct-2506, a 24B-parameter multimodal model with Pixtral-style vision capabilities. It was quantized with AutoRound v0.10.2 using SignRound optimization (1000 iterations).

AutoRound W4A16 stores weights as INT4 and keeps activations in BF16. Weights are quantized GPTQ-style in groups of 128 (group_size=128) and dequantized to BF16 before each matrix multiply. This format is widely supported across GPU architectures (Ampere, Ada, Blackwell) and inference engines (vLLM, SGLang, TensorRT-LLM).
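To make the storage scheme concrete, here is a minimal sketch of symmetric INT4 group quantization with group_size=128. It assumes plain round-to-nearest and one scale per group; it does not reproduce SignRound's learned rounding or the packed auto_gptq tensor layout, and the function names are illustrative only.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1      # 7 for INT4
    qmin = -(2 ** (bits - 1))       # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    # One scale per group; no zero-point because the scheme is symmetric.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate weights from integer codes and the group scale."""
    return [c * scale for c in codes]


def quantize_row(row, group_size=128):
    """Split one weight row into groups and quantize each independently."""
    return [
        quantize_group(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Concatenate the dequantized groups back into one row."""
    restored = []
    for codes, scale in groups:
        restored.extend(dequantize_group(codes, scale))
    return restored
```

With this scheme the reconstruction error of each weight is at most half of its group's scale, which is why smaller groups (more scales) generally track the original weights more closely.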

Model Details

Property | Value
Base Model | mistralai/Mistral-Small-3.2-24B-Instruct-2506
Quantization Method | AutoRound (SignRound, W4A16)
Weight Precision | INT4 (group_size=128, symmetric)
Activation Precision | BF16 (weight-only quantization)
Quantization Library | AutoRound 0.10.2
Packing Format | auto_round:auto_gptq
Architecture | Mistral3ForConditionalGeneration
LM Layers | 40 MistralDecoder layers
Hidden Size | 5120
Intermediate Size | 32768
Attention Heads | 32 (query), 8 (key/value, GQA)
Head Dimension | 128
Vocabulary Size | 131,072 (LM head dimension; Tekken tokenizer contains 150,000 regular + 1,000 special tokens, but only 131,072 are used by the model)
Context Window | 131,072 tokens
Vision Encoder | Pixtral (24 layers, hidden_size=1024, patch_size=14)
Vision Projector | patch_merge (spatial_merge_size=2)
Quantized Components | Text decoder Linear layers only
Preserved in BF16 | Vision tower (all 24 layers), multi-modal projector, lm_head, embed_tokens, layer norms
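As a rough worked estimate, the dimensions in this table imply the following weight footprint. The sketch assumes the usual Mistral decoder layout (q/k/v/o projections plus gate/up/down MLP projections, no biases), one 16-bit scale per group of 128 weights, and untied embed_tokens and lm_head. It leaves out the vision tower, projector, layer norms, and any packed zero-point tensors, so it is an approximation rather than a description of the actual shards.

```python
HIDDEN = 5120
INTERMEDIATE = 32768
LAYERS = 40
Q_HEADS, KV_HEADS, HEAD_DIM = 32, 8, 128
VOCAB = 131072
GROUP_SIZE = 128


def linear_params_per_layer():
    """Parameter count of the quantized Linear layers in one decoder layer."""
    q = HIDDEN * Q_HEADS * HEAD_DIM
    k = HIDDEN * KV_HEADS * HEAD_DIM
    v = HIDDEN * KV_HEADS * HEAD_DIM
    o = Q_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE   # gate, up, and down projections
    return q + k + v + o + mlp


def estimate_bytes():
    """Approximate bytes for the text decoder weights plus BF16 embeddings."""
    quantized_params = LAYERS * linear_params_per_layer()
    int4_bytes = quantized_params // 2                       # 4 bits per weight
    scale_bytes = (quantized_params // GROUP_SIZE) * 2       # assumed 16-bit scales
    embed_bytes = 2 * VOCAB * HIDDEN * 2                     # embed_tokens + lm_head, BF16
    return {
        "quantized_params": quantized_params,
        "int4_bytes": int4_bytes,
        "scale_bytes": scale_bytes,
        "embed_bytes": embed_bytes,
        "total_bytes": int4_bytes + scale_bytes + embed_bytes,
    }
```

Under these assumptions the quantized Linear layers hold about 22.2B parameters and the estimate comes to roughly 14.1 GB before the BF16 vision components are added.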

Quantization Configuration

{
  "bits": 4,
  "data_type": "int",
  "group_size": 128,
  "sym": true,
  "batch_size": 4,
  "iters": 1000,
  "low_gpu_mem_usage": true,
  "nsamples": 512,
  "block_name_to_quantize": "model.language_model.layers",
  "quant_method": "auto-round",
  "packing_format": "auto_round:auto_gptq"
}

Key parameters:

  • iters=1000: Maximum SignRound optimization steps per block (~5× slower than default 200, best accuracy)
  • nsamples=512: 512 calibration samples (4× default of 128)
  • sym=true: Symmetric quantization (no zero-point)
  • group_size=128: Per-128-element scaling groups

Calibration Dataset

The 512 calibration samples were built from a domain-specific mix of text-only datasets (no images/video, to avoid torchvision import errors in the llm-compressor environment). Short samples were concatenated into chunks of ≥2,048 Tekken tokens each:

Source | Domain | HF ID
Magicoder-Evol-Instruct | Coding (instruction + response pairs) | ise-uiuc/Magicoder-Evol-Instruct-110K
xLAM Function Calling | Tool/function calling (query + tools + answers) | Salesforce/xlam-function-calling-60k
Hermes Function Calling v1 | Tool calling (ShareGPT format conversations) | NousResearch/hermes-function-calling-v1
Pile-10k | General reasoning and knowledge | NeelNanda/pile-10k
Domain instructions | Coding + tool calling (local file, 5× duplicated for weight) | Local: imatrix_mistral_domain_calib_5x.txt
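The concatenation step can be sketched as follows. This is an illustrative reconstruction, not the script used for this model: `build_chunks` and `whitespace_token_count` are hypothetical names, the whitespace counter merely stands in for a real Tekken token count, and dropping the trailing partial chunk is an assumption.

```python
def build_chunks(samples, count_tokens, min_tokens=2048):
    """Greedily concatenate short samples until each chunk has >= min_tokens."""
    chunks = []
    current, current_tokens = [], 0
    for text in samples:
        current.append(text)
        current_tokens += count_tokens(text)
        if current_tokens >= min_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
    # Any trailing partial chunk is dropped so every chunk meets the minimum.
    return chunks


def whitespace_token_count(text):
    """Stand-in token counter; a real pipeline would count Tekken tokens."""
    return len(text.split())
```

Calling `build_chunks(samples, whitespace_token_count)` on a list of strings returns only chunks that reach the minimum length; swapping in a real tokenizer only changes the `count_tokens` callable.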

Quality Benchmarks

All benchmarks use wikitext-2-raw-v1 (test split), the standard dataset for quantization quality evaluation, matching the methodology of llama.cpp, AutoAWQ, GPTQ, and the academic literature.

WikiText-2 Perplexity (ctx=512)

Measured via vLLM completions API with echo=True + logprobs. 642 non-overlapping chunks, ~301K tokens scored. Methodology matches llama.cpp ./perplexity -c 512.

Model | PPL | Δ vs Base
Base BF16 | 7.0332 | -
AutoRound W4A16 (this model) | 7.2478 | +3.05%
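For reference, perplexity over scored tokens is exp of the negative mean log-probability, and the Δ column is the relative change against the base model. The sketch below assumes the per-token log-probabilities have already been collected (for example from the API's echoed logprobs); the helper names are illustrative, and which positions within each chunk are scored is left to the caller.

```python
import math


def split_chunks(token_ids, ctx=512):
    """Cut a token stream into non-overlapping chunks of ctx tokens."""
    n_chunks = len(token_ids) // ctx          # the incomplete tail is discarded
    return [token_ids[i * ctx:(i + 1) * ctx] for i in range(n_chunks)]


def perplexity(token_logprobs):
    """exp(-mean log p) over all scored tokens (natural-log probabilities)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_delta_percent(base_ppl, quant_ppl):
    """Relative perplexity increase of the quantized model, in percent."""
    return (quant_ppl - base_ppl) / base_ppl * 100.0
```

Applied to the table above, `relative_delta_percent(7.0332, 7.2478)` gives about 3.05, matching the reported delta.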

KL Divergence vs BF16 (Static / Prefill)

KL divergence measures how much the output probability distribution has shifted from the base model. Lower is better; 0 = identical.

Methodology matches llama.cpp --kl-divergence: wikitext-2-raw-v1, ctx=512, scoring only the second half of each chunk (positions [256–511]), which ensures every scored token has at least 256 tokens of left context. KLD direction: KL(P_base ‖ P_quant), i.e. "how well does the quantized model approximate the base?"

Metric | Value
Mean KLD | 0.0746
Median KLD | 0.0279
99th %ile KLD | 0.592
95th %ile KLD | 0.286
Δp RMS | 2.12%
Same top-p | 92.6%

Same top-p = 92.6% means the quantized and base models agree on the most likely token at 92.6% of scored positions (the name is unrelated to top-p sampling).
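A minimal sketch of these metrics is shown below. It assumes full-vocabulary log-probabilities are available for both models at each position, which differs from the API-based measurement reported here; the function names are illustrative.

```python
import math


def kl_divergence(base_logprobs, quant_logprobs):
    """KL(P_base || P_quant) at one position, from aligned log-probabilities."""
    return sum(
        math.exp(lp) * (lp - lq)
        for lp, lq in zip(base_logprobs, quant_logprobs)
    )


def same_top_token(base_logprobs, quant_logprobs):
    """True when both models rank the same token first."""
    def argmax(values):
        return max(range(len(values)), key=values.__getitem__)
    return argmax(base_logprobs) == argmax(quant_logprobs)


def score_chunk(base_chunk, quant_chunk, first_scored=256):
    """Mean KLD and top-token agreement over positions >= first_scored."""
    klds, agree = [], 0
    for pos in range(first_scored, len(base_chunk)):
        klds.append(kl_divergence(base_chunk[pos], quant_chunk[pos]))
        agree += same_top_token(base_chunk[pos], quant_chunk[pos])
    return sum(klds) / len(klds), agree / len(klds)
```

Each chunk is a list of per-position distributions; with ctx=512 and first_scored=256, only positions 256 through 511 contribute, as in the methodology above.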

Note on API-based KLD: These measurements use vLLM's top-20 logprobs per token (API limit), not full-vocab logits. This underestimates absolute KLD by ~10–15% compared to llama.cpp's full-vocab computation (see mlx-kld analysis). The corrected full-vocab KLD is estimated at ~0.09. Relative comparisons between quantization methods remain valid regardless.

KL Divergence vs BF16 (Generation / Autoregressive)

KLD measured per generation step during autoregressive decoding. Step-0 = first generated token (comparable to static/prefill KLD). Later steps compound: small per-token divergences accumulate as the two models diverge onto different token trajectories. This is normal and expected for 4-bit quantization.

Prompt Len | Gen Len | Step-0 KLD
128 | 128 | 0.072
512 | 128 | 0.080
1024 | 128 | 0.084

Step-0 KLD (0.072–0.084) is consistent with the static prefill KLD (0.0746). In these runs it rises slightly with prompt length, from 0.072 at 128 tokens to 0.084 at 1,024 tokens.

vLLM Throughput (RTX 5090, 32 GB)

Single Request

Metric | Value
Aggregate Throughput | 107.6 tok/s
Total tokens generated | 11,251 (20 requests × up to 1,024 tokens)
Average Latency | 5.23 s
Min/Max Latency | 0.29 s / 9.35 s
Per-request Throughput | 9.2–110.8 tok/s
Success Rate | 20/20 (100%)

32 Concurrent Requests

Metric | Value
Aggregate Throughput | 2,604.0 tok/s
Total tokens generated | 354,430 (640 requests × up to 1,024 tokens)
Average Latency | 6.51 s (end-to-end per request, including queuing)
Min/Max Latency | 0.23 s / 13.32 s
Per-request Throughput | 2.3–99.6 tok/s
Success Rate | 640/640 (100%)

Each request generated up to 1,024 tokens. Average latency includes queuing time under concurrent load.
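The relationship between these metrics can be sketched as below. The source does not spell out its formulas, so this assumes aggregate throughput is total generated tokens divided by wall-clock time and per-request throughput is tokens divided by that request's latency; `summarize` is a hypothetical helper.

```python
def summarize(results, wall_seconds):
    """Summarize a benchmark run.

    results: list of (generated_tokens, latency_seconds), one per request.
    wall_seconds: wall-clock duration of the whole run.
    """
    total_tokens = sum(tokens for tokens, _ in results)
    latencies = [latency for _, latency in results]
    per_request = [tokens / latency for tokens, latency in results]
    return {
        "total_tokens": total_tokens,
        "aggregate_tok_s": total_tokens / wall_seconds,
        "avg_latency_s": sum(latencies) / len(latencies),
        "min_latency_s": min(latencies),
        "max_latency_s": max(latencies),
        "min_request_tok_s": min(per_request),
        "max_request_tok_s": max(per_request),
    }
```

Under that assumption the single-request numbers are self-consistent: 11,251 tokens at 107.6 tok/s is about 104.6 s, which matches 20 sequential requests at an average of 5.23 s each.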

Hardware Requirements

GPU VRAM | Recommended Config
96 GB (RTX PRO 6000, H100, H200) | gpu_memory_utilization: 0.95, max_model_len: 131072, kv_cache_dtype: fp8_e4m3 (optional)
32 GB (RTX 5090) | gpu_memory_utilization: 0.96, max_model_len: 131072, kv_cache_dtype: fp8_e4m3, max_num_batched_tokens: 8192

Minimum: 1× GPU with ≥24 GB VRAM (with fp8 KV cache and reduced context).
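A back-of-the-envelope KV cache estimate shows why the FP8 cache matters on smaller cards. The sketch uses the layer count, KV head count, and head dimension from Model Details, and assumes one key and one value vector per layer per token. It ignores FP8 scale metadata and vLLM's block-level allocation overhead, so treat the results as approximate.

```python
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128


def kv_bytes_per_token(bytes_per_element):
    """Bytes of K and V cache stored for a single token across all layers."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_element


def kv_cache_gib(num_tokens, bytes_per_element):
    """Approximate KV cache size in GiB for num_tokens cached tokens."""
    return num_tokens * kv_bytes_per_token(bytes_per_element) / 2**30
```

For a full 131,072-token context this gives about 20 GiB with a BF16 cache (2 bytes per element) and about 10 GiB with FP8 (1 byte per element), before accounting for the model weights.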

AutoRound W4A16 works on all GPU architectures (sm75+): Ampere (A100, RTX 3090), Ada (RTX 4090), Blackwell (RTX 5090, RTX PRO 6000). No architecture-specific tensor cores are required; dequantization is handled by Marlin/CUTLASS kernels.

Usage with vLLM

Tested with: vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 (vLLM v0.19.2rc1.dev134)

Docker Deployment

docker run -d --name vllm-mistral-autoround \
  --runtime=nvidia --gpus '"device=0"' \
  -p 8000:8000 \
  -v /path/to/model:/workspace/model \
  -v /path/to/vllm_config.yaml:/vllm_config.yaml \
  --ipc=host --shm-size=16g \
  --restart unless-stopped \
  vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 \
  --config /vllm_config.yaml

Example vLLM Configuration (YAML)

This configuration is deployed and verified on an RTX 5090 (32 GB):

# -- Model & Server ----------------------------------------------------------
model: /workspace/model
host: "0.0.0.0"
port: 8000
served_model_name: "mistral-small-3.2-24b-Instruct-AWQ-W4A16"
trust_remote_code: true
tensor_parallel_size: 1

# -- Quantization ------------------------------------------------------------
quantization: auto_round

# -- Tokenizer & Config Format -----------------------------------------------
tokenizer_mode: mistral
config_format: mistral

# -- Data Type ---------------------------------------------------------------
dtype: bfloat16

# -- Load Format -------------------------------------------------------------
load_format: auto

# -- Context & Batching ------------------------------------------------------
max_model_len: 131072
max_num_batched_tokens: 8192
max_num_seqs: 32

# -- Memory ------------------------------------------------------------------
gpu_memory_utilization: 0.96
enable_prefix_caching: true
enable_chunked_prefill: true
kv_cache_dtype: fp8_e4m3

# -- Multi-Modal -------------------------------------------------------------
limit_mm_per_prompt:
  image: 4

# -- Tool Calling ------------------------------------------------------------
# Mistral native tool call format.
enable_auto_tool_choice: true
tool_call_parser: mistral

# -- Default Generation ------------------------------------------------------
generation_config: auto
override_generation_config:
  temperature: 0.15

# -- Misc --------------------------------------------------------------------
disable_custom_all_reduce: true

Inference Test

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mistral-small-3.2-24b-Instruct-AWQ-W4A16","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":10}'
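The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from the configuration above is listening on localhost:8000 and returns the usual OpenAI-style chat completion body; `build_request` and `ask` are illustrative helper names.

```python
import json
import urllib.request


def build_request(prompt, base_url="http://localhost:8000",
                  model="mistral-small-3.2-24b-Instruct-AWQ-W4A16",
                  max_tokens=10):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,                      # must match served_model_name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=60):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the container running, `print(ask("What is 2+2? One word."))` mirrors the curl command above.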

Notes

Tokenizer: Tekken (Mistral-specific)

This model uses the Tekken tokenizer (tekken.json), not a standard HF tokenizer. You must use --tokenizer-mode mistral with vLLM to ensure correct tokenization. Using auto or hf mode produces garbled output.

Vision: Image Size Limit

The tekken.json in this repository has max_image_size set to 1024 (down from the original 1540). Images with any dimension exceeding 1024px are proportionally downscaled before vision encoding.
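The effect of this limit on image dimensions can be approximated as follows. The sketch assumes the longest side is scaled down to max_image_size with the aspect ratio preserved and the result rounded to the nearest pixel; the real preprocessor may round differently or align to patch boundaries, and `fit_within` is a hypothetical helper.

```python
def fit_within(width, height, max_side=1024):
    """Proportionally downscale (width, height) so neither side exceeds max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height                 # already within the limit
    scale = max_side / longest
    # Keep at least one pixel per side for extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 2048×1024 image would be reduced to 1024×512 under these assumptions, while an 800×600 image would pass through unchanged.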

Files in This Repository

File | Size | Description
model-00001-of-00005.safetensors | ~2.9 GB | Quantized LM layers (shard 1)
model-00002-of-00005.safetensors | ~2.9 GB | Quantized LM layers (shard 2)
model-00003-of-00005.safetensors | ~2.9 GB | Quantized LM layers (shard 3)
model-00004-of-00005.safetensors | ~2.9 GB | Quantized LM layers (shard 4)
model-00005-of-00005.safetensors | ~2.7 GB | LM layers + vision tower + projector + lm_head (BF16)
model.safetensors.index.json | - | Shard index with weight map
config.json | - | Model configuration with quantization_config
params.json | - | Mistral-native parameter specification
quantization_config.json | - | AutoRound quantization parameters
tekken.json | ~15 MB | Tekken tokenizer (Mistral-specific)
tokenizer.json | ~20 MB | HF-compatible tokenizer fallback
tokenizer_config.json | ~22 MB | Tokenizer configuration
generation_config.json | - | Generation parameters
preprocessor_config.json | - | Image preprocessor configuration
processor_config.json | - | Processor configuration

License

This quantization is released under the Apache 2.0 License, following the base model's license.

The base model mistralai/Mistral-Small-3.2-24B-Instruct-2506 is licensed under Apache 2.0.

Citation

If you use this model in your research, please cite:

@misc{mistral-small-3.2-24b-autoround-w4a16,
  title = {Mistral-Small-3.2-24B-Instruct-2506 AutoRound W4A16 Quantization},
  author = {Gratex International},
  year = {2026},
  howpublished = {\url{https://huggingface.co/gratex/Mistral-Small-3.2-24B-Instruct-2506-W4A16-AutoRound}},
  note = {Quantized with AutoRound 0.10.2}
}

Acknowledgments

This quantization was produced using hardware provided by Gratex International, a.s.


Original Model: mistralai/Mistral-Small-3.2-24B-Instruct-2506
Quantization Tool: AutoRound
Deployment Engine: vLLM
