
Mistral-Small-3.2-24B-Instruct-2506 — AutoRound W4A16 Quantization

This is a W4A16 (4-bit weight-only) quantization of Mistral-Small-3.2-24B-Instruct-2506, a 24B parameter multimodal model with Pixtral-style vision capabilities, quantized using AutoRound v0.10.2 with SignRound optimization (1000 iterations).

AutoRound W4A16 stores weights as INT4 and keeps activations in BF16. Weights are packed in GPTQ-style groups (group_size=128, one scale per group) and dequantized before the matrix multiply. This format is widely supported across GPU architectures (Ampere, Ada, Blackwell) and inference engines (vLLM, SGLang, TensorRT-LLM).
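As a rough illustration of group-wise symmetric INT4 storage, the sketch below quantizes a weight row with plain round-to-nearest and one scale per 128-element group. It is a simplification under stated assumptions: AutoRound additionally learns rounding adjustments with SignRound, and the on-disk `auto_round:auto_gptq` packing is not reproduced here. All function names are hypothetical.

```python
from typing import List, Tuple

GROUP_SIZE = 128  # matches group_size=128 in this model's config
QMIN, QMAX = -8, 7  # signed 4-bit range used by this sketch (assumption)


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Round-to-nearest symmetric quantization of one group to 4-bit codes."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [max(QMIN, min(QMAX, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Recover approximate weights; conceptually this precedes the matmul."""
    return [c * scale for c in codes]


def quantize_row(row: List[float], group_size: int = GROUP_SIZE):
    """Split a weight row into groups, each with its own scale."""
    return [quantize_group(row[i:i + group_size])
            for i in range(0, len(row), group_size)]
```

With round-to-nearest, the reconstruction error of each weight is at most half of its group's scale, which is why smaller groups (more scales) generally track the original weights more closely.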

Model Details

Property	Value
Base Model	mistralai/Mistral-Small-3.2-24B-Instruct-2506
Quantization Method	AutoRound (SignRound, W4A16)
Weight Precision	INT4 (group_size=128, symmetric)
Activation Precision	BF16 (weight-only quantization)
Quantization Library	AutoRound 0.10.2
Packing Format	auto_round:auto_gptq
Architecture	Mistral3ForConditionalGeneration
LM Layers	40 MistralDecoder layers
Hidden Size	5120
Intermediate Size	32768
Attention Heads	32 (query), 8 (key/value, GQA)
Head Dimension	128
Vocabulary Size	131,072 (LM head dimension; Tekken tokenizer contains 150,000 regular + 1,000 special tokens, but only 131,072 are used by the model)
Context Window	131,072 tokens
Vision Encoder	Pixtral (24 layers, hidden_size=1024, patch_size=14)
Vision Projector	patch_merge (spatial_merge_size=2)
Quantized Components	Text decoder Linear layers only
Preserved in BF16	Vision tower (all 24 layers), multi-modal projector, lm_head, embed_tokens, layer norms

Quantization Configuration

{
  "bits": 4,
  "data_type": "int",
  "group_size": 128,
  "sym": true,
  "batch_size": 4,
  "iters": 1000,
  "low_gpu_mem_usage": true,
  "nsamples": 512,
  "block_name_to_quantize": "model.language_model.layers",
  "quant_method": "auto-round",
  "packing_format": "auto_round:auto_gptq"
}

Key parameters:

iters=1000: Maximum SignRound optimization steps per block, 5× the default of 200 (roughly 5× slower, in exchange for the best accuracy)
nsamples=512: 512 calibration samples (4× default of 128)
sym=true: Symmetric quantization (no zero-point)
group_size=128: Per-128-element scaling groups
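The storage cost implied by these parameters can be estimated with a little arithmetic. The sketch below assumes one 16-bit scale per group and ignores any zero-point or packing overhead in the checkpoint format, so treat the result as an approximation; the function names are hypothetical.

```python
def effective_bits_per_weight(bits: int = 4,
                              group_size: int = 128,
                              scale_bits: int = 16) -> float:
    """Weight bits plus the per-group scale amortized over the group."""
    return bits + scale_bits / group_size


def compression_vs_bf16(bits_per_weight: float) -> float:
    """How many times smaller a quantized weight is than a 16-bit one."""
    return 16.0 / bits_per_weight


bpw = effective_bits_per_weight()   # 4 + 16/128
ratio = compression_vs_bf16(bpw)    # roughly 3.9x for the quantized layers
```

Under these assumptions each quantized weight costs 4.125 bits. The ratio applies only to the quantized Linear layers; components kept in BF16 are unaffected.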

Calibration Dataset

The 512 calibration samples were built from a domain-specific mix of text-only datasets (no images/video, to avoid torchvision import errors in the llm-compressor environment). Short samples were concatenated into chunks of ≥2,048 Tekken tokens each:

Source	Domain	HF ID
Magicoder-Evol-Instruct	Coding (instruction + response pairs)	`ise-uiuc/Magicoder-Evol-Instruct-110K`
xLAM Function Calling	Tool/function calling (query + tools + answers)	`Salesforce/xlam-function-calling-60k`
Hermes Function Calling v1	Tool calling (ShareGPT format conversations)	`NousResearch/hermes-function-calling-v1`
Pile-10k	General reasoning and knowledge	`NeelNanda/pile-10k`
Domain instructions	Coding + tool calling (local file, 5× duplicated for weight)	Local: `imatrix_mistral_domain_calib_5x.txt`
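The concatenation step described above can be sketched as a greedy packer over already-tokenized samples. This is a minimal sketch, not the script used for this model: the function name is hypothetical, separators between samples are omitted, and dropping the final short remainder is an assumption.

```python
from typing import Iterable, List


def pack_samples(tokenized: Iterable[List[int]],
                 min_tokens: int = 2048) -> List[List[int]]:
    """Concatenate token-id lists until each chunk has >= min_tokens."""
    chunks: List[List[int]] = []
    current: List[int] = []
    for ids in tokenized:
        current.extend(ids)
        if len(current) >= min_tokens:
            chunks.append(current)
            current = []
    # A trailing remainder shorter than min_tokens is dropped (assumption).
    return chunks
```

Packing this way means every calibration sample exercises the model at a realistic sequence length, even when individual source records are short.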

Quality Benchmarks

All benchmarks use wikitext-2-raw-v1 (test split) — the standard dataset for quantization quality evaluation, matching the methodology of llama.cpp, AutoAWQ, GPTQ, and the academic literature.

WikiText-2 Perplexity (ctx=512)

Measured via vLLM completions API with echo=True + logprobs. 642 non-overlapping chunks, ~301K tokens scored. Methodology matches llama.cpp ./perplexity -c 512.

Model	PPL	Δ vs Base
Base BF16	7.0332	—
AutoRound W4A16 (this model)	7.2478	+3.05%
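For reference, perplexity is the exponential of the mean negative log-likelihood per scored token. The sketch below shows that definition and how the relative change in the table is derived; the helper names are hypothetical, and the log-probabilities are assumed to be natural logs, as returned by OpenAI-compatible completions APIs.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-probability over all scored tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_change_pct(quant_ppl: float, base_ppl: float) -> float:
    """Percentage increase of the quantized PPL over the base PPL."""
    return (quant_ppl - base_ppl) / base_ppl * 100.0


delta = relative_change_pct(7.2478, 7.0332)  # values from the table above
```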

KL Divergence vs BF16 (Static / Prefill)

KL divergence measures how much the output probability distribution has shifted from the base model. Lower is better; 0 = identical.

Methodology matches llama.cpp --kl-divergence: wikitext-2-raw-v1, ctx=512, score only the second half of each chunk (positions [256–511]), which ensures every scored token has at least 256 tokens of left context. KLD direction: KL(P_base ‖ P_quant) — "how well does the quantized model approximate the base?"

Metric	Value
Mean KLD	0.0746
Median KLD	0.0279
95th %ile KLD	0.286
99th %ile KLD	0.592
Δp RMS	2.12%
Same top-p	92.6%

Same top-p = 92.6% means the quantized and base models agree on the most likely token at 92.6% of scored positions. This is top-1 agreement and is unrelated to nucleus (top-p) sampling.
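The two headline metrics can be illustrated on toy distributions. The sketch below assumes full per-position probability vectors over a shared vocabulary, which is a simplification of the API-based measurement; the function names and the `skip` parameter are hypothetical stand-ins for the "score only the second half" rule.

```python
import math
from typing import List, Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(P || Q) in nats; P is the base model, Q the quantized model."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def score_chunk(base: List[Sequence[float]],
                quant: List[Sequence[float]],
                skip: int = 256) -> Tuple[List[float], List[bool]]:
    """Per-position KLD and top-1 agreement for positions >= skip."""
    klds, same_top = [], []
    for p, q in zip(base[skip:], quant[skip:]):
        klds.append(kl_divergence(p, q))
        same_top.append(p.index(max(p)) == q.index(max(q)))
    return klds, same_top
```

Mean and percentile KLD are then statistics over the collected `klds`, and the agreement rate is the fraction of `True` entries in `same_top`.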

Note on API-based KLD: These measurements use vLLM's top-20 logprobs per token (API limit), not full-vocab logits. This underestimates absolute KLD by ~10–15% compared to llama.cpp's full-vocab computation (see mlx-kld analysis). The corrected full-vocab KLD is estimated at ~0.09. Relative comparisons between quantization methods remain valid regardless.
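One way to see why a truncated view underestimates KLD: merging every token outside the top-k into a single "other" bucket can never increase KL divergence. The sketch below demonstrates this on a toy distribution. How the measurements above treated tokens missing from either top-20 list is not specified, so the bucketing scheme here is an illustrative assumption, and the toy numbers are unrelated to the reported results.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(P || Q) in nats over aligned probability vectors."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def bucketed_kl(p: Sequence[float], q: Sequence[float], k: int) -> float:
    """KL after collapsing everything outside P's top-k into one bucket."""
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    keep = set(top)
    p_tail = sum(p[i] for i in range(len(p)) if i not in keep)
    q_tail = sum(q[i] for i in range(len(q)) if i not in keep)
    p_b = [p[i] for i in top] + [p_tail]
    q_b = [q[i] for i in top] + [q_tail]
    return kl(p_b, q_b)


p = [0.5, 0.2, 0.1, 0.1, 0.1]      # toy base distribution
q = [0.4, 0.3, 0.05, 0.15, 0.1]    # toy quantized distribution
full, truncated = kl(p, q), bucketed_kl(p, q, k=2)  # truncated <= full
```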

KL Divergence vs BF16 (Generation / Autoregressive)

KLD measured per generation step during autoregressive decoding. Step-0 = first generated token (comparable to static/prefill KLD). Later steps compound — small per-token divergences accumulate as the two models diverge onto different token trajectories. This is normal and expected for 4-bit quantization.

Prompt Len	Gen Len	Step-0 KLD
128	128	0.072
512	128	0.080
1024	128	0.084

Step-0 KLD (0.072–0.084) is consistent with the static prefill KLD (0.0746). Across the three prompt lengths tested, it rises slightly as the prompt grows, from 0.072 at 128 tokens to 0.084 at 1,024 tokens.

vLLM Throughput (RTX 5090, 32 GB)

Single Request

Metric	Value
Aggregate Throughput	107.6 tok/s
Total tokens generated	11,251 (20 requests × up to 1,024 tokens)
Average Latency	5.23 s
Min/Max Latency	0.29 s / 9.35 s
Per-request Throughput	9.2–110.8 tok/s
Success Rate	20/20 (100%)

32 Concurrent Requests

Metric	Value
Aggregate Throughput	2,604.0 tok/s
Total tokens generated	354,430 (640 requests × up to 1,024 tokens)
Average Latency	6.51 s (end-to-end per request, including queuing)
Min/Max Latency	0.23 s / 13.32 s
Per-request Throughput	2.3–99.6 tok/s
Success Rate	640/640 (100%)

Each request generated up to 1,024 tokens. Average latency includes queuing time under concurrent load.
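The throughput figures can be cross-checked with simple arithmetic. The sketch below assumes aggregate throughput is total generated tokens divided by wall-clock time, and that the single-request run issued its 20 requests back to back, so wall time is approximately the sum of the latencies. Both are assumptions about the benchmark, and the helper name is hypothetical.

```python
def aggregate_throughput(total_tokens: int, wall_seconds: float) -> float:
    """Generated tokens per second over the whole benchmark run."""
    return total_tokens / wall_seconds


# Single-request run: 20 sequential requests at 5.23 s average latency.
single_wall = 20 * 5.23                                  # about 104.6 s
single_tps = aggregate_throughput(11_251, single_wall)   # about 107.6 tok/s

# Concurrent run: wall time implied by the reported aggregate figure.
concurrent_wall = 354_430 / 2604.0                       # about 136.1 s
speedup = 2604.0 / 107.6                                 # batching gain, ~24x
```

Under these assumptions the single-request numbers are self-consistent, and running 32 requests concurrently raises aggregate throughput roughly 24-fold while average end-to-end latency grows from 5.23 s to 6.51 s.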

Hardware Requirements

GPU VRAM	Recommended Config
96 GB (RTX PRO 6000, H100, H200)	`gpu_memory_utilization: 0.95`, `max_model_len: 131072`, KV: fp8_e4m3 optional
32 GB (RTX 5090)	`gpu_memory_utilization: 0.96`, `max_model_len: 131072`, `kv_cache_dtype: fp8_e4m3`, `max_num_batched_tokens: 8192`

Minimum: 1× GPU with ≥24 GB VRAM (with fp8 KV cache and reduced context).
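The role of the FP8 KV cache on smaller cards follows from the architecture numbers in Model Details (40 layers, 8 KV heads, head dimension 128). The sketch below estimates KV-cache size for one full-length sequence; it ignores allocator overhead, FP8 scaling metadata, and activation memory, so real usage will be somewhat higher. The function names are hypothetical.

```python
def kv_bytes_per_token(layers: int = 40,
                       kv_heads: int = 8,
                       head_dim: int = 128,
                       bytes_per_elem: int = 2) -> int:
    """Keys plus values stored for one token across all decoder layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(context_tokens: int, bytes_per_elem: int) -> float:
    """KV-cache size in GiB for a single sequence of the given length."""
    per_token = kv_bytes_per_token(bytes_per_elem=bytes_per_elem)
    return per_token * context_tokens / 2**30


bf16_gib = kv_cache_gib(131_072, bytes_per_elem=2)  # 16-bit cache
fp8_gib = kv_cache_gib(131_072, bytes_per_elem=1)   # fp8_e4m3 cache
```

A full 131,072-token sequence needs about 20 GiB of KV cache in BF16 and about 10 GiB in FP8. Added to the roughly 14 GB of weight shards, the BF16 figure would not fit on a 32 GB card, which is consistent with the FP8 setting in the 32 GB configuration.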

AutoRound W4A16 targets GPUs with compute capability sm75 or newer, including Ampere (A100, RTX 3090), Ada (RTX 4090), Hopper (H100, H200), and Blackwell (RTX 5090, RTX PRO 6000). No architecture-specific tensor cores are required; dequantization is handled by Marlin/CUTLASS kernels.

Usage with vLLM

Tested with: vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 (vLLM v0.19.2rc1.dev134)

Docker Deployment

docker run -d --name vllm-mistral-autoround \
  --runtime=nvidia --gpus '"device=0"' \
  -p 8000:8000 \
  -v /path/to/model:/workspace/model \
  -v /path/to/vllm_config.yaml:/vllm_config.yaml \
  --ipc=host --shm-size=16g \
  --restart unless-stopped \
  vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 \
  --config /vllm_config.yaml

Example vLLM Configuration (YAML)

This configuration is deployed and verified on an RTX 5090 (32 GB):

# -- Model & Server ----------------------------------------------------------
model: /workspace/model
host: "0.0.0.0"
port: 8000
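# Label that clients pass as "model". It is free-form (this one says AWQ although the weights are AutoRound).
served_model_name: "mistral-small-3.2-24b-Instruct-AWQ-W4A16"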
trust_remote_code: true
tensor_parallel_size: 1

# -- Quantization ------------------------------------------------------------
quantization: auto_round

# -- Tokenizer & Config Format -----------------------------------------------
tokenizer_mode: mistral
config_format: mistral

# -- Data Type ---------------------------------------------------------------
dtype: bfloat16

# -- Load Format -------------------------------------------------------------
load_format: auto

# -- Context & Batching ------------------------------------------------------
max_model_len: 131072
max_num_batched_tokens: 8192
max_num_seqs: 32

# -- Memory ------------------------------------------------------------------
gpu_memory_utilization: 0.96
enable_prefix_caching: true
enable_chunked_prefill: true
kv_cache_dtype: fp8_e4m3

# -- Multi-Modal -------------------------------------------------------------
limit_mm_per_prompt:
  image: 4

# -- Tool Calling ------------------------------------------------------------
# Mistral native tool call format.
enable_auto_tool_choice: true
tool_call_parser: mistral

# -- Default Generation ------------------------------------------------------
generation_config: auto
override_generation_config:
  temperature: 0.15

# -- Misc --------------------------------------------------------------------
disable_custom_all_reduce: true

Inference Test

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mistral-small-3.2-24b-Instruct-AWQ-W4A16","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":10}'
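The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from the Docker example is reachable at `localhost:8000` and uses the served model name from the YAML configuration; `build_payload` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the Docker port mapping above
MODEL_NAME = "mistral-small-3.2-24b-Instruct-AWQ-W4A16"  # served_model_name


def build_payload(prompt: str, max_tokens: int = 10) -> dict:
    """Build an OpenAI-compatible chat completion request body."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, max_tokens: int = 10, timeout: float = 60.0) -> str:
    """POST the prompt to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_payload(prompt, max_tokens)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("What is 2+2? One word."))` mirrors the curl test.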

Notes

Tokenizer: Tekken (Mistral-specific)

This model uses the Tekken tokenizer (tekken.json), not a standard HF tokenizer. You must use --tokenizer-mode mistral with vLLM to ensure correct tokenization. Using auto or hf mode produces garbled output.

Vision: Image Size Limit

The tekken.json in this repository has max_image_size set to 1024 (down from the original 1540). Images with any dimension exceeding 1024px are proportionally downscaled before vision encoding.
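To anticipate the resolution an image will have after this limit is applied, the proportional downscale can be approximated as follows. The rounding rule is an assumption, and any later alignment to patch boundaries by the real preprocessor is ignored; `fit_within` is a hypothetical helper.

```python
from typing import Tuple


def fit_within(width: int, height: int,
               max_side: int = 1024) -> Tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the limit, left unchanged
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height
```

For example, a 2048×1024 image would be reduced to 1024×512 under this rule, while an 800×600 image passes through unchanged.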

Files in This Repository

File	Size	Description
`model-00001-of-00005.safetensors`	~2.9 GB	Quantized LM layers (shard 1)
`model-00002-of-00005.safetensors`	~2.9 GB	Quantized LM layers (shard 2)
`model-00003-of-00005.safetensors`	~2.9 GB	Quantized LM layers (shard 3)
`model-00004-of-00005.safetensors`	~2.9 GB	Quantized LM layers (shard 4)
`model-00005-of-00005.safetensors`	~2.7 GB	LM layers + vision tower + projector + lm_head (BF16)
`model.safetensors.index.json`	—	Shard index with weight map
`config.json`	—	Model configuration with quantization_config
`params.json`	—	Mistral-native parameter specification
`quantization_config.json`	—	AutoRound quantization parameters
`tekken.json`	~15 MB	Tekken tokenizer (Mistral-specific)
`tokenizer.json`	~20 MB	HF-compatible tokenizer fallback
`tokenizer_config.json`	~22 MB	Tokenizer configuration
`generation_config.json`	—	Generation parameters
`preprocessor_config.json`	—	Image preprocessor configuration
`processor_config.json`	—	Processor configuration

License

This quantization is released under the Apache 2.0 License, the same license as the base model mistralai/Mistral-Small-3.2-24B-Instruct-2506.

Citation

If you use this model in your research, please cite:

@misc{mistral-small-3.2-24b-autoround-w4a16,
  title = {Mistral-Small-3.2-24B-Instruct-2506 AutoRound W4A16 Quantization},
  author = {Gratex International},
  year = {2026},
  howpublished = {\url{https://huggingface.co/gratex/Mistral-Small-3.2-24B-Instruct-2506-W4A16-AutoRound}},
  note = {Quantized with AutoRound 0.10.2}
}

Acknowledgments

This quantization was produced using hardware provided by Gratex International, a.s.

Original Model: mistralai/Mistral-Small-3.2-24B-Instruct-2506
Quantization Tool: AutoRound
Deployment Engine: vLLM
