Mistral-Small-3.2-24B-Instruct-2506 — GPTQ W4A16 Quantization

This is a GPTQ W4A16 (4-bit weight-only) quantization of Mistral-Small-3.2-24B-Instruct-2506, a 24B-parameter multimodal model with Pixtral-style vision capabilities. It was quantized with AutoRound v0.12.3 using SignRound optimization (1000 iterations) and exported in GPTQ format for use with Marlin/CUTLASS kernels.

Model Details

Property	Value
Base Model	mistralai/Mistral-Small-3.2-24B-Instruct-2506
Quantization Method	AutoRound (SignRound, W4A16) → GPTQ export
Weight Precision	INT4 (group_size=128, symmetric, desc_act=false)
Activation Precision	FP16 (weight-only quantization)
Quantization Library	AutoRound 0.12.3
Packing Format	auto_gptq (Marlin-compatible)
Architecture	Mistral3ForConditionalGeneration
LM Layers	40 MistralDecoder layers
Hidden Size	5120
Intermediate Size	32768
Attention Heads	32 (query), 8 (key/value, GQA)
Head Dimension	128
Vocabulary Size	131,072 (LM head dimension; Tekken tokenizer contains 150,000 regular + 1,000 special tokens, but only 131,072 are used by the model)
Context Window	131,072 tokens
Vision Encoder	Pixtral (24 layers, hidden_size=1024, patch_size=14)
Vision Projector	patch_merge (spatial_merge_size=2)
Quantized Components	Text decoder Linear layers only
Preserved in FP16	Vision tower (all 24 layers), multi-modal projector, lm_head, embed_tokens, layer norms

Quantization Configuration

{
  "bits": 4,
  "data_type": "int",
  "group_size": 128,
  "sym": true,
  "batch_size": 4,
  "iters": 1000,
  "low_gpu_mem_usage": true,
  "nsamples": 512,
  "desc_act": false,
  "true_sequential": false,
  "damp_percent": 0.01,
  "lm_head": false,
  "autoround_version": "0.12.3",
  "provider": "auto-round",
  "quant_method": "gptq"
}

Key parameters:

iters=1000: Maximum SignRound optimization steps per block (~5× slower than the default of 200, chosen for best accuracy)
nsamples=512: 512 calibration samples (4× default of 128)
sym=true: Symmetric quantization (no zero-point)
group_size=128: Per-128-element scaling groups
desc_act=false: Activation-order reordering disabled; required for Marlin kernel compatibility
quant_method=gptq: Exported in GPTQ format (auto_gptq packing) for Marlin kernel acceleration
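To make sym=true and group_size=128 concrete, here is a minimal sketch of symmetric group-wise 4-bit quantization using plain round-to-nearest. It is not AutoRound's SignRound procedure, which additionally optimizes the rounding over iters steps. The function names and the max-abs/7 scale choice are illustrative assumptions, not the library's implementation.

```python
def quantize_group(weights):
    """Map one group of floats to signed 4-bit integers that share a single scale."""
    max_abs = max(abs(w) for w in weights)
    # Symmetric scheme: no zero-point, so the scale alone defines the grid.
    # Assumption: scale = max|w| / 7, which keeps every code inside [-7, 7].
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Reconstruct approximate floats from the integer codes."""
    return [c * scale for c in codes]


def quantize_row(row, group_size=128):
    """Quantize a weight row in consecutive groups of `group_size` elements."""
    return [quantize_group(row[i:i + group_size]) for i in range(0, len(row), group_size)]
```

Each group of 128 weights is stored as 128 four-bit codes plus one scale, so the reconstruction error of any weight in this sketch is at most half of its group's scale.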

Calibration Dataset

The 512 calibration samples come from a domain-specific composite dataset (mistral_autoround_calib_slovak_insurance.jsonl) built for Gratex's insurance-industry use case. All samples are text-only. Short samples are concatenated into chunks of at least 7,000 characters, a conservative character-count proxy for at least 2,048 Tekken tokens, because AutoRound's filter_func drops samples with fewer than seqlen tokens.
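The concatenation step can be sketched as a greedy join over consecutive samples. This is an illustration only: the function name, the blank-line separator, and the decision to drop a too-short trailing remainder are assumptions, not details of the actual dataset builder.

```python
def concat_into_chunks(samples, min_chars=7000, separator="\n\n"):
    """Greedily join consecutive samples until each chunk has at least min_chars."""
    chunks, buffer, size = [], [], 0
    for text in samples:
        # Track the joined length: every sample after the first adds a separator.
        size += len(text) + (len(separator) if buffer else 0)
        buffer.append(text)
        if size >= min_chars:
            chunks.append(separator.join(buffer))
            buffer, size = [], 0
    # Assumption: a trailing remainder shorter than min_chars is discarded,
    # since it would be filtered out later for being too short anyway.
    return chunks
```

A sample that is already long enough becomes its own chunk, so no text is split.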

The dataset is built from the following sources:

Source	Domain	HF ID / URL	License	Weight
Slovak language (~40%)
FineWeb2 Slovak	General Slovak text	`ivykopal/fineweb2-slovak`	ODC-By 1.0	2×
Slovak Wikipedia	Slovak-language Wikipedia articles	Local: `/data/skwiki-extracted`	CC-BY-SA 4.0	1×
Insurance terminology (~30%)
Insurance Contract Definitions	6k+ English insurance term definitions	`codexstanford/insurance-contract-definitions`	MIT	2×
Actuarial Ontology	Actuarial concepts (TTL format)	`Actuarial-Notes/Actuarial-Ontology`	MIT	1×
Bitext Insurance Chatbot	39k insurance QA pairs	`bitext/Bitext-insurance-llm-chatbot-training-dataset`	CDLA-Sharing 1.0	2×
Tool calling (~30%)
Hermes Function Calling v1	100k+ ShareGPT tool-calling conversations	`NousResearch/hermes-function-calling-v1`	Apache 2.0	1×
ToolACE	11,300 rows, 26k diverse APIs	`Team-ACE/ToolACE`	Apache 2.0	1×
When2Call	When NOT to call tools	`nvidia/When2Call`	CC-BY 4.0	1×
General fill
Pile-10k	General text	`NeelNanda/pile-10k`	—	1×

Quality Benchmarks

All quality benchmarks use wikitext-2-raw-v1 (test split), the standard dataset for quantization quality evaluation, matching the methodology of llama.cpp, AutoAWQ, GPTQ, and the academic literature.

WikiText-2 Perplexity (ctx=512)

Measured via vLLM completions API with echo=True + logprobs. 642 non-overlapping chunks, ~301K tokens scored. Methodology matches llama.cpp ./perplexity -c 512.

Model	PPL	Δ vs Base
Base BF16	7.0307	—
GPTQ W4A16 (this model)	7.2620	+3.29%
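For readers reproducing the table, perplexity is the exponential of the negative mean log-probability of the scored tokens, and the Δ column is the relative change against the base model. The sketch below assumes natural-log probabilities such as those returned by a logprobs API; the helper names are hypothetical, and which tokens of each chunk are scored is left to the caller.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability over all scored tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def non_overlapping_chunks(token_ids, ctx=512):
    """Split a token stream into full, non-overlapping windows of ctx tokens."""
    n_full = len(token_ids) // ctx
    return [token_ids[i * ctx:(i + 1) * ctx] for i in range(n_full)]


def relative_change_percent(ppl_quant, ppl_base):
    """Relative perplexity increase of the quantized model, in percent."""
    return (ppl_quant / ppl_base - 1.0) * 100.0
```

Applied to the two values in the table, relative_change_percent(7.2620, 7.0307) rounds to 3.29.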

KL Divergence vs BF16 (Static / Prefill)

KL divergence measures how much the output probability distribution has shifted from the base model. Lower is better; 0 = identical.

Methodology matches llama.cpp --kl-divergence: wikitext-2-raw-v1, ctx=512, score only the second half of each chunk (positions [256–511]), which ensures every scored token has at least 256 tokens of left context. KLD direction: KL(P_base ‖ P_quant) — "how well does the quantized model approximate the base?"
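The definition above can be written down in a few lines. This is a sketch under the assumption that both models' full probability distributions are available per position as plain lists; the function names are illustrative and do not come from llama.cpp or vLLM.

```python
import math


def kl_divergence(p_base, p_quant, eps=1e-12):
    """KL(P_base || P_quant) in nats for two distributions over the same vocabulary."""
    return sum(p * math.log(p / max(q, eps)) for p, q in zip(p_base, p_quant) if p > 0)


def scored_positions(ctx=512):
    """Second half of a chunk, so each scored token has at least ctx // 2 tokens of context."""
    return range(ctx // 2, ctx)


def mean_kld(base_dists, quant_dists, ctx=512):
    """Average per-position KLD over the scored half of one chunk."""
    positions = scored_positions(ctx)
    total = sum(kl_divergence(base_dists[i], quant_dists[i]) for i in positions)
    return total / len(positions)
```

With ctx=512, scored_positions covers indices 256 through 511, which is the range named in the methodology.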

Metric	Value
Mean KLD	0.0793
Median KLD	0.0346
95th %ile KLD	0.280
99th %ile KLD	0.617
Δp RMS	2.30%
Same top-p	92.2%

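Same top-p = 92.2% means the quantized and base models agree on the most likely (top-1) token at 92.2% of scored positions. The name follows llama.cpp's statistic and is unrelated to top-p (nucleus) sampling.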
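Given per-position KLD values and each model's most likely token, the summary rows in the table could be computed roughly as follows. The nearest-rank percentile rule and the function names are assumptions; llama.cpp may interpolate percentiles differently.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct percent of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def same_top_rate(base_top_tokens, quant_top_tokens):
    """Fraction of positions where both models rank the same token first."""
    matches = sum(b == q for b, q in zip(base_top_tokens, quant_top_tokens))
    return matches / len(base_top_tokens)


def summarize(klds):
    """Mean, median, and tail percentiles of per-position KLD values."""
    return {
        "mean": statistics.mean(klds),
        "median": statistics.median(klds),
        "p95": percentile(klds, 95),
        "p99": percentile(klds, 99),
    }
```

A mean well above the median, as in the table (0.0793 vs 0.0346), indicates a small number of positions with large divergence rather than a uniform shift.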

Note on API-based KLD: These measurements use vLLM's top-20 logprobs per token (API limit), not full-vocab logits. This underestimates absolute KLD by ~10–15% compared to llama.cpp's full-vocab computation (see mlx-kld analysis). The corrected full-vocab KLD is estimated at ~0.10. Relative comparisons between quantization methods remain valid regardless.
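Why truncation lowers the measured value can be illustrated with a toy distribution: merging all tokens outside the top k into a single bucket can never increase KL divergence, because detail in the tail is lost. The sketch assumes full distributions are available so both variants can be compared; how the measurements in this card handled the probability mass outside the top 20 is not stated, so the single-bucket treatment is an assumption.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) in nats over aligned probability lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def bucket_tail(p_base, p_quant, k):
    """Keep the base model's k most likely tokens and merge all others into one bucket."""
    top = sorted(range(len(p_base)), key=lambda i: p_base[i], reverse=True)[:k]
    kept = set(top)
    base_tail = sum(p for i, p in enumerate(p_base) if i not in kept)
    quant_tail = sum(p for i, p in enumerate(p_quant) if i not in kept)
    return [p_base[i] for i in top] + [base_tail], [p_quant[i] for i in top] + [quant_tail]


# Toy example: the two models disagree mostly in the tail.
p_base = [0.5, 0.3, 0.1, 0.1]
p_quant = [0.4, 0.3, 0.2, 0.1]
full_kld = kl(p_base, p_quant)
truncated_kld = kl(*bucket_tail(p_base, p_quant, k=2))
```

In this toy case truncated_kld is smaller than full_kld, the same direction of bias the note describes for top-20 logprobs.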

KL Divergence vs BF16 (Generation / Autoregressive)

KLD measured per generation step during autoregressive decoding. Step-0 = first generated token (comparable to static/prefill KLD). Later steps compound — small per-token divergences accumulate as the two models diverge onto different token trajectories.

Prompt Len	Gen Len	Step-0 KLD
128	128	0.152
512	128	0.109
1024	128	0.115

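Step-0 KLD (0.109–0.152) is higher than the static prefill KLD (0.0793). This is expected: step-0 is measured at a single position per prompt under greedy decoding, so it is not smoothed by averaging over many positions the way the static measurement is. The shortest prompt (128 tokens) shows the highest KLD, consistent with less context making the prediction more sensitive to quantization noise; the 512- and 1024-token results (0.109 vs 0.115) are too close to indicate a trend.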

vLLM Throughput (RTX 5090, 32 GB)

Single Request

Metric	Value
Aggregate Throughput	107.2 tok/s
Total tokens generated	11,371 (20 requests × up to 1,024 tokens)
Average Latency	5.30 s
Min/Max Latency	0.44 s / 9.25 s
Per-request Throughput	3.5–113.3 tok/s
Success Rate	20/20 (100%)

32 Concurrent Requests

Metric	Value
Aggregate Throughput	2,638.4 tok/s
Total tokens generated	367,797 (640 requests × up to 1,024 tokens)
Average Latency	6.75 s (end-to-end per request, including queuing)
Min/Max Latency	0.42 s / 11.85 s
Per-request Throughput	2.1–95.6 tok/s
Success Rate	640/640 (100%)

Each request generated up to 1,024 tokens. Average latency includes queuing time under concurrent load.
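The relationship between the rows can be checked with simple arithmetic. The sketch assumes that aggregate throughput means total generated tokens divided by wall-clock time for the whole run, which the tables do not state explicitly; the function names are illustrative.

```python
def aggregate_throughput(tokens_per_request, wall_clock_seconds):
    """Total generated tokens per second across all requests in a run."""
    return sum(tokens_per_request) / wall_clock_seconds


def per_request_throughput(tokens, latency_seconds):
    """Tokens per second as seen by a single request, including any queuing time."""
    return tokens / latency_seconds


def implied_wall_clock(total_tokens, throughput):
    """Run duration implied by a reported token count and aggregate throughput."""
    return total_tokens / throughput
```

Under that assumption the single-request run lasted about 106 s (11,371 tokens at 107.2 tok/s), which matches 20 sequential requests at 5.30 s average latency, and the 32-way concurrent run lasted about 139 s.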

Hardware Requirements

GPU VRAM	Recommended Config
96 GB (RTX PRO 6000, H100, H200)	`gpu_memory_utilization: 0.95`, `max_model_len: 131072`, `kv_cache_dtype: fp8_e4m3` (optional)
32 GB (RTX 5090)	`gpu_memory_utilization: 0.96`, `max_model_len: 131072`, `kv_cache_dtype: fp8_e4m3`, `max_num_batched_tokens: 8192`

Minimum: 1× GPU with ≥24 GB VRAM (with fp8 KV cache and reduced context).

GPTQ W4A16 with Marlin kernels requires Ampere or later (sm80+): A100, RTX 3090, RTX 4090, RTX 5090, RTX PRO 6000, H100, H200. Pre-Ampere GPUs (V100, GTX 1080) are NOT supported by Marlin kernels.
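A back-of-the-envelope KV-cache estimate shows why the 32 GB configuration uses an fp8 KV cache. The sketch uses the layer count, KV head count, and head dimension from the Model Details table; it ignores activations, CUDA graphs, and vLLM's own bookkeeping, so treat the result as a rough lower bound rather than vLLM's actual accounting.

```python
def kv_cache_bytes_per_token(layers=40, kv_heads=8, head_dim=128, bytes_per_value=1):
    """Bytes of KV cache one token occupies across all decoder layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(tokens, bytes_per_value):
    """KV cache size in GiB for `tokens` cached tokens."""
    return tokens * kv_cache_bytes_per_token(bytes_per_value=bytes_per_value) / 1024**3


fp8_full_context = kv_cache_gib(131072, bytes_per_value=1)   # fp8_e4m3: 1 byte per value
fp16_full_context = kv_cache_gib(131072, bytes_per_value=2)  # fp16/bf16: 2 bytes per value
```

One full 131,072-token sequence needs 10 GiB of cache in fp8 and 20 GiB in 16-bit precision. Next to roughly 15 GB of weight shards, only the fp8 variant leaves headroom on a 32 GB card.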

Usage with vLLM

Tested with: vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 (vLLM v0.19.2rc1.dev134)

Docker Deployment

docker run -d --name vllm-mistral-gptq \
  --runtime=nvidia --gpus '"device=0"' \
  -p 8000:8000 \
  -v /path/to/model:/workspace/model \
  -v /path/to/vllm_config.yaml:/vllm_config.yaml \
  --ipc=host --shm-size=16g \
  --restart unless-stopped \
  vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 \
  --config /vllm_config.yaml

Example vLLM Configuration (YAML)

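This configuration is deployed and verified on an RTX 5090 (32 GB). The verified deployment runs on GPU 1 and host port 5006, whereas the example commands in this card use GPU 0 and port 8000: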

# vLLM Configuration — Mistral-Small-3.2-24B GPTQ W4A16 (auto-round)

# -- Model & Server ----------------------------------------------------------
model: /workspace/model
host: "0.0.0.0"
port: 8000
served_model_name: "mistral-small-3.2-24b-gptq-W4A16-v1"
trust_remote_code: true
tensor_parallel_size: 1

# -- Quantization ------------------------------------------------------------
quantization: gptq_marlin

# -- Tokenizer & Config Format -----------------------------------------------
tokenizer_mode: mistral
config_format: mistral

# -- Data Type ---------------------------------------------------------------
# vLLM infers float16 from safetensors and casts to bfloat16 for computation.
# Both float16 and bfloat16 work; bfloat16 is recommended for Blackwell (sm120).
dtype: bfloat16

# -- Load Format -------------------------------------------------------------
load_format: auto

# -- Context & Batching ------------------------------------------------------
max_model_len: 131072
max_num_batched_tokens: 8192
max_num_seqs: 32

# -- Memory ------------------------------------------------------------------
gpu_memory_utilization: 0.96
enable_prefix_caching: true
enable_chunked_prefill: true
kv_cache_dtype: fp8_e4m3

# -- Multi-Modal -------------------------------------------------------------
limit_mm_per_prompt:
  image: 4

# -- Tool Calling ------------------------------------------------------------
# Mistral native tool call format.
enable_auto_tool_choice: true
tool_call_parser: mistral

# -- Default Generation ------------------------------------------------------
generation_config: auto
override_generation_config:
  temperature: 0.15

# -- Misc --------------------------------------------------------------------
disable_custom_all_reduce: true

Inference Test

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mistral-small-3.2-24b-gptq-W4A16-v1","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":10}'
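The same request can be sent from Python with only the standard library. This is a sketch: the helper names are hypothetical, the model value must match served_model_name from the YAML configuration above, and the response parsing assumes the usual OpenAI-compatible chat completion shape.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="mistral-small-3.2-24b-gptq-W4A16-v1", max_tokens=10):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,  # must equal served_model_name in the vLLM config
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, send(build_chat_request("What is 2+2? One word.")) returns the model's reply as a string.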

Notes

Tokenizer: Tekken (Mistral-specific)

This model uses the Tekken tokenizer (tekken.json), not a standard HF tokenizer. You must use --tokenizer-mode mistral with vLLM to ensure correct tokenization. Using auto or hf mode produces garbled output.

Data Type: float16 weights, bfloat16 computation

The model weights are stored as float16 (set in config.json: torch_dtype: float16). vLLM infers float16 from the safetensors files and can cast to bfloat16 for computation on Blackwell (sm120) and later GPUs. Both --dtype float16 and --dtype bfloat16 work; bfloat16 is recommended for Blackwell. Older vLLM versions (< v0.19) may require float16 explicitly.

Config Format: mistral

Use --config-format mistral with this model. vLLM reads model architecture from params.json and quantization_config from config.json. The --config-format hf path triggers PixtralProcessor which produces a Token out of vocabulary error.

Vision: Image Size Limit

The tekken.json in this repository has max_image_size set to 1024 (down from the original 1540). Images with any dimension exceeding 1024px are proportionally downscaled before vision encoding.
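To estimate the resolution an image will have after this limit is applied, a proportional downscale can be sketched as below. The rounding rule is an assumption, and any additional alignment to the 14-pixel patch grid done by the real preprocessor is not modeled.

```python
def downscale_to_limit(width, height, max_size=1024):
    """Shrink an image size proportionally so that neither side exceeds max_size."""
    longest = max(width, height)
    if longest <= max_size:
        return width, height  # already within the limit, left untouched
    # Multiply before dividing so exact ratios stay exact in floating point.
    new_width = max(1, round(width * max_size / longest))
    new_height = max(1, round(height * max_size / longest))
    return new_width, new_height
```

For example, a 2048×1024 image becomes 1024×512, while an 800×600 image is passed through unchanged.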

Files in This Repository

File	Size	Description
`model-00001-of-00005.safetensors`	~3.0 GB	Quantized LM layers + lm_head + embed_tokens (shard 1)
`model-00002-of-00005.safetensors`	~3.0 GB	Quantized LM layers (shard 2)
`model-00003-of-00005.safetensors`	~3.0 GB	Quantized LM layers (shard 3)
`model-00004-of-00005.safetensors`	~3.0 GB	Quantized LM layers (shard 4)
`model-00005-of-00005.safetensors`	~3.0 GB	Quantized LM layers + vision tower + projector (shard 5)
`model.safetensors.index.json`	—	Shard index with weight map
`config.json`	—	Model configuration with quantization_config
`params.json`	—	Mistral-native parameter specification
`quantization_config.json`	—	GPTQ quantization parameters
`tekken.json`	~15 MB	Tekken tokenizer (Mistral-specific)
`tokenizer.json`	~20 MB	HF-compatible tokenizer fallback
`tokenizer_config.json`	~22 MB	Tokenizer configuration
`generation_config.json`	—	Generation parameters
`preprocessor_config.json`	—	Image preprocessor configuration
`processor_config.json`	—	Processor configuration

License

This quantization is released under the Apache 2.0 License, following the base model's license.

The base model mistralai/Mistral-Small-3.2-24B-Instruct-2506 is licensed under Apache 2.0.

Citation

If you use this model in your research, please cite:

@misc{mistral-small-3.2-24b-gptq-w4a16,
  title = {Mistral-Small-3.2-24B-Instruct-2506 GPTQ W4A16 Quantization},
  author = {Gratex International},
  year = {2026},
  howpublished = {\url{https://huggingface.co/gratex/Mistral-Small-3.2-24B-Instruct-2506-GPTQ-W4A16-AutoRound}},
  note = {Quantized with AutoRound 0.12.3, exported in GPTQ format}
}

Acknowledgments

This quantization was produced using hardware provided by Gratex International, a.s.

Original Model: mistralai/Mistral-Small-3.2-24B-Instruct-2506
Quantization Tool: AutoRound
Deployment Engine: vLLM
