
Mistral-Small-3.2-24B-Instruct-2506: GPTQ W4A16 Quantization

This is a GPTQ W4A16 (4-bit weight-only) quantization of Mistral-Small-3.2-24B-Instruct-2506, a 24B-parameter multimodal model with Pixtral-style vision capabilities. It was quantized with AutoRound v0.12.3 using SignRound optimization (1000 iterations) and exported in GPTQ format for use with Marlin/CUTLASS kernels.

Model Details

Property Value
Base Model mistralai/Mistral-Small-3.2-24B-Instruct-2506
Quantization Method AutoRound (SignRound, W4A16) → GPTQ export
Weight Precision INT4 (group_size=128, symmetric, desc_act=false)
Activation Precision FP16 (weight-only quantization)
Quantization Library AutoRound 0.12.3
Packing Format auto_gptq (Marlin-compatible)
Architecture Mistral3ForConditionalGeneration
LM Layers 40 MistralDecoder layers
Hidden Size 5120
Intermediate Size 32768
Attention Heads 32 (query), 8 (key/value, GQA)
Head Dimension 128
Vocabulary Size 131,072 (LM head dimension; Tekken tokenizer contains 150,000 regular + 1,000 special tokens, but only 131,072 are used by the model)
Context Window 131,072 tokens
Vision Encoder Pixtral (24 layers, hidden_size=1024, patch_size=14)
Vision Projector patch_merge (spatial_merge_size=2)
Quantized Components Text decoder Linear layers only
Preserved in FP16 Vision tower (all 24 layers), multi-modal projector, lm_head, embed_tokens, layer norms

Quantization Configuration

{
  "bits": 4,
  "data_type": "int",
  "group_size": 128,
  "sym": true,
  "batch_size": 4,
  "iters": 1000,
  "low_gpu_mem_usage": true,
  "nsamples": 512,
  "desc_act": false,
  "true_sequential": false,
  "damp_percent": 0.01,
  "lm_head": false,
  "autoround_version": "0.12.3",
  "provider": "auto-round",
  "quant_method": "gptq"
}

Key parameters:

  • iters=1000: Maximum SignRound optimization steps per block (~5× slower than default 200, best accuracy)
  • nsamples=512: 512 calibration samples (4× default of 128)
  • sym=true: Symmetric quantization (no zero-point)
  • group_size=128: Per-128-element scaling groups
  • desc_act=false: No activation-order reordering; required for Marlin kernel compatibility
  • quant_method=gptq: Exported in GPTQ format (auto_gptq packing) for Marlin kernel acceleration
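
To make sym=true and group_size=128 concrete, here is a minimal sketch of symmetric group-wise 4-bit quantization using plain round-to-nearest. It is not the AutoRound implementation: SignRound additionally learns rounding adjustments over the optimization iterations, which this sketch omits. The function names and the choice of scale = max|w| / 7 are assumptions made for illustration.

```python
def quantize_group_sym(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1      # 7 for INT4
    qmin = -(2 ** (bits - 1))       # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    # Assumption: the scale maps the largest magnitude to qmax; no zero-point is used.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def quantize_row(row, group_size=128, bits=4):
    """Split a weight row into groups; each group gets its own scale."""
    return [
        quantize_group_sym(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Reconstruct approximate weights from (integers, scale) pairs."""
    return [q * scale for qs, scale in groups for q in qs]


# Toy row of 256 weights: two groups of 128, each with its own scale.
row = [((i * 37) % 101 - 50) / 500.0 for i in range(256)]
groups = quantize_row(row)
restored = dequantize_row(groups)
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

With round-to-nearest, the reconstruction error of each weight is at most half of its group's scale, which is why smaller groups (finer scales) generally reduce error at the cost of storing more scales.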

Calibration Dataset

The 512 calibration samples come from a domain-specific composite dataset (mistral_autoround_calib_slovak_insurance.jsonl) built for Gratex's insurance industry use case. All samples are text-only. Short samples are concatenated into chunks of ≥7,000 characters (conservatively mapped from ≥2,048 Tekken tokens). AutoRound's filter_func drops samples with fewer than seqlen tokens.
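
The concatenation step can be sketched as follows. This is an illustrative reconstruction, not the script used to build the dataset: the function name pack_samples, the separator, and the decision to drop a trailing remainder that never reaches the threshold are all assumptions.

```python
def pack_samples(samples, min_chars=7000, separator="\n\n"):
    """Concatenate short texts until each chunk has at least min_chars characters.

    Assumption: a trailing buffer that never reaches min_chars is dropped,
    since a too-short sample would be filtered out later anyway.
    """
    chunks, buffer, size = [], [], 0
    for text in samples:
        # Account for the separator that join() inserts between items.
        size += len(text) + (len(separator) if buffer else 0)
        buffer.append(text)
        if size >= min_chars:
            chunks.append(separator.join(buffer))
            buffer, size = [], 0
    return chunks


# Five 3,000-character samples: the first three form one chunk,
# the remaining two stay below the threshold and are dropped.
chunks = pack_samples(["a" * 3000] * 5)
```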

The dataset is built from the following sources:

Source Domain HF ID / URL License Weight
Slovak language (~40%)
FineWeb2 Slovak General Slovak text ivykopal/fineweb2-slovak ODC-By 1.0 2×
Slovak Wikipedia Slovak-language Wikipedia articles Local: /data/skwiki-extracted CC-BY-SA 4.0 1×
Insurance terminology (~30%)
Insurance Contract Definitions 6k+ English insurance term definitions codexstanford/insurance-contract-definitions MIT 2×
Actuarial Ontology Actuarial concepts (TTL format) Actuarial-Notes/Actuarial-Ontology MIT 1×
Bitext Insurance Chatbot 39k insurance QA pairs bitext/Bitext-insurance-llm-chatbot-training-dataset CDLA-Sharing 1.0 2×
Tool calling (~30%)
Hermes Function Calling v1 100k+ ShareGPT tool-calling conversations NousResearch/hermes-function-calling-v1 Apache 2.0 1×
ToolACE 11,300 rows, 26k diverse APIs Team-ACE/ToolACE Apache 2.0 1×
When2Call When NOT to call tools nvidia/When2Call CC-BY 4.0 1×
General fill
Pile-10k General text NeelNanda/pile-10k - 1×

Quality Benchmarks

All benchmarks use wikitext-2-raw-v1 (test split), the standard dataset for quantization quality evaluation, matching the methodology of llama.cpp, AutoAWQ, GPTQ, and the academic literature.

WikiText-2 Perplexity (ctx=512)

Measured via vLLM completions API with echo=True + logprobs. 642 non-overlapping chunks, ~301K tokens scored. Methodology matches llama.cpp ./perplexity -c 512.

Model PPL Δ vs Base
Base BF16 7.0307 -
GPTQ W4A16 (this model) 7.2620 +3.29%
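
For reference, perplexity is the exponential of the negative mean token log-probability. The sketch below shows that formula and the relative change reported in the table. The helper names are hypothetical, and how the per-token log-probabilities are collected from the API (and which positions in each chunk are scored) is left to the caller.

```python
import math


def chunk_tokens(token_ids, ctx=512):
    """Split a token sequence into non-overlapping ctx-sized chunks, dropping the remainder."""
    n_chunks = len(token_ids) // ctx
    return [token_ids[i * ctx:(i + 1) * ctx] for i in range(n_chunks)]


def perplexity(token_logprobs):
    """PPL = exp(-mean(log p)) over all scored tokens (natural-log probabilities)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_delta_percent(ppl_quant, ppl_base):
    """Relative perplexity increase of the quantized model, in percent."""
    return (ppl_quant - ppl_base) / ppl_base * 100.0


# Values from the table above.
delta = relative_delta_percent(7.2620, 7.0307)
```

Rounded to two decimals, delta reproduces the +3.29% shown in the table.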

KL Divergence vs BF16 (Static / Prefill)

KL divergence measures how much the output probability distribution has shifted from the base model. Lower is better; 0 = identical.

Methodology matches llama.cpp --kl-divergence: wikitext-2-raw-v1, ctx=512, score only the second half of each chunk (positions [256–511]), which ensures every scored token has at least 256 tokens of left context. KLD direction: KL(P_base ‖ P_quant), i.e. "how well does the quantized model approximate the base?"

Metric Value
Mean KLD 0.0793
Median KLD 0.0346
99th %ile KLD 0.617
95th %ile KLD 0.280
Δp RMS 2.30%
Same top-p 92.2%

Same top-p = 92.2% means both quantized and base models agree on the most likely token 92.2% of the time.
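
The two core quantities can be sketched per position as below, assuming both models provide log-probabilities over the same set of tokens in the same order. The function names and the toy distributions are illustrative; the measurements in this card were taken from API logprobs, not with this code.

```python
import math
import statistics


def kl_divergence(logp_base, logp_quant):
    """KL(P_base || P_quant) at one position, from natural-log probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp_base, logp_quant))


def same_top(logp_base, logp_quant):
    """True if both distributions rank the same token first."""
    top_base = max(range(len(logp_base)), key=logp_base.__getitem__)
    top_quant = max(range(len(logp_quant)), key=logp_quant.__getitem__)
    return top_base == top_quant


# Toy three-token vocabulary at two positions (made-up probabilities).
base = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
quant = [[0.6, 0.3, 0.1], [0.3, 0.5, 0.2]]
to_log = lambda dist: [math.log(p) for p in dist]

klds = [kl_divergence(to_log(b), to_log(q)) for b, q in zip(base, quant)]
agreement = sum(same_top(to_log(b), to_log(q)) for b, q in zip(base, quant)) / len(base)
mean_kld, median_kld = statistics.mean(klds), statistics.median(klds)
```

Mean, median, and percentile KLD are then summary statistics over the per-position values, and the top-token agreement rate is the fraction of positions where same_top is true.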

Note on API-based KLD: These measurements use vLLM's top-20 logprobs per token (API limit), not full-vocab logits. This underestimates absolute KLD by ~10–15% compared to llama.cpp's full-vocab computation (see mlx-kld analysis). The corrected full-vocab KLD is estimated at ~0.10. Relative comparisons between quantization methods remain valid regardless.

KL Divergence vs BF16 (Generation / Autoregressive)

KLD measured per generation step during autoregressive decoding. Step-0 = first generated token (comparable to static/prefill KLD). Later steps compound โ€” small per-token divergences accumulate as the two models diverge onto different token trajectories.

Prompt Len Gen Len Step-0 KLD
128 128 0.152
512 128 0.109
1024 128 0.115

Step-0 KLD (0.109–0.152) is higher than the static prefill KLD (0.0793). This is expected: generation KLD uses greedy decoding and is measured at a single position per prompt, without averaging over many positions, which amplifies KLD at the first token. Shorter prompts provide less context, which makes the prediction more sensitive to quantization noise; the 128-token prompt accordingly shows the highest KLD.

vLLM Throughput (RTX 5090, 32 GB)

Single Request

Metric Value
Aggregate Throughput 107.2 tok/s
Total tokens generated 11,371 (20 requests × up to 1,024 tokens)
Average Latency 5.30 s
Min/Max Latency 0.44 s / 9.25 s
Per-request Throughput 3.5–113.3 tok/s
Success Rate 20/20 (100%)

32 Concurrent Requests

Metric Value
Aggregate Throughput 2,638.4 tok/s
Total tokens generated 367,797 (640 requests × up to 1,024 tokens)
Average Latency 6.75 s (end-to-end per request, including queuing)
Min/Max Latency 0.42 s / 11.85 s
Per-request Throughput 2.1–95.6 tok/s
Success Rate 640/640 (100%)

Each request generated up to 1,024 tokens. Average latency includes queuing time under concurrent load.
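
As a rough cross-check of the two tables, the sketch below derives the implied wall-clock time and the concurrency speedup. It assumes that aggregate throughput is defined as total generated tokens divided by total wall-clock time, which the tables do not state explicitly.

```python
def wall_time_seconds(total_tokens, aggregate_tok_per_s):
    """Implied benchmark duration, assuming throughput = tokens / wall time."""
    return total_tokens / aggregate_tok_per_s


# Figures taken from the two tables above.
single_wall = wall_time_seconds(11_371, 107.2)
concurrent_wall = wall_time_seconds(367_797, 2_638.4)
speedup = 2_638.4 / 107.2
avg_tokens_single = 11_371 / 20
avg_tokens_concurrent = 367_797 / 640
```

Under that assumption, the single-request run took about 106 s, which is close to 20 requests times the 5.30 s average latency, and 32-way concurrency raised aggregate throughput by roughly 24.6×.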

Hardware Requirements

GPU VRAM Recommended Config
96 GB (RTX PRO 6000, H100, H200) gpu_memory_utilization: 0.95, max_model_len: 131072, KV: fp8_e4m3 optional
32 GB (RTX 5090) gpu_memory_utilization: 0.96, max_model_len: 131072, kv_cache_dtype: fp8_e4m3, max_num_batched_tokens: 8192

Minimum: 1× GPU with ≥24 GB VRAM (with fp8 KV cache and reduced context).

GPTQ W4A16 with Marlin kernels requires Ampere or later (sm80+): A100, RTX 3090, RTX 4090, RTX 5090, RTX PRO 6000, H100, H200. Pre-Ampere GPUs (V100, GTX 1080) are NOT supported by Marlin kernels.
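
The KV-cache recommendations can be motivated with a back-of-the-envelope estimate from the architecture values in Model Details (40 layers, 8 key/value heads, head dimension 128). The sketch counts KV-cache bytes only; it ignores the roughly 15 GB of weights, activations, the vision tower, and runtime overhead, so treat it as an approximation rather than a sizing guarantee.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_token(layers=40, kv_heads=8, head_dim=128, bytes_per_elem=1):
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(num_tokens, bytes_per_elem):
    """Total KV-cache size in GiB for num_tokens cached tokens."""
    return kv_cache_bytes_per_token(bytes_per_elem=bytes_per_elem) * num_tokens / GIB


# One sequence at the full 131,072-token context window.
fp8_full_context = kv_cache_gib(131_072, bytes_per_elem=1)
fp16_full_context = kv_cache_gib(131_072, bytes_per_elem=2)
```

A single full-context sequence needs about 10 GiB of KV cache in fp8 versus about 20 GiB in 16-bit precision, which is consistent with recommending kv_cache_dtype: fp8_e4m3 on a 32 GB card.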

Usage with vLLM

Tested with: vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 (vLLM v0.19.2rc1.dev134)

Docker Deployment

docker run -d --name vllm-mistral-gptq \
  --runtime=nvidia --gpus '"device=0"' \
  -p 8000:8000 \
  -v /path/to/model:/workspace/model \
  -v /path/to/vllm_config.yaml:/vllm_config.yaml \
  --ipc=host --shm-size=16g \
  --restart unless-stopped \
  vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 \
  --config /vllm_config.yaml

Example vLLM Configuration (YAML)

This configuration is deployed and verified on an RTX 5090 (32 GB, GPU 1, port 5006):

# vLLM Configuration: Mistral-Small-3.2-24B GPTQ W4A16 (auto-round)

# -- Model & Server ----------------------------------------------------------
model: /workspace/model
host: "0.0.0.0"
port: 8000
served_model_name: "mistral-small-3.2-24b-gptq-W4A16-v1"
trust_remote_code: true
tensor_parallel_size: 1

# -- Quantization ------------------------------------------------------------
quantization: gptq_marlin

# -- Tokenizer & Config Format -----------------------------------------------
tokenizer_mode: mistral
config_format: mistral

# -- Data Type ---------------------------------------------------------------
# vLLM infers float16 from safetensors and casts to bfloat16 for computation.
# Both float16 and bfloat16 work; bfloat16 is recommended for Blackwell (sm120).
dtype: bfloat16

# -- Load Format -------------------------------------------------------------
load_format: auto

# -- Context & Batching ------------------------------------------------------
max_model_len: 131072
max_num_batched_tokens: 8192
max_num_seqs: 32

# -- Memory ------------------------------------------------------------------
gpu_memory_utilization: 0.96
enable_prefix_caching: true
enable_chunked_prefill: true
kv_cache_dtype: fp8_e4m3

# -- Multi-Modal -------------------------------------------------------------
limit_mm_per_prompt:
  image: 4

# -- Tool Calling ------------------------------------------------------------
# Mistral native tool call format.
enable_auto_tool_choice: true
tool_call_parser: mistral

# -- Default Generation ------------------------------------------------------
generation_config: auto
override_generation_config:
  temperature: 0.15

# -- Misc --------------------------------------------------------------------
disable_custom_all_reduce: true

Inference Test

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mistral-small-3.2-24b-gptq-W4A16-v1","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":10}'
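
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server runs on localhost:8000 and that the model name matches served_model_name in the example configuration; the helper names build_chat_request and ask are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       model="mistral-small-3.2-24b-gptq-W4A16-v1",
                       base_url="http://localhost:8000",
                       max_tokens=10):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,  # assumption: must match served_model_name on the server
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires a running server:
# print(ask("What is 2+2? One word."))
```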

Notes

Tokenizer: Tekken (Mistral-specific)

This model uses the Tekken tokenizer (tekken.json), not a standard HF tokenizer. You must use --tokenizer-mode mistral with vLLM to ensure correct tokenization. Using auto or hf mode produces garbled output.

Data Type: float16 weights, bfloat16 computation

The model weights are stored as float16 (set in config.json: torch_dtype: float16). vLLM infers float16 from the safetensors files and can cast to bfloat16 for computation on Blackwell (sm120) and later GPUs. Both --dtype float16 and --dtype bfloat16 work; bfloat16 is recommended for Blackwell. Older vLLM versions (< v0.19) may require float16 explicitly.

Config Format: mistral

Use --config-format mistral with this model. vLLM reads the model architecture from params.json and quantization_config from config.json. The --config-format hf path triggers PixtralProcessor, which fails with a "Token out of vocabulary" error.

Vision: Image Size Limit

The tekken.json in this repository has max_image_size set to 1024 (down from the original 1540). Images with any dimension exceeding 1024px are proportionally downscaled before vision encoding.
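
Proportional downscaling to a maximum side length can be sketched as follows. The rounding behaviour and any further alignment to the patch size are assumptions; the actual preprocessing is performed by the tokenizer and image processor, not by this function.

```python
def fit_within(width, height, max_size=1024):
    """Scale (width, height) so the longer side is at most max_size, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_size:
        return width, height  # already within the limit, left unchanged
    ratio = max_size / longest
    # Assumption: round to the nearest pixel, never below 1.
    return max(1, round(width * ratio)), max(1, round(height * ratio))


resized = fit_within(2048, 1024)
unchanged = fit_within(800, 600)
```

If image detail matters, resizing or cropping on the client side before sending gives explicit control over what is lost.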

Files in This Repository

File Size Description
model-00001-of-00005.safetensors ~3.0 GB Quantized LM layers + lm_head + embed_tokens (shard 1)
model-00002-of-00005.safetensors ~3.0 GB Quantized LM layers (shard 2)
model-00003-of-00005.safetensors ~3.0 GB Quantized LM layers (shard 3)
model-00004-of-00005.safetensors ~3.0 GB Quantized LM layers (shard 4)
model-00005-of-00005.safetensors ~3.0 GB Quantized LM layers + vision tower + projector (shard 5)
model.safetensors.index.json - Shard index with weight map
config.json - Model configuration with quantization_config
params.json - Mistral-native parameter specification
quantization_config.json - GPTQ quantization parameters
tekken.json ~15 MB Tekken tokenizer (Mistral-specific)
tokenizer.json ~20 MB HF-compatible tokenizer fallback
tokenizer_config.json ~22 MB Tokenizer configuration
generation_config.json - Generation parameters
preprocessor_config.json - Image preprocessor configuration
processor_config.json - Processor configuration

License

This quantization is released under the Apache 2.0 License, following the base model's license.

The base model mistralai/Mistral-Small-3.2-24B-Instruct-2506 is licensed under Apache 2.0.

Citation

If you use this model in your research, please cite:

@misc{mistral-small-3.2-24b-gptq-w4a16,
  title = {Mistral-Small-3.2-24B-Instruct-2506 GPTQ W4A16 Quantization},
  author = {Gratex International},
  year = {2026},
  howpublished = {\url{https://huggingface.co/gratex/Mistral-Small-3.2-24B-Instruct-2506-GPTQ-W4A16-AutoRound}},
  note = {Quantized with AutoRound 0.12.3, exported in GPTQ format}
}

Acknowledgments

This quantization was produced using hardware provided by Gratex International, a.s.


Original Model: mistralai/Mistral-Small-3.2-24B-Instruct-2506
Quantization Tool: AutoRound
Deployment Engine: vLLM
