Mistral-Small-3.2-24B-Instruct-2506: GPTQ W4A16 Quantization
This is a GPTQ W4A16 (4-bit weight-only) quantization of Mistral-Small-3.2-24B-Instruct-2506, a 24B parameter multimodal model with Pixtral-style vision capabilities, quantized using AutoRound v0.12.3 with SignRound optimization (1000 iterations) and exported in GPTQ format for use with Marlin/CUTLASS kernels.
Model Details
| Property | Value |
|---|---|
| Base Model | mistralai/Mistral-Small-3.2-24B-Instruct-2506 |
| Quantization Method | AutoRound (SignRound, W4A16) → GPTQ export |
| Weight Precision | INT4 (group_size=128, symmetric, desc_act=false) |
| Activation Precision | FP16 (weight-only quantization) |
| Quantization Library | AutoRound 0.12.3 |
| Packing Format | auto_gptq (Marlin-compatible) |
| Architecture | Mistral3ForConditionalGeneration |
| LM Layers | 40 MistralDecoder layers |
| Hidden Size | 5120 |
| Intermediate Size | 32768 |
| Attention Heads | 32 (query), 8 (key/value, GQA) |
| Head Dimension | 128 |
| Vocabulary Size | 131,072 (LM head dimension; Tekken tokenizer contains 150,000 regular + 1,000 special tokens, but only 131,072 are used by the model) |
| Context Window | 131,072 tokens |
| Vision Encoder | Pixtral (24 layers, hidden_size=1024, patch_size=14) |
| Vision Projector | patch_merge (spatial_merge_size=2) |
| Quantized Components | Text decoder Linear layers only |
| Preserved in FP16 | Vision tower (all 24 layers), multi-modal projector, lm_head, embed_tokens, layer norms |
Quantization Configuration
{
"bits": 4,
"data_type": "int",
"group_size": 128,
"sym": true,
"batch_size": 4,
"iters": 1000,
"low_gpu_mem_usage": true,
"nsamples": 512,
"desc_act": false,
"true_sequential": false,
"damp_percent": 0.01,
"lm_head": false,
"autoround_version": "0.12.3",
"provider": "auto-round",
"quant_method": "gptq"
}
Key parameters:
- iters=1000: Maximum SignRound optimization steps per block (~5× slower than the default of 200, best accuracy)
- nsamples=512: 512 calibration samples (4× the default of 128)
- sym=true: Symmetric quantization (no zero-point)
- group_size=128: Per-128-element scaling groups
- desc_act=false: No activation-order reordering (desc_act); required for Marlin kernel compatibility
- quant_method=gptq: Exported in GPTQ format (auto_gptq packing) for Marlin kernel acceleration
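To make sym=true and group_size=128 concrete, here is a minimal sketch of symmetric group-wise 4-bit quantization. It uses plain round-to-nearest, which is an assumption made for illustration: it is not SignRound (which tunes the rounding over many optimization steps) and it omits the packing of 4-bit values into the GPTQ storage format. All function names are hypothetical.

```python
from typing import List, Tuple

GROUP_SIZE = 128     # matches group_size in the configuration
QMIN, QMAX = -8, 7   # signed 4-bit integer range


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest: one scale per group, no zero-point."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    q = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return q, scale


def quantize_row(row: List[float],
                 group_size: int = GROUP_SIZE) -> List[Tuple[List[int], float]]:
    """Split a weight row into groups and quantize each one independently."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group(g) for g in groups]


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate FP values; activations stay in 16-bit (W4A16)."""
    return [v * scale for v in q]


# Toy row of 256 weights -> two groups of 128, each with its own scale.
row = [((i % 15) - 7) / 10 for i in range(256)]
packed = quantize_row(row)
```

Each group stores 128 small integers plus a single scale, so an outlier only degrades the resolution of its own group rather than the whole row.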
Calibration Dataset
The 512 calibration samples come from a domain-specific composite dataset (mistral_autoround_calib_slovak_insurance.jsonl) built for Gratex's insurance industry use case. All samples are text-only. Short samples are concatenated into chunks of ≥7,000 characters (conservatively mapped from ≥2,048 Tekken tokens). AutoRound's filter_func drops samples with fewer than seqlen tokens.
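The concatenation step could look roughly like the sketch below. It is not the script used to build the dataset: the function name, the blank-line separator, and the choice to drop a short trailing remainder are all assumptions; only the 7,000-character threshold comes from the text.

```python
from typing import Iterable, List

MIN_CHARS = 7000  # threshold quoted above (stands in for >= 2,048 Tekken tokens)


def concat_into_chunks(samples: Iterable[str], min_chars: int = MIN_CHARS,
                       sep: str = "\n\n") -> List[str]:
    """Greedily join consecutive samples until each chunk has >= min_chars."""
    chunks: List[str] = []
    buffer: List[str] = []
    size = 0
    for text in samples:
        text = text.strip()
        if not text:
            continue
        # Count the separator that will join this piece to the buffer.
        size += len(text) + (len(sep) if buffer else 0)
        buffer.append(text)
        if size >= min_chars:
            chunks.append(sep.join(buffer))
            buffer, size = [], 0
    # Assumption: a too-short trailing remainder is dropped, since a length
    # filter would discard it later anyway.
    return chunks


demo = concat_into_chunks(["a" * 3000] * 5)
```

With five 3,000-character samples, the first three are merged into one chunk and the remaining two fall below the threshold and are discarded.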
The dataset is built from the following sources:
| Source | Domain | HF ID / URL | License | Weight |
|---|---|---|---|---|
| Slovak language (~40%) | | | | |
| FineWeb2 Slovak | General Slovak text | ivykopal/fineweb2-slovak | ODC-By 1.0 | 2× |
| Slovak Wikipedia | Slovak-language Wikipedia articles | Local: /data/skwiki-extracted | CC-BY-SA 4.0 | 1× |
| Insurance terminology (~30%) | | | | |
| Insurance Contract Definitions | 6k+ English insurance term definitions | codexstanford/insurance-contract-definitions | MIT | 2× |
| Actuarial Ontology | Actuarial concepts (TTL format) | Actuarial-Notes/Actuarial-Ontology | MIT | 1× |
| Bitext Insurance Chatbot | 39k insurance QA pairs | bitext/Bitext-insurance-llm-chatbot-training-dataset | CDLA-Sharing 1.0 | 2× |
| Tool calling (~30%) | | | | |
| Hermes Function Calling v1 | 100k+ ShareGPT tool-calling conversations | NousResearch/hermes-function-calling-v1 | Apache 2.0 | 1× |
| ToolACE | 11,300 rows, 26k diverse APIs | Team-ACE/ToolACE | Apache 2.0 | 1× |
| When2Call | When NOT to call tools | nvidia/When2Call | CC-BY 4.0 | 1× |
| General fill | | | | |
| Pile-10k | General text | NeelNanda/pile-10k | - | 1× |
Quality Benchmarks
All benchmarks use wikitext-2-raw-v1 (test split), the standard dataset for quantization quality evaluation, matching the methodology of llama.cpp, AutoAWQ, GPTQ, and the academic literature.
WikiText-2 Perplexity (ctx=512)
Measured via vLLM completions API with echo=True + logprobs. 642 non-overlapping chunks, ~301K tokens scored. Methodology matches llama.cpp ./perplexity -c 512.
| Model | PPL | Δ vs Base |
|---|---|---|
| Base BF16 | 7.0307 | - |
| GPTQ W4A16 (this model) | 7.2620 | +3.29% |
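For reference, perplexity is the exponential of the mean negative log-likelihood over the scored tokens, and the Δ column is a plain relative change. The sketch below shows both; the function names are hypothetical and the toy logprobs are made up, while the two PPL values are taken from the table.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural-log logprobs)."""
    if not token_logprobs:
        raise ValueError("need at least one scored token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_change(base: float, quant: float) -> float:
    """Relative change of quant vs base, as a percentage."""
    return (quant - base) / base * 100.0


# Made-up logprobs, only to exercise the function: mean NLL is 2.0.
toy_ppl = perplexity([-1.0, -2.0, -3.0])

# Figures from the table above.
delta_pct = relative_change(7.0307, 7.2620)
print(f"{delta_pct:+.2f}%")  # prints +3.29%
```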
KL Divergence vs BF16 (Static / Prefill)
KL divergence measures how much the output probability distribution has shifted from the base model. Lower is better; 0 = identical.
Methodology matches llama.cpp --kl-divergence: wikitext-2-raw-v1, ctx=512, score only the second half of each chunk (positions [256–511]), which ensures every scored token has at least 256 tokens of left context. KLD direction: KL(P_base ‖ P_quant), i.e. "how well does the quantized model approximate the base?"
| Metric | Value |
|---|---|
| Mean KLD | 0.0793 |
| Median KLD | 0.0346 |
| 99th %ile KLD | 0.617 |
| 95th %ile KLD | 0.280 |
| Δp RMS | 2.30% |
| Same top-p | 92.2% |
Same top-p = 92.2% means both quantized and base models agree on the most likely token 92.2% of the time.
Note on API-based KLD: These measurements use vLLM's top-20 logprobs per token (API limit), not full-vocab logits. This underestimates absolute KLD by ~10–15% compared to llama.cpp's full-vocab computation (see mlx-kld analysis). The corrected full-vocab KLD is estimated at ~0.10. Relative comparisons between quantization methods remain valid regardless.
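A per-position KL term computed from truncated logprob lists might be sketched as follows. How the actual evaluation handled tokens that appear in only one model's top-20 list is not stated, so skipping them here is an assumption, and the function name is hypothetical.

```python
import math
from typing import Dict


def topk_kl(base_logprobs: Dict[str, float],
            quant_logprobs: Dict[str, float]) -> float:
    """KL(P_base || P_quant) restricted to tokens present in both lists.

    Inputs map token -> natural-log probability for a single position.
    Tokens missing from either list are skipped (assumption), so mass
    outside the shared top-k contributes nothing to the sum.
    """
    kl = 0.0
    for token, lp_base in base_logprobs.items():
        lp_quant = quant_logprobs.get(token)
        if lp_quant is None:
            continue
        kl += math.exp(lp_base) * (lp_base - lp_quant)
    return kl


# Toy two-token distributions (made up).
base = {"a": math.log(0.5), "b": math.log(0.5)}
quant = {"a": math.log(0.25), "b": math.log(0.75)}
kl_toy = topk_kl(base, quant)
```

Because the sum covers only part of the vocabulary, the result is an approximation of the full-vocabulary value, which is why the note treats the API-based numbers as estimates.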
KL Divergence vs BF16 (Generation / Autoregressive)
KLD measured per generation step during autoregressive decoding. Step-0 = first generated token (comparable to static/prefill KLD). Later steps compound: small per-token divergences accumulate as the two models diverge onto different token trajectories.
| Prompt Len | Gen Len | Step-0 KLD |
|---|---|---|
| 128 | 128 | 0.152 |
| 512 | 128 | 0.109 |
| 1024 | 128 | 0.115 |
Step-0 KLD (0.109–0.152) is higher than the static prefill KLD (0.0793). This is expected: generation KLD uses greedy decoding, which amplifies KLD at the first token (no averaging over many positions). Shorter prompts provide less context, which makes the model more sensitive to quantization noise and yields higher KLD.
vLLM Throughput (RTX 5090, 32 GB)
Single Request
| Metric | Value |
|---|---|
| Aggregate Throughput | 107.2 tok/s |
| Total tokens generated | 11,371 (20 requests × up to 1,024 tokens) |
| Average Latency | 5.30 s |
| Min/Max Latency | 0.44 s / 9.25 s |
| Per-request Throughput | 3.5–113.3 tok/s |
| Success Rate | 20/20 (100%) |
32 Concurrent Requests
| Metric | Value |
|---|---|
| Aggregate Throughput | 2,638.4 tok/s |
| Total tokens generated | 367,797 (640 requests × up to 1,024 tokens) |
| Average Latency | 6.75 s (end-to-end per request, including queuing) |
| Min/Max Latency | 0.42 s / 11.85 s |
| Per-request Throughput | 2.1–95.6 tok/s |
| Success Rate | 640/640 (100%) |
Each request generated up to 1,024 tokens. Average latency includes queuing time under concurrent load.
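The reported figures can be cross-checked with a back-of-the-envelope calculation. The sketch below assumes that aggregate throughput means total generated tokens divided by wall-clock time, and that wall time can be approximated from the average latency with a fixed number of requests always in flight; neither assumption is confirmed by the benchmark description, and the function names are hypothetical.

```python
def aggregate_throughput(total_tokens: int, wall_seconds: float) -> float:
    """Tokens generated per second of wall-clock time."""
    return total_tokens / wall_seconds


def estimated_wall_seconds(num_requests: int, avg_latency_s: float,
                           concurrency: int) -> float:
    """Rough wall time if `concurrency` requests are always in flight."""
    return num_requests * avg_latency_s / concurrency


# Single-request run: 20 requests, one at a time (about 106 s in total).
wall_single = estimated_wall_seconds(20, 5.30, concurrency=1)
tps_single = aggregate_throughput(11_371, wall_single)

# Concurrent run: 640 requests, 32 at a time (about 135 s in total).
wall_conc = estimated_wall_seconds(640, 6.75, concurrency=32)
tps_conc = aggregate_throughput(367_797, wall_conc)

print(f"single:     {tps_single:.1f} tok/s estimated vs 107.2 reported")
print(f"concurrent: {tps_conc:.1f} tok/s estimated vs 2,638.4 reported")
```

Under these assumptions the single-request estimate lands almost exactly on the reported value, and the concurrent estimate is within a few percent, which is consistent with the tables above.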
Hardware Requirements
| GPU VRAM | Recommended Config |
|---|---|
| 96 GB (RTX PRO 6000, H100, H200) | gpu_memory_utilization: 0.95, max_model_len: 131072, KV: fp8_e4m3 optional |
| 32 GB (RTX 5090) | gpu_memory_utilization: 0.96, max_model_len: 131072, kv_cache_dtype: fp8_e4m3, max_num_batched_tokens: 8192 |
Minimum: 1× GPU with ≥24 GB VRAM (with fp8 KV cache and reduced context).
GPTQ W4A16 with Marlin kernels requires Ampere or later (sm80+): A100, RTX 3090, RTX 4090, RTX 5090, RTX PRO 6000, H100, H200. Pre-Ampere GPUs (V100, GTX 1080) are NOT supported by Marlin kernels.
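The VRAM recommendations above can be motivated with a rough KV-cache estimate built from the Model Details table (40 layers, 8 KV heads, head dimension 128). This is a simplified sketch: it ignores activations, the vision encoder, FP8 scaling metadata, and vLLM's block allocation overhead, and the function name is hypothetical.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int = 40, num_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys and values for `num_tokens` tokens."""
    # Factor 2 covers one key vector and one value vector per layer and head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return num_tokens * per_token


GIB = 1024 ** 3
FULL_CONTEXT = 131_072

fp16_gib = kv_cache_bytes(FULL_CONTEXT, bytes_per_elem=2) / GIB
fp8_gib = kv_cache_bytes(FULL_CONTEXT, bytes_per_elem=1) / GIB

print(f"FP16 KV cache at full context: {fp16_gib:.1f} GiB")  # 20.0 GiB
print(f"FP8  KV cache at full context: {fp8_gib:.1f} GiB")   # 10.0 GiB
```

With roughly 15 GB of weights (five shards of about 3.0 GB each), a single full-length sequence with a 16-bit cache would not fit on a 32 GB card, while the FP8 cache leaves some headroom.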
Usage with vLLM
Tested with: vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 (vLLM v0.19.2rc1.dev134)
Docker Deployment
docker run -d --name vllm-mistral-gptq \
--runtime=nvidia --gpus '"device=0"' \
-p 8000:8000 \
-v /path/to/model:/workspace/model \
-v /path/to/vllm_config.yaml:/vllm_config.yaml \
--ipc=host --shm-size=16g \
--restart unless-stopped \
vllm/vllm-openai:cu130-nightly-fe9c3d6c5f66c873d196800384ed6880687b9e52 \
--config /vllm_config.yaml
Example vLLM Configuration (YAML)
This configuration is deployed and verified on an RTX 5090 (32 GB, GPU 1, port 5006):
# vLLM Configuration: Mistral-Small-3.2-24B GPTQ W4A16 (auto-round)
# -- Model & Server ----------------------------------------------------------
model: /workspace/model
host: "0.0.0.0"
port: 8000
served_model_name: "mistral-small-3.2-24b-gptq-W4A16-v1"
trust_remote_code: true
tensor_parallel_size: 1
# -- Quantization ------------------------------------------------------------
quantization: gptq_marlin
# -- Tokenizer & Config Format -----------------------------------------------
tokenizer_mode: mistral
config_format: mistral
# -- Data Type ---------------------------------------------------------------
# vLLM infers float16 from safetensors and casts to bfloat16 for computation.
# Both float16 and bfloat16 work; bfloat16 is recommended for Blackwell (sm120).
dtype: bfloat16
# -- Load Format -------------------------------------------------------------
load_format: auto
# -- Context & Batching ------------------------------------------------------
max_model_len: 131072
max_num_batched_tokens: 8192
max_num_seqs: 32
# -- Memory ------------------------------------------------------------------
gpu_memory_utilization: 0.96
enable_prefix_caching: true
enable_chunked_prefill: true
kv_cache_dtype: fp8_e4m3
# -- Multi-Modal -------------------------------------------------------------
limit_mm_per_prompt:
image: 4
# -- Tool Calling ------------------------------------------------------------
# Mistral native tool call format.
enable_auto_tool_choice: true
tool_call_parser: mistral
# -- Default Generation ------------------------------------------------------
generation_config: auto
override_generation_config:
temperature: 0.15
# -- Misc --------------------------------------------------------------------
disable_custom_all_reduce: true
Inference Test
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"mistral-small-3.2-24b-gptq-W4A16-v1","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":10}'
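The same request can be issued from Python using only the standard library. This is a minimal sketch, assuming the server runs with the example configuration above, so the model field must match its served_model_name and the port is 8000. The helper names build_chat_request and ask are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"                   # port from the example config
MODEL_NAME = "mistral-small-3.2-24b-gptq-W4A16-v1"  # served_model_name from the config


def build_chat_request(prompt: str, max_tokens: int = 10) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, max_tokens: int = 10, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Requires a running server:
# print(ask("What is 2+2? One word."))
```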
Notes
Tokenizer: Tekken (Mistral-specific)
This model uses the Tekken tokenizer (tekken.json), not a standard HF tokenizer. You must use --tokenizer-mode mistral with vLLM to ensure correct tokenization. Using auto or hf mode produces garbled output.
Data Type: float16 weights, bfloat16 computation
The model weights are stored as float16 (set in config.json: torch_dtype: float16). vLLM infers float16 from the safetensors files and can cast to bfloat16 for computation on Blackwell (sm120) and later GPUs. Both --dtype float16 and --dtype bfloat16 work; bfloat16 is recommended for Blackwell. Older vLLM versions (< v0.19) may require float16 explicitly.
Config Format: mistral
Use --config-format mistral with this model. vLLM reads model architecture from params.json and quantization_config from config.json. The --config-format hf path triggers PixtralProcessor which produces a Token out of vocabulary error.
Vision: Image Size Limit
The tekken.json in this repository has max_image_size set to 1024 (down from the original 1540). Images with any dimension exceeding 1024px are proportionally downscaled before vision encoding.
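The effect of the limit on image dimensions can be illustrated as follows. This sketch only shows proportional downscaling to a 1024-pixel longest side; the rounding rule is an assumption, and the actual preprocessing may round differently (for example to multiples of the patch size). The function name is hypothetical.

```python
from typing import Tuple

MAX_IMAGE_SIZE = 1024  # max_image_size from this repository's tekken.json


def downscaled_size(width: int, height: int,
                    max_size: int = MAX_IMAGE_SIZE) -> Tuple[int, int]:
    """Shrink (width, height) so the longest side is at most max_size."""
    longest = max(width, height)
    if longest <= max_size:
        return width, height  # small images pass through unchanged
    scale = max_size / longest
    # Assumption: round to the nearest pixel, never below 1.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(downscaled_size(2048, 1024))  # (1024, 512)
print(downscaled_size(800, 600))    # (800, 600)
```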
Files in This Repository
| File | Size | Description |
|---|---|---|
| model-00001-of-00005.safetensors | ~3.0 GB | Quantized LM layers + lm_head + embed_tokens (shard 1) |
| model-00002-of-00005.safetensors | ~3.0 GB | Quantized LM layers (shard 2) |
| model-00003-of-00005.safetensors | ~3.0 GB | Quantized LM layers (shard 3) |
| model-00004-of-00005.safetensors | ~3.0 GB | Quantized LM layers (shard 4) |
| model-00005-of-00005.safetensors | ~3.0 GB | Quantized LM layers + vision tower + projector (shard 5) |
| model.safetensors.index.json | - | Shard index with weight map |
| config.json | - | Model configuration with quantization_config |
| params.json | - | Mistral-native parameter specification |
| quantization_config.json | - | GPTQ quantization parameters |
| tekken.json | ~15 MB | Tekken tokenizer (Mistral-specific) |
| tokenizer.json | ~20 MB | HF-compatible tokenizer fallback |
| tokenizer_config.json | ~22 MB | Tokenizer configuration |
| generation_config.json | - | Generation parameters |
| preprocessor_config.json | - | Image preprocessor configuration |
| processor_config.json | - | Processor configuration |
License
This quantization is released under the Apache 2.0 License, following the base model's license.
The base model mistralai/Mistral-Small-3.2-24B-Instruct-2506 is licensed under Apache 2.0.
Citation
If you use this model in your research, please cite:
@misc{mistral-small-3.2-24b-gptq-w4a16,
title = {Mistral-Small-3.2-24B-Instruct-2506 GPTQ W4A16 Quantization},
author = {Gratex International},
year = {2026},
howpublished = {\url{https://huggingface.co/gratex/Mistral-Small-3.2-24B-Instruct-2506-GPTQ-W4A16-AutoRound}},
note = {Quantized with AutoRound 0.12.3, exported in GPTQ format}
}
Acknowledgments
This quantization was produced using hardware provided by Gratex International, a.s.
- Original Model: mistralai/Mistral-Small-3.2-24B-Instruct-2506
- Quantization Tool: AutoRound
- Deployment Engine: vLLM