pixtral-12b — W4A16 (compressed-tensors)

Standard W4A16 quantization of mgoin/pixtral-12b, produced with llm-compressor (the official vLLM-team quantization toolkit) inside a reproducible Docker container. The artifact is saved in compressed-tensors format and loads directly in vLLM with no upstream patches and no client-side shims: vLLM detects the quantization config embedded in config.json at load time.

This release is part of an ongoing series of vLLM-friendly quantized packs maintained by the atlas self-evolving agent project, run by Alex Adamopoulos at assert.gr.

Reproducibility

Parameter              Value
Source model           mgoin/pixtral-12b
Quantization tool      llm-compressor 0.12.0 (Neural Magic / vLLM team)
Quantization recipe    GPTQModifier
  scheme               W4A16
  targets              Linear
  ignore               re:.*lm_head, re:.*vision_tower.*, re:.*multi_modal_projector.*
  sequential_targets
Calibration dataset    lmms-lab/flickr30k
Calibration samples    512
max_seq_length         2048
Quantized size         8.55 GiB
Quantization time      0.0 min (dual RTX 3090)
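
The ignore entries in the recipe use the re: prefix, which marks a regular expression over module names. The sketch below mimics that matching with Python's re module to show which layers such a recipe would skip. The module names are hypothetical stand-ins rather than names read from the checkpoint, and the helper is a simplified assumption about the matching behavior, not the library's own code.

```python
import re

# Ignore patterns copied from the recipe parameters above.
IGNORE = ["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"]


def is_ignored(module_name: str, patterns=IGNORE) -> bool:
    """Return True if any ignore pattern matches the module name.

    Assumption: a "re:" prefix marks a regex matched from the start of the
    name; any other entry is compared as an exact string.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
examples = [
    "language_model.model.layers.0.self_attn.q_proj",
    "language_model.lm_head",
    "vision_tower.transformer.layers.0.attention.q_proj",
    "multi_modal_projector.linear_1",
]

for name in examples:
    action = "skip (left unquantized)" if is_ignored(name) else "quantize to W4A16"
    print(f"{name}: {action}")
```

Under these assumptions only the language-model attention projection is quantized; the head, vision tower, and projector entries are skipped.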

The pipeline that produced this artifact lives at tools/quantize/ in the atlas repository — see the README there for the full Docker build + run sequence and the per-model env-var recipes.

License

Inherits the license of the base model. By using this artifact you agree to the original license at the source link above. Atlas / assert.gr adds no additional restrictions on the quantized weights.

Usage with vLLM

docker run --runtime=nvidia --gpus all \
    -p 8000:8000 \
    -e HF_TOKEN=hf_XXX \
    vllm/vllm-openai:latest \
    --model aleada/Pixtral-12B-W4A16 \
    --limit-mm-per-prompt 'image=1' \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching

vLLM auto-detects compressed-tensors from the model's config, so no --quantization flag is required (it is accepted as a redundant hint). vLLM also takes the model's full native context window from config.json (e.g. 128k for Phi-4-mini, 16k for Phi-4 14B). If you hit KV-cache OOM on a smaller GPU, pin a shorter window with --max-model-len 16384 (or smaller); leave the flag off to get the maximum context the model was trained for. Once vLLM is running, query it with any OpenAI client:

from openai import OpenAI

# Point the client at the local vLLM server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="aleada/Pixtral-12B-W4A16",
    messages=[{"role": "user", "content": "Hello"}],
)
# Print the assistant's reply text.
print(resp.choices[0].message.content)
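
Because the server above is started with --limit-mm-per-prompt 'image=1', a request can also carry one image. The following is a minimal sketch using the OpenAI chat format with an image_url content part. The image URL and the helper names build_image_messages and describe_image are placeholders chosen for illustration, and the call assumes the server from the docker command is listening on localhost:8000.

```python
def build_image_messages(prompt: str, image_url: str) -> list:
    """Build a single-turn chat payload with one text part and one image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


def describe_image(image_url: str, prompt: str = "Describe this image.") -> str:
    """Send the payload to a local vLLM server (only runs when called)."""
    from openai import OpenAI  # lazy import: the payload helper needs no dependencies

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="aleada/Pixtral-12B-W4A16",
        messages=build_image_messages(prompt, image_url),
    )
    return resp.choices[0].message.content


# Placeholder URL; replace it with a reachable image before calling describe_image().
messages = build_image_messages("Describe this image.", "https://example.com/cat.jpg")
print(messages[0]["content"][1]["type"])
```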

Hardware target

Requires CUDA compute-capability ≥ 8.0 (Ampere or newer). Verified on NVIDIA RTX 3090 (compute 8.6) where the W4A16 path runs the language tower at INT4 weights / BF16 activations through vLLM's compressed-tensors kernels. Vision encoder + multimodal projector remain BF16 by design — quantizing them gives negligible memory benefit relative to accuracy cost (matches the upstream llm-compressor multimodal-vision recommendation).
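
To judge whether --max-model-len needs pinning on a given card, a back-of-the-envelope KV-cache estimate can help. The sketch below uses the usual per-token formula (keys and values, per layer, per KV head). The layer count, KV-head count, and head size are assumed values for illustration and should be read from the model's config.json; the result also ignores vLLM's block allocation and other runtime overheads.

```python
def kv_cache_bytes(tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for one sequence of `tokens` tokens.

    The factor 2 accounts for storing both keys and values;
    bytes_per_elem=2 corresponds to a 16-bit cache such as BF16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * tokens


# Assumed architecture values (not taken from this model card).
ASSUMED = dict(num_layers=40, num_kv_heads=8, head_dim=128)

for window in (16384, 32768, 131072):
    gib = kv_cache_bytes(window, **ASSUMED) / 2**30
    print(f"{window:>7} tokens: {gib:.2f} GiB")
```

With these assumed values a single 16384-token sequence would need about 2.5 GiB of cache in addition to the 8.55 GiB of quantized weights.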

About the maintainer

Alex Adamopoulos is the founder of assert.gr and the engineer behind the atlas self-evolving AI agent platform. Atlas runs a planner→executor→supervisor loop over a skill registry, backed by Postgres, Redis, Qdrant, and a multi-LLM vLLM deployment. Quantization releases like this one keep the open-source VLM ecosystem usable on consumer-grade hardware for self-hosted agent research.
