Mistral-Small-4-119B-Heretic-NVFP4
NVFP4 (W4A4) quantization of darkc0de/Mistral-Small-4-119B-2603-heretic — an abliterated Mistral Small 4 (119B, MoE).
To our knowledge this is the first publicly available NVFP4 quant of a Mistral 4 model outside RedHat's toolchain. ~70 GB (down from ~238 GB BF16), runnable on a single 96–128 GB Blackwell GPU (e.g. NVIDIA GB10 / DGX Spark).
⚠️ READ THE "Serving" SECTION FIRST. This checkpoint is in HuggingFace `compressed-tensors` format, which does not load on stock vLLM ≤ 0.20.1 (there is no `mistral4` HF text-model class, so loading fails with `ValueError: No model architectures are specified`). You need either a newer vLLM with Mistral 4 HF support, or a conversion to native Mistral format with the script in `recipes/` (about 6 min, a pure tensor rename). Full instructions below.
What this is
| | |
|---|---|
| Base | darkc0de/Mistral-Small-4-119B-2603-heretic (abliterated Mistral Small 4, 119B MoE — 36 layers, 128 experts, 4 active + 1 shared, MLA attention) |
| Scheme | NVFP4 W4A4, compressed-tensors nvfp4-pack-quantized (4-bit weights, 4-bit dynamic activations, group size 16) |
| Quantized | All routed + shared expert Linears (the bulk of the weights) |
| Kept BF16 | attention (MLA), router/gate, lm_head, vision tower, mm-projector |
| Tooling | llm-compressor + a custom CalibrationMistral4MoE wrapper (see Reproduction) |
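The ~70 GB figure can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes NVFP4 costs 4 bits per weight plus one 8-bit scale per group of 16 (4.5 bits effective) and ignores per-tensor scales and file metadata. The `0.98` quantized share is an illustrative guess, not a measured number for this checkpoint.

```python
# Rough storage estimate for a mixed NVFP4 / BF16 checkpoint.
PARAMS = 119e9                      # total parameters, from the model name (119B)
GROUP_SIZE = 16                     # NVFP4 group size, from the Scheme row
BF16_BITS = 16
NVFP4_BITS = 4 + 8 / GROUP_SIZE     # 4-bit value + one 8-bit scale per group = 4.5


def checkpoint_gb(quantized_fraction: float) -> float:
    """Size in GB (1e9 bytes) when `quantized_fraction` of the weights are NVFP4."""
    bits_per_weight = (
        quantized_fraction * NVFP4_BITS + (1 - quantized_fraction) * BF16_BITS
    )
    return PARAMS * bits_per_weight / 8 / 1e9


bf16_gb = checkpoint_gb(0.0)        # everything kept in BF16
all_fp4_gb = checkpoint_gb(1.0)     # lower bound: every weight quantized
# ASSUMPTION: 0.98 is a made-up expert share used only for illustration.
mixed_gb = checkpoint_gb(0.98)

print(f"BF16 {bf16_gb:.0f} GB | all NVFP4 {all_fp4_gb:.1f} GB | mixed {mixed_gb:.1f} GB")
```

Under these assumptions the BF16 size comes out at 238 GB and the mixed estimate lands close to 70 GB, which is consistent with the experts holding nearly all of the weights.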
⚙️ Serving
Mistral 4 uses MLA attention + grouped MoE (DeepSeek-V2-like). Two things bite:
1. Format. Stock vLLM 0.20.1 has `Mistral3ForConditionalGeneration` but no `Mistral4` HF text class, so this HF checkpoint won't resolve. Convert to native Mistral format (`consolidated-*.safetensors` + `params.json` + `tekken.json`); it's a byte-identical tensor rename (the `nvfp4-pack-quantized` payload is the same), see `recipes/convert-mistral4-to-native.py`. Keep the vision tower (splice the base model's BF16 vision tensors) so vLLM resolves it as `PixtralForConditionalGeneration`; strip vision and it falls back to `deepseek_v2` and the MLA path breaks (see #2).
2. ⭐ On Blackwell (SM120/SM121 / GB10), you MUST set the env below. Without `VLLM_MLA_DISABLE=1`, the only available MLA backend is `TRITON_MLA`, whose decode kernel crashes on Mistral 4's `kv_lora_rank=256` with `ValueError: Cannot make_shape_compatible: incompatible dimensions at index 1: 256 and 512`. Disabling MLA routes attention through `FLASH_ATTN`, which works.
# proven on GB10 / vLLM 0.20.1; see recipes/serve-mistral4-heretic-native.sh
# ⭐ VLLM_MLA_DISABLE=1 selects FLASH_ATTN instead of the broken TRITON_MLA decode kernel
# VLLM_USE_FLASHINFER_MOE_FP4=0 selects the MARLIN MoE backend (SM12x-stable), not flashinfer-cutlass
# (comments sit above the command: a comment after a trailing backslash breaks the line continuation)
VLLM_MLA_DISABLE=1 \
VLLM_NVFP4_GEMM_BACKEND=marlin \
VLLM_USE_FLASHINFER_MOE_FP4=0 \
TORCH_CUDA_ARCH_LIST=12.1a ENABLE_NVFP4_SM100=0 \
VLLM_ENGINE_CORE_STARTUP_TIMEOUT=600 \
vllm serve ./heretic-native \
  --load-format mistral --config-format mistral --tokenizer-mode mistral \
  --tensor-parallel-size 1 --gpu-memory-utilization 0.85 \
  --enable-auto-tool-choice --tool-call-parser mistral
A healthy boot logs Resolved architecture: PixtralForConditionalGeneration, Using FLASH_ATTN, and Using 'MARLIN' NvFp4 MoE backend.
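A boot log can be checked for those three markers automatically. This is a minimal sketch that assumes the strings appear in the log exactly as quoted above; the function name `missing_boot_markers` is hypothetical and not part of `recipes/`.

```python
# Markers copied from the healthy-boot description in this README.
EXPECTED_MARKERS = (
    "Resolved architecture: PixtralForConditionalGeneration",
    "Using FLASH_ATTN",
    "Using 'MARLIN' NvFp4 MoE backend",
)


def missing_boot_markers(log_text: str) -> list[str]:
    """Return the expected markers that do not occur anywhere in the log text."""
    return [marker for marker in EXPECTED_MARKERS if marker not in log_text]


# Demo input built from the markers themselves, with the MoE line left out.
partial_log = "\n".join(EXPECTED_MARKERS[:2])
missing = missing_boot_markers(partial_log)
print(missing)
```

An empty list means all three markers were found. A missing `FLASH_ATTN` marker is the one to look at first on GB10, since it suggests the MLA path is still active.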
On datacenter Blackwell (B200/SM100) and other arches the MLA path may work without `VLLM_MLA_DISABLE`; the env above is the GB10/SM12x-validated recipe.
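Before pointing `vllm serve` at a directory, it can help to confirm which layout it is in. The sketch below uses only the file names mentioned in this section for the native format, and assumes an HF checkpoint can be recognized by its `config.json`. The helper `checkpoint_layout` and the shard name in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def checkpoint_layout(model_dir: str) -> str:
    """Classify a checkpoint directory as 'native', 'hf', or 'unknown'."""
    root = Path(model_dir)
    has_native = (
        (root / "params.json").is_file()
        and (root / "tekken.json").is_file()
        and any(root.glob("consolidated-*.safetensors"))
    )
    if has_native:
        return "native"
    if (root / "config.json").is_file():
        return "hf"
    return "unknown"


# Demo on an empty scratch directory; the shard file name is made up.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("params.json", "tekken.json", "consolidated-00001.safetensors"):
        (Path(tmp) / name).touch()
    demo_layout = checkpoint_layout(tmp)

print(demo_layout)
```

Only a `native` result matches the `--load-format mistral --config-format mistral` flags used in the serve command.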
✅ Validation
Served on GB10 (vLLM 0.20.1) and smoke-tested:
- Coherent: clean generation, no `!!!!` (the failure mode of weight-only NVFP4-A16 on vLLM)
- Abliteration intact: answers blunt prompts directly, no refusal scaffolding
- Digit-precision verbatim: exact figures/IPs/CVEs survive the W4A4 quant (the main risk for a quant used to quote stats)
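The digit-precision check can be approximated with a small script. This is a sketch of one possible approach, not the test that was actually run: it extracts CVE IDs, IPv4 addresses, and numbers from a source text and reports any that are absent from the model's output. The regex, the function names, and the sample strings (including the CVE ID) are all made up for illustration.

```python
import re

# One alternation, tried in order at each position: CVE ID, IPv4 address, plain number.
EXACT_TOKEN = re.compile(
    r"CVE-\d{4}-\d{4,}"
    r"|(?:\d{1,3}\.){3}\d{1,3}"
    r"|\d+(?:\.\d+)?%?"
)


def exact_tokens(text: str) -> set[str]:
    """Collect every CVE ID, IPv4 address, and number found in the text."""
    return set(EXACT_TOKEN.findall(text))


def dropped_tokens(source: str, output: str) -> set[str]:
    """Tokens present in the source that do not reappear verbatim in the output."""
    return exact_tokens(source) - exact_tokens(output)


# Made-up example: the output transposes two digits of the CVE ID.
source_text = "CVE-2024-12345 affects 10.0.0.5 with 99.5% impact"
model_output = "CVE-2024-12354 affects 10.0.0.5 with 99.5% impact"
dropped = dropped_tokens(source_text, model_output)
print(dropped)
```

An empty set means every extracted figure survived verbatim. The check is deliberately strict, so a correct but reformatted number (for example `99.50%`) is also reported.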
🔁 Reproduction
The build's real blocker wasn't secret — it was a missing ~200-line MoE calibration wrapper plus pipeline="basic". See recipes/:
- `mistral4-moe-wrapper.py`: `CalibrationMistral4MoE` (un-fuses `Mistral4NaiveMoe` into per-expert FX-traceable MLPs for calibration). Upstream-PR candidate for llm-compressor `modeling/mistral4.py`.
- `quant-mistral4-heretic.py`: the `oneshot()` quant (transformers 5.5.3, llm-compressor PR #2608, `pipeline="basic"`, NVFP4, the `ignore` list above)
- `convert-mistral4-to-native.py` / `extract-vanilla-vision.py` / `validate-namemap.py`: HF→native conversion + vision splice
- `serve-mistral4-heretic-native.sh`: the full GB10 serve recipe
Key gotchas: load via AutoModelForImageTextToText + AutoTokenizer (no preprocessor_config.json); pipeline="basic" (FX tracing dies on modern-transformers attention dispatch); compress exactly once (output_dir or save_pretrained, not both).
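To illustrate how an `ignore` list separates the BF16 modules from the quantized experts, here is a standalone sketch. It assumes the llm-compressor convention that entries prefixed with `re:` are regular expressions matched from the start of the module name and that other entries are exact names. Every pattern and module name below is hypothetical; the real list is in `recipes/quant-mistral4-heretic.py`.

```python
import re

# HYPOTHETICAL patterns mirroring the "Kept BF16" row of the table above.
IGNORE = [
    "lm_head",
    r"re:.*self_attn.*",
    r"re:.*mlp\.gate$",
    r"re:.*vision_tower.*",
    r"re:.*multi_modal_projector.*",
]


def is_ignored(module_name: str, ignore: list[str] = IGNORE) -> bool:
    """True if the module would stay in BF16 under this ignore list."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# HYPOTHETICAL module names, chosen only to exercise each rule.
examples = [
    "lm_head",
    "model.language_model.layers.0.self_attn.q_a_proj",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.experts.7.down_proj",
    "model.language_model.layers.0.mlp.shared_experts.gate_proj",
]
for name in examples:
    print(f"{'BF16 ' if is_ignored(name) else 'NVFP4'}  {name}")
```

The `$` anchor on the router pattern matters: without it, `mlp.gate` would also match names that merely begin with `gate`, such as an expert's `gate_proj`.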
Credits & caveats
- Base abliteration: darkc0de; base model: Mistral AI (Mistral Small 4).
- Abliterated model — refusal behavior is reduced. You are responsible for how you use it. Inherits the base model's license and limitations.
- Provided as-is, no warranty. NVFP4 is lossy; validate for your use case.