Mistral-Small-4-119B-Heretic-NVFP4

NVFP4 (W4A4) quantization of darkc0de/Mistral-Small-4-119B-2603-heretic — an abliterated Mistral Small 4 (119B, MoE).

To our knowledge this is the first publicly available NVFP4 quant of a Mistral 4 model outside RedHat's toolchain. ~70 GB (down from ~238 GB BF16), runnable on a single 96–128 GB Blackwell GPU (e.g. NVIDIA GB10 / DGX Spark).

⚠️ READ THE "Serving" SECTION FIRST. This checkpoint is in HuggingFace compressed-tensors format, which does not load on stock vLLM ≤ 0.20.1 (no mistral4 HF text-model class → ValueError: No model architectures are specified). You either need a newer vLLM with Mistral 4 HF support, or convert to native Mistral format with the script in recipes/ (takes ~6 min, pure tensor-rename). Full instructions below.

What this is

Base: darkc0de/Mistral-Small-4-119B-2603-heretic (abliterated Mistral Small 4, 119B MoE: 36 layers, 128 experts, 4 active + 1 shared, MLA attention)
Scheme: NVFP4 W4A4, compressed-tensors nvfp4-pack-quantized (4-bit weights, 4-bit dynamic activations, group size 16)
Quantized: all routed + shared expert Linears (the bulk of the weights)
Kept BF16: attention (MLA), router/gate, lm_head, vision tower, mm-projector
Tooling: llm-compressor + a custom CalibrationMistral4MoE wrapper (see Reproduction)
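
As a rough sanity check on the ~70 GB figure, the size arithmetic can be sketched as follows. It assumes the usual NVFP4 layout (a 4-bit value plus one 8-bit FP8 scale per group of 16, i.e. 4.5 bits per weight) and ignores per-tensor global scales and file metadata. The function names and the quantized_fraction parameter are illustrative, not part of any tool used here.

```python
def nvfp4_bits_per_weight(weight_bits: int = 4, scale_bits: int = 8, group_size: int = 16) -> float:
    """Effective bits per weight: the packed 4-bit value plus its share of the per-group scale."""
    return weight_bits + scale_bits / group_size


def checkpoint_gb(total_params: float, quantized_fraction: float) -> float:
    """Approximate size in GB (1e9 bytes) when a fraction of the weights is NVFP4 and the rest BF16."""
    quantized_bytes = total_params * quantized_fraction * nvfp4_bits_per_weight() / 8
    bf16_bytes = total_params * (1.0 - quantized_fraction) * 2  # BF16 = 2 bytes per weight
    return (quantized_bytes + bf16_bytes) / 1e9


if __name__ == "__main__":
    total = 119e9  # parameter count taken from the model name
    print(f"all BF16:  {checkpoint_gb(total, 0.0):.1f} GB")  # upper bound
    print(f"all NVFP4: {checkpoint_gb(total, 1.0):.1f} GB")  # lower bound
```

The real checkpoint sits between the two bounds, close to the lower one, because attention, router, lm_head and the vision parts stay in BF16 while the experts hold the bulk of the weights.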

⚙️ Serving

Mistral 4 uses MLA attention + grouped MoE (DeepSeek-V2-like). Two things bite:

  1. Format. Stock vLLM 0.20.1 has Mistral3ForConditionalGeneration but no Mistral4 HF text class, so this HF checkpoint won't resolve. Convert to native Mistral format (consolidated-*.safetensors + params.json + tekken.json) — it's a byte-identical tensor rename (the nvfp4-pack-quantized payload is the same), see recipes/convert-mistral4-to-native.py. Keep the vision tower (splice the base model's BF16 vision tensors) so vLLM resolves it as PixtralForConditionalGeneration — strip vision and it falls back to deepseek_v2 and the MLA path breaks (see #2).

  2. ⭐ On Blackwell (SM120/SM121 / GB10), you MUST set the env below. Without VLLM_MLA_DISABLE=1, the only available MLA backend is TRITON_MLA, whose decode kernel crashes on Mistral 4's kv_lora_rank=256 with ValueError: Cannot make_shape_compatible: incompatible dimensions at index 1: 256 and 512. Disabling MLA routes attention through FLASH_ATTN, which works.

# proven on GB10 / vLLM 0.20.1; see recipes/serve-mistral4-heretic-native.sh
#   VLLM_MLA_DISABLE=1             ⭐ FLASH_ATTN instead of the broken TRITON_MLA decode kernel
#   VLLM_USE_FLASHINFER_MOE_FP4=0  -> MARLIN MoE (SM12x-stable), not flashinfer-cutlass
VLLM_MLA_DISABLE=1 \
VLLM_NVFP4_GEMM_BACKEND=marlin \
VLLM_USE_FLASHINFER_MOE_FP4=0 \
TORCH_CUDA_ARCH_LIST=12.1a ENABLE_NVFP4_SM100=0 \
VLLM_ENGINE_CORE_STARTUP_TIMEOUT=600 \
vllm serve ./heretic-native \
  --load-format mistral --config-format mistral --tokenizer-mode mistral \
  --tensor-parallel-size 1 --gpu-memory-utilization 0.85 \
  --enable-auto-tool-choice --tool-call-parser mistral

A healthy boot logs Resolved architecture: PixtralForConditionalGeneration, Using FLASH_ATTN, and Using 'MARLIN' NvFp4 MoE backend.
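
One way to check a captured startup log for those three markers is sketched below. The marker strings are the ones quoted above; the function name and the way the log text is obtained are assumptions for illustration, and exact log wording may differ between vLLM versions.

```python
EXPECTED_BOOT_MARKERS = (
    "Resolved architecture: PixtralForConditionalGeneration",
    "Using FLASH_ATTN",
    "Using 'MARLIN' NvFp4 MoE backend",
)


def missing_boot_markers(log_text: str) -> list[str]:
    """Return the expected markers that do not appear anywhere in the log text."""
    return [marker for marker in EXPECTED_BOOT_MARKERS if marker not in log_text]


if __name__ == "__main__":
    # Illustrative log fragment that lacks the MoE backend line.
    sample_log = "\n".join(EXPECTED_BOOT_MARKERS[:2])
    print(missing_boot_markers(sample_log))
```

An empty list means all three markers were found. A missing FLASH_ATTN marker is the one to look at first on SM12x, since it suggests the MLA path is still active.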

On datacenter Blackwell (B200/SM100) and other arches the MLA path may work without VLLM_MLA_DISABLE; the env above is the GB10/SM12x-validated recipe.
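
Before launching on GB10, the environment can be compared against the recipe above with a small preflight check. This is a sketch: the variable names and values are copied from the serve command, while the helper name and its return shape are made up for illustration.

```python
import os
from typing import Mapping, Optional

# Values copied from the GB10 serve recipe above.
GB10_RECIPE_ENV = {
    "VLLM_MLA_DISABLE": "1",
    "VLLM_NVFP4_GEMM_BACKEND": "marlin",
    "VLLM_USE_FLASHINFER_MOE_FP4": "0",
    "TORCH_CUDA_ARCH_LIST": "12.1a",
    "ENABLE_NVFP4_SM100": "0",
    "VLLM_ENGINE_CORE_STARTUP_TIMEOUT": "600",
}


def env_mismatches(env: Optional[Mapping[str, str]] = None) -> dict[str, tuple[Optional[str], str]]:
    """Map each variable that differs from the recipe to (actual value or None, expected value)."""
    env = os.environ if env is None else env
    return {
        name: (env.get(name), expected)
        for name, expected in GB10_RECIPE_ENV.items()
        if env.get(name) != expected
    }


if __name__ == "__main__":
    for name, (actual, expected) in env_mismatches().items():
        print(f"{name}: got {actual!r}, recipe uses {expected!r}")
```

On other architectures a mismatch is not necessarily an error; the check only reports drift from the GB10-validated settings.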

✅ Validation

Served on GB10 (vLLM 0.20.1) and smoke-tested:

  • Coherent — clean generation, no !!!! (the failure mode of weight-only NVFP4-A16 on vLLM)
  • Abliteration intact — answers blunt prompts directly, no refusal scaffolding
  • Digit-precision verbatim — exact figures/IPs/CVEs survive the W4A4 quant (the main risk for a quant used to quote stats)
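
The digit-precision check can be approximated with a small script that pulls figures, IPv4 addresses and CVE IDs out of a reference text and reports any that the model's answer failed to reproduce exactly. This is a sketch of the idea, not the procedure used for the smoke test above; the regex, the function names and the sample strings are illustrative.

```python
import re

_FACT_RE = re.compile(
    r"CVE-\d{4}-\d{4,}"              # CVE identifiers
    r"|\b(?:\d{1,3}\.){3}\d{1,3}\b"  # dotted-quad IPv4 addresses
    r"|\b\d+(?:\.\d+)?\b"            # plain integers and decimals
)


def extract_facts(text: str) -> list[str]:
    """Return CVE IDs, IPv4 addresses and numbers in order of appearance."""
    return _FACT_RE.findall(text)


def dropped_facts(source: str, output: str) -> list[str]:
    """Return facts from the source that do not appear verbatim among the facts in the output."""
    produced = set(extract_facts(output))
    return [fact for fact in extract_facts(source) if fact not in produced]


if __name__ == "__main__":
    # Made-up sample: placeholder CVE ID and a documentation-range IP address.
    source = "CVE-2099-12345 hit 192.0.2.10 in 97.5% of runs"
    answer = "CVE-2099-12354 hit 192.0.2.10 in 97.5% of runs"  # two digits transposed
    print(dropped_facts(source, answer))
```

Comparing extracted tokens rather than raw substrings avoids counting 97.5 as present when the answer says 197.55.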

🔁 Reproduction

The build's real blocker wasn't secret — it was a missing ~200-line MoE calibration wrapper plus pipeline="basic". See recipes/:

  • mistral4-moe-wrapper.py: CalibrationMistral4MoE (un-fuses Mistral4NaiveMoe into per-expert FX-traceable MLPs for calibration). Upstream-PR candidate for llm-compressor modeling/mistral4.py.
  • quant-mistral4-heretic.py: the oneshot() quant (transformers 5.5.3, llm-compressor PR #2608, pipeline="basic", NVFP4, the ignore list above)
  • convert-mistral4-to-native.py / extract-vanilla-vision.py / validate-namemap.py: HF→native conversion + vision splice
  • serve-mistral4-heretic-native.sh: the full GB10 serve recipe
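
Since the HF→native conversion is described as a pure tensor rename, its core idea can be sketched as a key rewrite that leaves every payload untouched, plus a digest comparison to confirm it. The rename rules below are hypothetical placeholders; the real name map lives in recipes/convert-mistral4-to-native.py, and tensors are modeled here as raw bytes rather than loaded from safetensors files.

```python
import hashlib

# HYPOTHETICAL rename rules, for illustration only (not the real Mistral 4 name map).
HYPOTHETICAL_RENAMES = (
    ("model.language_model.layers.", "layers."),
    (".self_attn.", ".attention."),
)


def rename_key(name: str) -> str:
    """Apply each (old, new) substring rewrite in order."""
    for old, new in HYPOTHETICAL_RENAMES:
        name = name.replace(old, new)
    return name


def rename_tensors(tensors: dict[str, bytes]) -> dict[str, bytes]:
    """Return a new mapping with renamed keys; payloads are passed through unchanged."""
    renamed: dict[str, bytes] = {}
    for name, payload in tensors.items():
        new_name = rename_key(name)
        if new_name in renamed:
            raise ValueError(f"name collision after rename: {new_name}")
        renamed[new_name] = payload  # same bytes object, never re-encoded
    return renamed


def payload_digests(tensors: dict[str, bytes]) -> list[str]:
    """Sorted SHA-256 digests of all payloads, independent of tensor names."""
    return sorted(hashlib.sha256(payload).hexdigest() for payload in tensors.values())
```

If payload_digests() returns the same list before and after conversion, only names changed. The collision check guards against two source tensors mapping to the same native name.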

Key gotchas: load via AutoModelForImageTextToText + AutoTokenizer (no preprocessor_config.json); pipeline="basic" (FX tracing dies on modern-transformers attention dispatch); compress exactly once (output_dir or save_pretrained, not both).

Credits & caveats

  • Base abliteration: darkc0de; base model: Mistral AI (Mistral Small 4).
  • Abliterated model — refusal behavior is reduced. You are responsible for how you use it. Inherits the base model's license and limitations.
  • Provided as-is, no warranty. NVFP4 is lossy; validate for your use case.