Behemoth-X-123B-v2.2 — NVFP4 (compressed-tensors)

NVFP4 (4-bit floating-point, W4A4, group_size=16) quantization of TheDrummer/Behemoth-X-123B-v2.2, produced via a custom 3-node heterogeneous distributed pipeline on a personal 2× NVIDIA DGX Spark + RTX 3090 setup.

To my knowledge this is the first publicly available NVFP4 quantization of a Mistral-Large-derived 123B-class model, and the first to demonstrate a heterogeneous Blackwell-UMA + Ampere-VRAM Ray cluster running modelopt's NVFP4 export pipeline end-to-end.


Quick facts

| Field | Value |
|---|---|
| Base model | TheDrummer/Behemoth-X-123B-v2.2 (Mistral-Large-2411 finetune) |
| Architecture | MistralForCausalLM, 88 layers, hidden_size=12288, 96 attn heads, 8 KV heads, head_dim=128 |
| Original size | ~228 GB (BF16) |
| Quantized size | ~66 GB (see Files tab) |
| Quant format | NVFP4 via nvidia-modelopt 0.43.0 |
| Storage layout | compressed-tensors (vLLM-native) |
| lm_head | Kept BF16 (unquantized), listed in quantization_config.ignore |
| KV cache | Configurable at serve time (FP8 recommended) |
| Calibration data | 256 samples from cnn_dailymail, lengths 150–1200 tokens |
| Conversion date | 2026-05-14 |

Why this exists

Quantizing 123B-class models on personal hardware is not as turnkey as it sounds:

  1. Single-node modelopt fails on Spark. The standard modelopt hf_ptq.py workflow is silently OOM-killed on the GB10 because accelerate.infer_auto_device_map misdetects unified memory as a 5.2 TB GPU.

  2. Two Sparks (256 GB combined UMA) aren't enough for Behemoth at full precision. A 50/50 layer split would put ~115 GB on each Spark, but Ray's actor overhead plus Phase-3 calibration activation buffers push that over the 128 GB UMA ceiling. Both attempts at a 2-Spark Behemoth quant hit the OOM-killer during calibration.

  3. The fix: add a third node. A 41/41/6-layer split across 2 Sparks + an RTX 3090 (24 GB VRAM, hosted in a Proxmox VM, attached over plain 2.5 GbE LAN) brings each Spark's load down to ~118 GB and ~115 GB with comfortable headroom, while the 3090 handles the tail 6 layers + lm_head + norm in ~22 GB VRAM. Ray RPC handles cross-node hidden-state passing transparently.

This is the first model produced with the 3-node N-shard variant of the distrib-nvfp4 pipeline. The pipeline is open-source at github.com/KaletoAI/distrib-nvfp4 (Apache 2.0) — same scaffold, with --shard-layers a,b,c for arbitrary N-way splits and automatic memory-sorted node placement so the smallest-VRAM node gets the smallest shard.
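
The memory-sorted placement is simple enough to sketch. This is an illustration only, not the pipeline's actual API; place_shards and its inputs are hypothetical:

def place_shards(nodes, shard_layer_counts):
    """Pair each node with a shard so the smallest-memory node gets the
    smallest shard. nodes: list of (name, free_gb) tuples."""
    by_mem = sorted(nodes, key=lambda n: n[1])     # smallest memory first
    by_size = sorted(shard_layer_counts)           # smallest shard first
    return {name: layers for (name, _), layers in zip(by_mem, by_size)}

# place_shards([("DX10-01", 128), ("DX10-02", 128), ("egpu", 24)], [41, 41, 6])
# -> {"egpu": 6, "DX10-01": 41, "DX10-02": 41}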


The hardware: 2× DGX Spark + 1× RTX 3090

The cluster used to produce this artifact:

| Node | GPU | Memory | Role |
|---|---|---|---|
| DX10-01 (GB10 Spark) | NVIDIA GB10 (sm_121) | 128 GB UMA | shard0: layers 0–40 + embed_tokens |
| DX10-02 (GB10 Spark) | NVIDIA GB10 (sm_121) | 128 GB UMA | shard1: layers 41–81 |
| eGPU host (Proxmox VM) | NVIDIA RTX 3090 (sm_86) | 24 GB VRAM | shard2: layers 82–87 + final norm + lm_head |
  • ConnectX-7 200 GbE IB between the two Sparks (Ray RPC over IB)
  • Plain 2.5 GbE LAN between Sparks ↔ eGPU host (Ray RPC over LAN)
  • NFS-shared source weights so all three nodes read from the same path

Power draw: the two Sparks pulled ~280 W combined at the wall during calibration; the 3090 host added ~150 W on top.

Heterogeneous GPUs (Ampere alongside Blackwell) participating in the same Ray cluster and modelopt quantization run is interesting in itself: Ampere has no native FP4 hardware, but calibration only requires BF16 math, so the architecture mismatch doesn't matter until inference time. The exported NVFP4 model is identical to what an all-Blackwell cluster would produce.


Quantization Pipeline (short version)

Each of the three Ray actors owns a contiguous slice of layers and materializes only its own weights, via init_empty_weights, selective set_module_tensor_to_device, and per-tensor streaming reads through safetensors.safe_open. This lets a Spark hold ~115 GB of BF16 weights plus Ray overhead inside a ~121 GB usable UMA budget without triggering the kernel OOM-killer.
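
A minimal sketch of that per-actor load, assuming a hypothetical shard_index mapping (safetensors file → tensor keys) derived from model.safetensors.index.json; the real implementation lives in the distrib-nvfp4 repo:

import torch
from accelerate import init_empty_weights
from accelerate.utils import set_module_tensor_to_device
from safetensors import safe_open
from transformers import AutoConfig, AutoModelForCausalLM

def load_shard(model_path, shard_index):
    """shard_index: iterable of (shard_file, [tensor keys]) pairs covering
    only this actor's layer slice (hypothetical helper structure)."""
    config = AutoConfig.from_pretrained(model_path)
    with init_empty_weights():
        # Every parameter starts on the meta device: zero memory cost.
        model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.bfloat16)
    for shard_file, keys in shard_index:
        with safe_open(shard_file, framework="pt", device="cpu") as f:
            for key in keys:
                # Materialize one tensor at a time; weights outside this
                # actor's slice stay on meta and never touch UMA/VRAM.
                set_module_tensor_to_device(model, key, "cuda",
                                            value=f.get_tensor(key))
    return model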

modelopt's mtq.quantize(wrapper, NVFP4_DEFAULT_CFG, forward_loop=None) inserts the W4A4 quantizers in calibration mode without running its own forward. The driver routes hidden states between actors in a chain (shard0.forward_first → shard1.forward_middle → shard2.forward_second) over Ray RPC for each calibration sample.
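
The driver loop is conceptually just a chained RPC per sample. A sketch, with actor handles and the dataloader elided; method names follow the chain above:

import ray

def run_calibration(shard0, shard1, shard2, calib_dataloader):
    # Route each sample's hidden states down the actor chain over Ray RPC;
    # activations pass by value through Ray's object store.
    for sample in calib_dataloader:                      # 256 cnn_dailymail samples
        hidden = ray.get(shard0.forward_first.remote(sample["input_ids"]))
        hidden = ray.get(shard1.forward_middle.remote(hidden))
        ray.get(shard2.forward_second.remote(hidden))    # tail layers + norm + lm_head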

After 256 samples × variable length, each actor finalizes its quantizers, evicts its quantized layers to disk via cloudpickle.dumps (modelopt 0.43's QuantLinear is a dynamically generated subclass that vanilla pickle can't serialize), then streams a per-layer NVFP4 export via mte.export_hf_checkpoint on a 1-layer-at-a-time template (with use_cache=False to dodge a transformers DynamicCache shape mismatch). The driver then:

  • gathers per-shard exports (auto-rsync from VM-local disk for nodes off the NFS share),
  • merges per-actor shards into a single HF compressed-tensors model, renaming layer indices on shards 1 and 2 by the cumulative offset,
  • copies tokenizer files and patches config.json to keep lm_head BF16,
  • injects input_scale=1.0 for every weight quantizer (modelopt 0.43 omits these, but vLLM's loader requires them; a sketch follows).
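
A sketch of that injection pass over a single shard file. The FP32 scalar dtype is an assumption that vLLM accepts; a real pass would also update model.safetensors.index.json and preserve shard metadata:

import torch
from safetensors.torch import load_file, save_file

path = "model-00001-of-00014.safetensors"
tensors = load_file(path)
for key in [k for k in tensors if k.endswith(".weight_scale_2")]:
    # vLLM's compressed-tensors loader expects an input_scale next to every
    # weight_scale_2; modelopt 0.43 omits it, so write a 1.0 placeholder.
    tensors.setdefault(key.replace("weight_scale_2", "input_scale"),
                       torch.tensor(1.0, dtype=torch.float32))
save_file(tensors, path)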

Calibration health-check passed cleanly on the run that produced this artifact:

  • shard0 (layers 0–40 + embed): good=287, zero=0, nan=0
  • shard1 (layers 41–81): good=287, zero=0, nan=0
  • shard2 (layers 82–87 + norm + lm_head): good=42, zero=0, nan=0

(NVFP4_DEFAULT_CFG inserts 7 quantizers per layer for Mistral arch.)


Performance

Tested on a single DGX Spark (GB10) running vLLM with this NVFP4 model loaded.

Stock vLLM (CUTLASS GEMM, default backend)

| Workload | Token generation (per stream) | Notes |
|---|---|---|
| Short prompt, 200 tok output | 2.86 tok/s | 5-run median, std-dev <1 % |
| ~2.6 K prefill + 200 tok output | 2.25 tok/s | single run |

Tuned: MARLIN-GEMM + FlashInfer (Avarok stack)

Adding the three env vars and one flag:

VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_TEST_FORCE_FP8_MARLIN=1
VLLM_MARLIN_USE_ATOMIC_ADD=1

plus --attention-backend flashinfer on the serve command, gives this on the same Spark and model:

| Workload | Token generation | Speedup |
|---|---|---|
| Short prompt, 200 tok output | 3.21 tok/s | +12 % |

3 sequential runs all returned 3.21 tok/s (62 247 / 62 254 / 62 256 ms wall) — the MARLIN-GEMM path is essentially deterministic. Behemoth's relative speedup is smaller than e.g. Anubis-Pro-105B (+22 % under the same switch) because at 123B the decode is more memory-bound on KV cache reads — MARLIN's faster GEMM matters less when reading the cache is the bottleneck.

Cold load (vLLM startup, first-request end-to-end from disk): ~430 s (7:10) with the stock backend on a single Spark for the 66 GB of NVFP4 shards. MARLIN's first-time kernel JIT compile may add a one-time 30–60 s; the compiled kernels are cached for subsequent loads.

Stock-bench config: --quantization compressed-tensors --kv-cache-dtype fp8 --max-num-seqs 4 --gpu-memory-utilization 0.90 with vLLM 0.20.2rc1.dev53+g01b9b5af6 and no runtime env-var tuning. Tuned-bench: same plus the env vars and flag above. See Avarok's blog post for background on the MARLIN port.


Usage

vLLM (direct)

Recommended on GB10 — the tuned Spark stack:

VLLM_NVFP4_GEMM_BACKEND=marlin \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
vllm serve /path/to/Behemoth-X-123B-v2.2-NVFP4 \
  --served-model-name Behemoth-X-123B-v2.2-NVFP4 \
  --attention-backend flashinfer \
  --quantization compressed-tensors \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --port 9006

--gpu-memory-utilization 0.90 for the 66 GB Behemoth NVFP4 leaves ~43 GB KV-cache pool on a 128 GB UMA Spark — enough for 32 K context at max-num-seqs 4. Drop to 0.85 if you don't need the longer context.
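
Back-of-envelope for that figure. The runtime overhead term is an assumption; vLLM logs the exact KV pool size at startup:

budget = 0.90 * 128           # GB visible to vLLM on a 128 GB UMA Spark ≈ 115.2
weights = 66                  # NVFP4 shards
overhead = 6                  # activations, CUDA graphs, workspace (assumption)
kv_pool = budget - weights - overhead
print(f"{kv_pool:.0f} GB")    # ≈ 43 GB for the KV cache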

llama-swap entry

"Behemoth-X-123B-v2.2-NVFP4":
  proxy: "http://127.0.0.1:9006"
  ttl: 0
  checkEndpoint: "/health"
  env:
    - "VLLM_NVFP4_GEMM_BACKEND=marlin"
    - "VLLM_TEST_FORCE_FP8_MARLIN=1"
    - "VLLM_MARLIN_USE_ATOMIC_ADD=1"
  cmd: >-
    /home/<user>/vllm-env/bin/python3 -m vllm.entrypoints.openai.api_server
    --model /home/<user>/models/Behemoth-X-123B-v2.2-NVFP4
    --attention-backend flashinfer
    --served-model-name Behemoth-X-123B-v2.2-NVFP4
    --quantization compressed-tensors
    --dtype auto
    --kv-cache-dtype fp8
    --max-model-len 32768
    --max-num-seqs 4
    --gpu-memory-utilization 0.90
    --trust-remote-code
    --enable-chunked-prefill
    --enable-prefix-caching
    --port 9006
    --host 127.0.0.1

Recommended sampling

From TheDrummer's original Behemoth-X-123B-v2.2 card:

  • Chat template: Metharme with Mistral system tokens — [SYSTEM_PROMPT] <|system|>{{system_message}}[/SYSTEM_PROMPT]<|user|>{{user_message}}<|model|>{{assistant_message}}
  • Drummer's KoboldCpp-Frankensampling settings are a good baseline; specifically, temp 0.95–1.05, min-p 0.025, and smoothing factor ~0.2 work well for the "chaos edition" variant (see the example request below)
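
A hedged example of applying those settings against the vLLM server above. min_p is passed through vLLM's extra_body extension; KoboldCpp's smoothing factor has no direct vLLM equivalent, so it is omitted here:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9006/v1", api_key="unused")
resp = client.chat.completions.create(
    model="Behemoth-X-123B-v2.2-NVFP4",
    messages=[
        {"role": "system", "content": "You are a narrator."},
        {"role": "user", "content": "Set the scene."},
    ],
    temperature=1.0,                 # within Drummer's 0.95-1.05 range
    extra_body={"min_p": 0.025},     # vLLM sampling extension
)
print(resp.choices[0].message.content)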

Files in this repository

  • model-NNNNN-of-00014.safetensors — 14 shards, NVFP4-packed weights + scales (~66 GB total)
  • model.safetensors.index.json — weight map (~2 643 keys: 88 layers × 7 quant linears × 4 keys each + norms + embed + lm_head + injected input_scale keys)
  • config.json — Mistral config with quantization_config.ignore=["lm_head"] and input_activations.dynamic: true
  • hf_quant_config.json, generation_config.json — auxiliary modelopt + generation configs
  • tokenizer.json, tokenizer.model, tokenizer_config.json, special_tokens_map.json — Mistral tokenizer, untouched from upstream

Recent fixes baked into the conversion

modelopt 0.43's NVFP4 export needs six gotchas worked around before vLLM will serve the output without producing garbage:

  1. Phase-6 requires vocab_size=2 (not 1) on the per-layer template because modelopt's internal llm_dummy_forward feeds torch.ones([1, 2]) into the embedding.
  2. Phase-6 requires pad_token_id, bos_token_id, and eos_token_id set to None on the template config; otherwise modelopt's pad-eos consistency check asserts.
  3. Phase-6 must NOT clear _calibrator on quantized modules; modelopt's set_quantizer_by_cfg_context.__exit__ raises an AttributeError on None.
  4. Per-actor exports omit input_scale keys; vLLM registers an uninitialized Parameter and produces garbage decoding unless input_scale=1.0 is injected for every .weight_scale_2 key.
  5. config.json after merge needs input_activations.dynamic: true (modelopt writes false but emits no static scale, so vLLM falls back to a default that doesn't match the quantization).
  6. Merged config must restore num_hidden_layers, vocab_size, and pad/bos/eos token IDs from the source model, since Phase-6 used shrunken dummies. (Both config fixes are sketched after this list.)
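
A sketch of the merge-time config patch covering fixes 5 and 6 above. The config_groups key layout follows compressed-tensors conventions; verify it against your own export:

import json

cfg = json.load(open("merged/config.json"))
src = json.load(open("source/config.json"))

# Fix 6: restore fields that Phase-6's shrunken dummy template clobbered.
for key in ("num_hidden_layers", "vocab_size",
            "pad_token_id", "bos_token_id", "eos_token_id"):
    cfg[key] = src[key]

# Fix 5: modelopt writes dynamic=false but emits no static scale; flip to dynamic.
for group in cfg["quantization_config"]["config_groups"].values():
    group["input_activations"]["dynamic"] = True

json.dump(cfg, open("merged/config.json", "w"), indent=2)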

For N-shard mode (this Behemoth release used 3 shards), three additional fixes:

  1. Phase-5.5 disk eviction must use cloudpickle as pickle_module for torch.save, since modelopt 0.43's QuantLinear is a dynamically generated subclass that vanilla pickle can't serialize.
  2. Phase-6 loads the saved layer into a 1-layer template, so it must force cfg_t.use_cache=False and explicitly reset layer.self_attn.layer_idx = 0 to avoid an IndexError in transformers' DynamicCache: the loaded layer retains its original layer_idx from the full model (e.g. 50), which the template's 1-slot cache doesn't have. (Both fixes are sketched after this list.)
  3. Phase-6.5 (new) — auto-rsync per-shard exports from any non-NFS-shared actor (e.g. eGPU host VM) back to the driver before merge.
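
Sketches of fixes 1 and 2, wrapped as standalone helpers with illustrative names:

import cloudpickle
import torch

def evict_layer(quantized_layer, path):
    # Fix 1: modelopt 0.43's QuantLinear is generated at runtime, so vanilla
    # pickle can't locate its class; cloudpickle serializes it by value.
    torch.save(quantized_layer, path, pickle_module=cloudpickle)

def reload_into_template(path, template_config):
    # Fix 2: the saved layer remembers its original layer_idx (e.g. 50), which
    # the 1-slot DynamicCache of the export template doesn't have.
    layer = torch.load(path, pickle_module=cloudpickle, weights_only=False)
    layer.self_attn.layer_idx = 0          # the template's cache has one slot
    template_config.use_cache = False      # dodge the DynamicCache IndexError
    return layer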

All nine fixes are applied automatically by the pipeline at github.com/KaletoAI/distrib-nvfp4.


Acknowledgments

  • TheDrummer for the original Behemoth-X-123B-v2.2 finetune
  • Avarok-Cybersecurity (tbraun96) for the MARLIN-backend NVFP4 GEMM port that delivers the +12 % short-context speedup measured above
  • saricles for setting the bar on GB10-tuned NVFP4 recipes — this release uses the stock NVFP4_DEFAULT_CFG with only lm_head in the ignore list, NOT the agentic-mix-tuned -GB10 recipe; a future v2 might apply that
  • NVIDIA for the DGX Spark / GB10 platform, the NVFP4 format, and modelopt
  • vLLM project for compressed-tensors NVFP4 inference support

License

This NVFP4 quantization inherits the Mistral Research License (MRL) from the base model TheDrummer/Behemoth-X-123B-v2.2, which is itself derived from Mistral-Large-2411. For research, evaluation, and personal non-commercial use only. For commercial deployment, obtain a Mistral commercial license.


Status

Single-author release; first public NVFP4 of a 123B Mistral-Large derivative; first model produced with a heterogeneous 3-node pipeline. Issues and feedback welcome — both on the model artifact (vLLM behaviour, sampling, RP-quality reports) and on the pipeline that built it.
