Behemoth-X-123B-v2.2 — NVFP4 (compressed-tensors)
NVFP4 (4-bit floating-point, W4A4, group_size=16) quantization of TheDrummer/Behemoth-X-123B-v2.2, produced via a custom 3-node heterogeneous distributed pipeline on a personal 2× NVIDIA DGX Spark + RTX 3090 setup.
To my knowledge this is the first publicly available NVFP4 quantization of a Mistral-Large-derived 123B-class model, and the first to demonstrate a heterogeneous Blackwell-UMA + Ampere-VRAM Ray cluster running modelopt's NVFP4 export pipeline end-to-end.
Quick facts
| Field | Value |
|---|---|
| Base model | TheDrummer/Behemoth-X-123B-v2.2 (Mistral-Large-2411 finetune) |
| Architecture | MistralForCausalLM, 88 layers, hidden_size=12288, 96 attn heads, 8 KV heads, head_dim=128 |
| Original size | ~228 GB (BF16) |
| Quantized size | ~66 GB (see Files tab) |
| Quant format | NVFP4 via nvidia-modelopt 0.43.0 |
| Storage layout | compressed-tensors (vLLM-native) |
| lm_head | Kept BF16 (unquantized), in quantization_config.ignore |
| KV cache | Configurable at serve time (FP8 recommended) |
| Calibration data | 256 samples from cnn_dailymail, lengths 150–1200 tokens |
| Conversion date | 2026-05-14 |
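The size figures above are consistent with NVFP4's effective bits-per-weight; a quick back-of-envelope check (pure arithmetic; assumes the table's GB figures are binary gigabytes and ignores the unquantized lm_head and the small per-tensor second-level scales):

```python
# Back-of-envelope check of the sizes in the Quick facts table.
# NVFP4 stores 4-bit weights plus one FP8 (8-bit) scale per group of 16,
# so the amortized cost is 4 + 8/16 = 4.5 bits per weight.
params = 123e9                                 # ~123B parameters
bf16_gib = params * 2 / 2**30                  # 2 bytes per BF16 weight
nvfp4_gib = params * (4 + 8 / 16) / 8 / 2**30  # 4.5 bits per weight
print(f"BF16:  ~{bf16_gib:.0f} GiB")           # ~229, vs ~228 GB in the table
print(f"NVFP4: ~{nvfp4_gib:.0f} GiB")          # ~64, vs ~66 GB in the table
```

The small remaining gap is the BF16 lm_head and the scale tensors that the 4.5-bit estimate ignores.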
Why this exists
Quantizing 123B-class models for personal hardware is not as turn-key as it sounds:
- **Single-node modelopt fails on Spark.** The standard `modelopt hf_ptq.py` workflow silently OOM-kills on the GB10 because `accelerate.infer_auto_device_map` misdetects unified memory as a 5.2 TB GPU.
- **Two Sparks (256 GB combined UMA) aren't enough for Behemoth at full precision.** A 50/50 layer split would put ~115 GB on each Spark, but Ray's actor overhead plus Phase-3 calibration activation buffers push that over the 128 GB UMA ceiling. Both attempts at a 2-Spark Behemoth quant hit the OOM-killer during calibration.
- **The fix: add a third node.** A 41/41/6-layer split across 2 Sparks + an RTX 3090 (24 GB VRAM, hosted in a Proxmox VM, attached over plain 2.5 GbE LAN) brings each Spark's load down to ~118 GB and ~115 GB with comfortable headroom, while the 3090 handles the tail 6 layers + lm_head + norm in ~22 GB VRAM. Ray RPC handles cross-node hidden-state passing transparently.
This is the first model produced with the 3-node N-shard variant of the distrib-nvfp4 pipeline. The pipeline is open-source at github.com/KaletoAI/distrib-nvfp4 (Apache 2.0) — same scaffold, with --shard-layers a,b,c for arbitrary N-way splits and automatic memory-sorted node placement so the smallest-VRAM node gets the smallest shard.
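The automatic memory-sorted placement described above can be sketched in a few lines (illustrative only, not the actual distrib-nvfp4 code; node names and numbers mirror this release's cluster):

```python
# Illustrative sketch of "memory-sorted node placement": the node with the
# least free memory gets the smallest layer shard.

def place_shards(nodes, shard_layers):
    """nodes: {name: free_memory_gb}; shard_layers: layer counts per shard.
    Pairs shards and nodes in ascending size order."""
    nodes_by_mem = sorted(nodes, key=nodes.get)  # smallest memory first
    shards_by_size = sorted(shard_layers)        # smallest shard first
    return dict(zip(nodes_by_mem, shards_by_size))

# Numbers mirror this release: --shard-layers 41,41,6 over 2 Sparks + a 3090.
cluster = {"DX10-01": 128, "DX10-02": 128, "egpu-3090": 24}
placement = place_shards(cluster, [41, 41, 6])
print(placement)  # the 24 GB 3090 gets the 6-layer tail shard
```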
The hardware: 2× DGX Spark + 1× RTX 3090
The cluster used to produce this artifact:
| Node | GPU | Memory | Role |
|---|---|---|---|
| DX10-01 (GB10 Spark) | NVIDIA GB10 (sm_121) | 128 GB UMA | shard0: layers 0–40 + embed_tokens |
| DX10-02 (GB10 Spark) | NVIDIA GB10 (sm_121) | 128 GB UMA | shard1: layers 41–81 |
| eGPU host (Proxmox VM) | NVIDIA RTX 3090 (sm_86) | 24 GB VRAM | shard2: layers 82–87 + final norm + lm_head |
- ConnectX-7 200 GbE IB between the two Sparks (Ray RPC over IB)
- Plain 2.5 GbE LAN between Sparks ↔ eGPU host (Ray RPC over LAN)
- NFS-shared source weights so all three nodes read from the same path
Power draw: the two Sparks together pulled ~280 W at the wall during calibration; the 3090 added ~150 W on top.
The fact that heterogeneous GPUs (Ampere alongside Blackwell) participate in the same Ray cluster + modelopt quantization run is interesting in itself: Ampere has no native FP4 hardware, but during calibration it only performs BF16 math, so the architecture mismatch doesn't matter until inference time. The exported NVFP4 model file is identical to what an all-Blackwell cluster would produce.
Quantization Pipeline (short version)
Each of the three Ray actors owns a contiguous slice of layers and materializes only its own weights: `init_empty_weights`, selective `set_module_tensor_to_device`, and per-tensor streaming loads through `safetensors.safe_open`. This lets a Spark hold ~115 GB of BF16 plus Ray overhead inside a 121 GB UMA budget without tripping the kernel OOM-killer.
modelopt's `mtq.quantize(wrapper, NVFP4_DEFAULT_CFG, forward_loop=None)` inserts the W4A4 quantizers in calibration mode without running its own forward pass. The driver routes hidden states between actors in a chain (`shard0.forward_first` → `shard1.forward_middle` → `shard2.forward_second`) over Ray RPC for each calibration sample.
After 256 variable-length samples, each actor finalizes its quantizers, evicts its quantized layers to disk via `cloudpickle.dumps` (modelopt 0.43's `QuantLinear` is a dynamically-generated subclass that vanilla pickle can't serialize), then streams a per-layer NVFP4 export via `mte.export_hf_checkpoint` on a 1-layer-at-a-time template (with `use_cache=False` to dodge a transformers DynamicCache shape mismatch). The driver then:

- gathers the per-shard exports (auto-rsync from VM-local disk for nodes off the NFS share),
- merges the per-actor shards into a single HF compressed-tensors model,
- renames the layer indices on shards 1 and 2 with the cumulative offset,
- copies the tokenizer files,
- patches config.json to keep lm_head in BF16, and
- injects `input_scale=1.0` for every weight quantizer (modelopt 0.43 omits these but vLLM's loader requires them).
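The cumulative-offset renaming during the merge can be illustrated with plain key rewriting (illustrative helper, assuming standard `model.layers.<n>.` key naming; not the pipeline's actual code):

```python
import re

# Sketch of the merge step that renames layer indices on shards 1 and 2:
# each shard exports its layers starting at index 0, so keys must be shifted
# by the cumulative layer count of the shards before it (41 and 82 here).

def shift_layer_keys(state_keys, offset):
    """Rewrite 'model.layers.<n>.' -> 'model.layers.<n+offset>.' in key names."""
    pat = re.compile(r"^model\.layers\.(\d+)\.")
    return [pat.sub(lambda m: f"model.layers.{int(m.group(1)) + offset}.", k)
            for k in state_keys]

# shard1 exported layers 0..40 locally; globally they are layers 41..81
shard1_keys = ["model.layers.0.self_attn.q_proj.weight",
               "model.layers.40.mlp.down_proj.weight"]
shifted = shift_layer_keys(shard1_keys, 41)
print(shifted)
# ['model.layers.41.self_attn.q_proj.weight',
#  'model.layers.81.mlp.down_proj.weight']
```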
Calibration health-check passed cleanly on the run that produced this artifact:
- shard0 (layers 0–40 + embed): good=287, zero=0, nan=0
- shard1 (layers 41–81): good=287, zero=0, nan=0
- shard2 (layers 82–87 + norm + lm_head): good=42, zero=0, nan=0
(NVFP4_DEFAULT_CFG inserts 7 quantizers per layer for Mistral arch.)
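These counts line up with the 41/41/6 layer split and the 7-quantizers-per-layer figure; a trivial arithmetic check:

```python
# Sanity check of the calibration health-check counts above:
# NVFP4_DEFAULT_CFG inserts 7 quantizers per Mistral layer.
quantizers_per_layer = 7
shard_layers = {"shard0": 41, "shard1": 41, "shard2": 6}  # 41/41/6 split

counts = {name: n * quantizers_per_layer for name, n in shard_layers.items()}
print(counts)  # {'shard0': 287, 'shard1': 287, 'shard2': 42}
```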
Performance
Tested on a single DGX Spark (GB10) running vLLM with this NVFP4 model loaded.
Stock vLLM (CUTLASS GEMM, default backend)
| Workload | Token generation (per stream) | Notes |
|---|---|---|
| Short prompt, 200 tok output | 2.86 tok/s | 5-run median, std-dev <1 % |
| ~2.6 K prefill + 200 out | 2.25 tok/s | single run |
Tuned: MARLIN-GEMM + FlashInfer (Avarok stack)
Adding the three env vars and one flag:
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_TEST_FORCE_FP8_MARLIN=1
VLLM_MARLIN_USE_ATOMIC_ADD=1
plus --attention-backend flashinfer on the serve command, gives this on the same Spark and model:
| Workload | Token generation | Speedup |
|---|---|---|
| Short prompt, 200 tok output | 3.21 tok/s | +12 % |
3 sequential runs all returned 3.21 tok/s (62 247 / 62 254 / 62 256 ms wall) — the MARLIN-GEMM path is essentially deterministic. Behemoth's relative speedup is smaller than e.g. Anubis-Pro-105B (+22 % under the same switch) because at 123B the decode is more memory-bound on KV cache reads — MARLIN's faster GEMM matters less when reading the cache is the bottleneck.
Cold load (vLLM startup, first-request end-to-end from disk): ~430 s (7:10) with stock backend on a single Spark for the 66 GB of NVFP4 shards. MARLIN's first-time kernel JIT compile may add a one-time 30–60 s; cached for subsequent loads.
Stock-bench config: --quantization compressed-tensors --kv-cache-dtype fp8 --max-num-seqs 4 --gpu-memory-utilization 0.90 with vLLM 0.20.2rc1.dev53+g01b9b5af6 and no runtime env-var tuning. Tuned-bench: same plus the env vars and flag above. See Avarok's blog post for background on the MARLIN port.
Usage
vLLM (direct)
Recommended on GB10 — the tuned Spark stack:
VLLM_NVFP4_GEMM_BACKEND=marlin \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
vllm serve /path/to/Behemoth-X-123B-v2.2-NVFP4 \
--served-model-name Behemoth-X-123B-v2.2-NVFP4 \
--attention-backend flashinfer \
--quantization compressed-tensors \
--dtype auto \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.90 \
--enable-chunked-prefill \
--enable-prefix-caching \
--port 9006
--gpu-memory-utilization 0.90 for the 66 GB Behemoth NVFP4 leaves ~43 GB KV-cache pool on a 128 GB UMA Spark — enough for 32 K context at max-num-seqs 4. Drop to 0.85 if you don't need the longer context.
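The headroom claim can be sanity-checked from the model geometry in the Quick facts table (pure arithmetic; assumes 1 byte per FP8 cache element and ignores block-allocation overhead):

```python
# Rough FP8 KV-cache budget check for the serve flags above.
layers, kv_heads, head_dim = 88, 8, 128  # from the Quick facts table
bytes_per_token = 2 * layers * kv_heads * head_dim * 1  # K + V, 1 B each (FP8)
max_tokens = 32768 * 4                   # --max-model-len x --max-num-seqs
total_gib = bytes_per_token * max_tokens / 2**30
print(f"{bytes_per_token/1024:.0f} KiB/token, {total_gib:.1f} GiB worst case")
# 176 KiB/token, 22.0 GiB worst case — well inside the ~43 GB pool
```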
llama-swap entry
"Behemoth-X-123B-v2.2-NVFP4":
  proxy: "http://127.0.0.1:9006"
  ttl: 0
  checkEndpoint: "/health"
  env:
    - "VLLM_NVFP4_GEMM_BACKEND=marlin"
    - "VLLM_TEST_FORCE_FP8_MARLIN=1"
    - "VLLM_MARLIN_USE_ATOMIC_ADD=1"
  cmd: >-
    /home/<user>/vllm-env/bin/python3 -m vllm.entrypoints.openai.api_server
    --model /home/<user>/models/Behemoth-X-123B-v2.2-NVFP4
    --attention-backend flashinfer
    --served-model-name Behemoth-X-123B-v2.2-NVFP4
    --quantization compressed-tensors
    --dtype auto
    --kv-cache-dtype fp8
    --max-model-len 32768
    --max-num-seqs 4
    --gpu-memory-utilization 0.90
    --trust-remote-code
    --enable-chunked-prefill
    --enable-prefix-caching
    --port 9006
    --host 127.0.0.1
Recommended sampling
From TheDrummer's original Behemoth-X-123B-v2.2 card:
- Chat template: Metharme with Mistral system tokens — `[SYSTEM_PROMPT] <|system|>{{system_message}}[/SYSTEM_PROMPT]<|user|>{{user_message}}<|model|>{{assistant_message}}`
- Drummer's KoboldCpp-Frankensampling settings are a good baseline; specifically temp 0.95–1.05, min-p 0.025, and smoothing factor ~0.2 work well for the "chaos edition" variant
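With a vLLM server like the one above, those settings map onto an OpenAI-compatible request body along these lines (a sketch: `min_p` is a vLLM sampling extension, and KoboldCpp's smoothing factor has no direct vLLM equivalent, so it is omitted):

```json
{
  "model": "Behemoth-X-123B-v2.2-NVFP4",
  "temperature": 1.0,
  "min_p": 0.025,
  "messages": [
    {"role": "system", "content": "Respond in character."},
    {"role": "user", "content": "Hello!"}
  ]
}
```

vLLM applies the model's chat template server-side, so the Metharme special tokens above never appear in the request body itself.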
Files in this repository
- `model-NNNNN-of-00014.safetensors` — 14 shards, NVFP4-packed weights + scales (~66 GB total)
- `model.safetensors.index.json` — weight map (~2 643 keys: 88 layers × 7 quant linears × 4 keys each + norms + embed + lm_head + injected input_scale keys)
- `config.json` — Mistral config with `quantization_config.ignore=["lm_head"]` and `input_activations.dynamic: true`
- `hf_quant_config.json`, `generation_config.json` — auxiliary modelopt + generation configs
- `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json`, `special_tokens_map.json` — Mistral tokenizer, untouched from upstream
Recent fixes baked into the conversion
modelopt 0.43's NVFP4 export needs six gotchas worked around before vLLM will serve the output without producing garbage:
1. Phase-6 requires `vocab_size=2` (not 1) on the per-layer template because modelopt's internal `llm_dummy_forward` feeds `torch.ones([1, 2])` into the embedding.
2. Phase-6 requires `pad_token_id=None` (and likewise `bos`/`eos`) on the template config — modelopt's pad-eos consistency check asserts otherwise.
3. Phase-6 must NOT clear `_calibrator` on quantized modules; modelopt's `set_quantizer_by_cfg_context.__exit__` AttributeErrors on None.
4. Per-actor exports omit `input_scale` keys; vLLM registers an uninitialized Parameter and produces garbage decoding unless `input_scale=1.0` is injected for every `.weight_scale_2` key.
5. `config.json` after merge needs `input_activations.dynamic: true` (modelopt writes `false` but emits no static scale — vLLM falls back to a default that doesn't match the quantization).
6. Merged config must restore `num_hidden_layers`, `vocab_size`, and pad/bos/eos token IDs from the source model (Phase-6 used shrunken dummies).
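The input_scale injection described above amounts to adding one constant per quantized linear; a minimal sketch (plain Python floats stand in for the real safetensors tensors, and the helper name is illustrative, not the pipeline's):

```python
# Sketch of the input_scale injection: for every quantized linear that has
# a ".weight_scale_2" key, add a matching ".input_scale" key set to 1.0 so
# vLLM's loader doesn't leave that Parameter uninitialized.

def inject_input_scales(state):
    for key in list(state):
        if key.endswith(".weight_scale_2"):
            state[key.replace(".weight_scale_2", ".input_scale")] = 1.0
    return state

sd = {"model.layers.0.self_attn.q_proj.weight_scale_2": 0.031}
inject_input_scales(sd)
print(sorted(sd))
# ['model.layers.0.self_attn.q_proj.input_scale',
#  'model.layers.0.self_attn.q_proj.weight_scale_2']
```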
For N-shard mode (this Behemoth release used 3 shards), three additional fixes:
1. Phase-5.5 disk eviction must use `cloudpickle` as `pickle_module` for `torch.save` — modelopt 0.43's `QuantLinear` is a dynamically-generated subclass that vanilla pickle can't serialize.
2. Phase-6 loads the saved layer into a 1-layer template — must force `cfg_t.use_cache=False` and explicitly reset `layer.self_attn.layer_idx = 0` to avoid an IndexError in transformers' DynamicCache (the loaded layer retains its original layer_idx from the full model, e.g. 50, which the 1-slot cache of the template doesn't have).
3. Phase-6.5 (new) — auto-rsync per-shard exports from any non-NFS-shared actor (e.g. the eGPU host VM) back to the driver before merge.
All nine fixes are applied automatically by the pipeline at github.com/KaletoAI/distrib-nvfp4.
Acknowledgments
- TheDrummer for the original Behemoth-X-123B-v2.2 finetune
- Avarok-Cybersecurity (tbraun96) for the MARLIN-backend NVFP4 GEMM port that delivers the +12 % short-context speedup measured above
- saricles for setting the bar on GB10-tuned NVFP4 recipes — this release uses the stock `NVFP4_DEFAULT_CFG` with only `lm_head` in the ignore list, NOT the `agentic-mix-tuned-GB10` recipe; a future v2 might apply that
- NVIDIA for the DGX Spark / GB10 platform, the NVFP4 format, and modelopt
- vLLM project for compressed-tensors NVFP4 inference support
License
This NVFP4 quantization inherits the Mistral Research License (MRL) from the base model TheDrummer/Behemoth-X-123B-v2.2, which is itself derived from Mistral-Large-2411. For research, evaluation, and personal non-commercial use only. For commercial deployment, obtain a Mistral commercial license.
- Full Mistral Research License text: https://mistral.ai/licenses/MRL-0.1.md
- Pipeline code (Apache 2.0): https://github.com/KaletoAI/distrib-nvfp4
Status
Single-author release; first public NVFP4 of a 123B Mistral-Large derivative; first model produced with a heterogeneous 3-node pipeline. Issues and feedback welcome — both on the model artifact (vLLM behaviour, sampling, RP-quality reports) and on the pipeline that built it.