DeepSeek-R1-Distill-Llama-70B — NVFP4 (compressed-tensors)

Built with Llama.

NVFP4 (4-bit floating-point, W4A4, group_size=16) quantization of deepseek-ai/DeepSeek-R1-Distill-Llama-70B, produced via a distributed 2-node pipeline on NVIDIA DGX Spark (GB10) hardware.

To my knowledge this is the first publicly available NVFP4 quantization of DeepSeek-R1-Distill-Llama-70B, the most-downloaded non-RP reasoning model in the 70B class (~4.5 M downloads on the original).


Quick facts

Base model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B (Llama-3.3-70B distilled from R1)
Architecture: LlamaForCausalLM, 80 layers, hidden_size=8192, 64 attn heads, 8 KV heads, head_dim=128
Original size: ~132 GB (BF16)
Quantized size: ~40 GB (see Files tab)
Quant format: NVFP4 via nvidia-modelopt 0.43.0
Storage layout: compressed-tensors (vLLM-native)
lm_head: kept BF16 (unquantized), listed in quantization_config.ignore
KV cache: configurable at serve time (FP8 recommended)
Calibration data: 256 samples from cnn_dailymail, lengths 150–1200 tokens
Conversion date: 2026-05-15

Why this exists

DeepSeek-R1-Distill-Llama-70B is the most-downloaded non-RP reasoning model in the 70B-class (4.5 M downloads on the original), and until now had no public NVFP4 quantization despite being a perfect target — Llama-3.3 architecture, 70B fits cleanly on a single 128 GB UMA DGX Spark in NVFP4 with massive KV-cache headroom for long reasoning chains.

This release closes that gap with a production-quality 256-sample calibration run on a 2-Spark Ray cluster, using the same pipeline that produced Anubis-Pro-105B-NVFP4 and Behemoth-X-123B-v2.2-NVFP4 — open at github.com/KaletoAI/distrib-nvfp4 (Apache 2.0).

For 70B-class models the distributed pipeline is honestly overkill (the model fits on one Spark for quantization too), but it's the same toolchain, so reusing it is free. The benefit: identical workflow, identical fix-list, and identical reproducibility to the larger releases.


Quantization Pipeline (short version)

Two Ray actors own 40 layers each. modelopt's mtq.quantize(wrapper, NVFP4_DEFAULT_CFG, forward_loop=None) inserts the W4A4 quantizers in calibration mode without running its own forward; the driver routes hidden states between actors via Ray RPC for each of 256 calibration samples.
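
A minimal sketch of that driver-routed loop, with dummy layers standing in for the 40 quantizer-instrumented Llama decoder layers each actor holds (illustrative names, not the actual distrib-nvfp4 code):

import ray
import torch
import torch.nn as nn

@ray.remote
class ShardActor:
    """Owns a contiguous slice of decoder layers; in the real pipeline,
    mtq.quantize has already inserted W4A4 quantizers in calib mode."""

    def __init__(self, num_layers: int, hidden_size: int):
        self.layers = nn.ModuleList(
            nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Quantizer observers record activation ranges as a side effect
        # of this pass; no loss or backward step is involved.
        for layer in self.layers:
            hidden = layer(hidden)
        return hidden

ray.init()
shard0 = ShardActor.remote(num_layers=2, hidden_size=64)  # layers 0-39 in the real run
shard1 = ShardActor.remote(num_layers=2, hidden_size=64)  # layers 40-79

for _ in range(4):                       # 256 calibration samples in the real run
    h = torch.randn(1, 8, 64)            # stand-in for embedded calibration tokens
    h = ray.get(shard0.forward.remote(h))  # hidden states hop between actors
    h = ray.get(shard1.forward.remote(h))  # via Ray RPC, driver in the middle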

After finalize, each actor evicts its shard to disk (cloudpickle, because modelopt's QuantLinear classes are dynamically generated), then streams a per-layer NVFP4 export via mte.export_hf_checkpoint on a 1-layer template (with use_cache=False). The driver then merges the per-actor shards, renames shard-1 layer indices with a +40 offset, copies the tokenizer (DeepSeek uses a tiktoken-style BPE; there is no tokenizer.model file), patches config.json to keep lm_head BF16, and injects input_scale=1.0 for every weight quantizer (modelopt 0.43 omits these, but vLLM's loader requires them).
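
The +40 rename is mechanical but easy to get wrong; a hedged sketch of the key remap, assuming the standard HF Llama key layout (remap_key is an illustrative name, not the pipeline's API):

import re

OFFSET = 40  # shard1 exports its 40 layers as 0-39; the merged model needs 40-79

def remap_key(key: str, offset: int = OFFSET) -> str:
    m = re.match(r"model\.layers\.(\d+)\.(.+)", key)
    if m is None:
        return key  # norm / lm_head / embed keys pass through unchanged
    return f"model.layers.{int(m.group(1)) + offset}.{m.group(2)}"

assert remap_key("model.layers.3.self_attn.q_proj.weight_packed") \
    == "model.layers.43.self_attn.q_proj.weight_packed"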

Calibration health on the run that produced this artifact:

  • shard0 (layers 0–39 + embed): good=280, zero=0, nan=0
  • shard1 (layers 40–79 + norm + lm_head): good=280, zero=0, nan=0

(NVFP4_DEFAULT_CFG inserts 7 quantizers per layer for Llama arch.)

Total pipeline time: 25 min on 2× DGX Spark (IB-connected at 10.20.0.x). Load 3 min, calibrate ~15 min, eviction 105 s, export 110 s, merge 25 s.


Performance

A stock-vLLM benchmark will follow as a separate update; the pattern should be consistent with the related Anubis-Pro and Behemoth releases:

Anubis-Pro-105B-NVFP4 (for reference):

  • Stock vLLM: ~3.1 tok/s decode at short context
  • MARLIN+FlashInfer: 3.78 tok/s (+22 %)

DeepSeek-R1-Distill-Llama-70B-NVFP4 (this model):

  • Expected to be faster than both Anubis (105B) and Behemoth (123B) due to smaller size
  • Estimated ~4.5–5.5 tok/s decode on the MARLIN+FlashInfer stack
  • Will measure and update once the model is benched on Spark

For reasoning workloads (long chain-of-thought outputs) on a single Spark, this model is the sweet spot — 70B class, fits with ample KV-cache pool, and the NVFP4 quality preservation at W4A4 retains the R1-distilled reasoning behaviour.


Usage

vLLM (direct)

Recommended on GB10 — the tuned Spark stack with MARLIN GEMM + FlashInfer attention:

VLLM_NVFP4_GEMM_BACKEND=marlin \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
vllm serve /path/to/DeepSeek-R1-Distill-Llama-70B-NVFP4 \
  --served-model-name DeepSeek-R1-Distill-Llama-70B-NVFP4 \
  --attention-backend flashinfer \
  --quantization compressed-tensors \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.80 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --port 9007

--gpu-memory-utilization 0.80 with the ~40 GB NVFP4 weights leaves ~62 GB of KV-cache pool on a 128 GB UMA Spark: enough for 32 K context at max-num-seqs 4 with a healthy chain-of-thought reasoning buffer. Bump to 0.85 if you want more concurrency.

llama-swap entry

"DeepSeek-R1-Distill-Llama-70B-NVFP4":
  proxy: "http://127.0.0.1:9007"
  ttl: 0
  checkEndpoint: "/health"
  env:
    - "VLLM_NVFP4_GEMM_BACKEND=marlin"
    - "VLLM_TEST_FORCE_FP8_MARLIN=1"
    - "VLLM_MARLIN_USE_ATOMIC_ADD=1"
  cmd: >-
    /home/<user>/vllm-env/bin/python3 -m vllm.entrypoints.openai.api_server
    --model /home/<user>/models/DeepSeek-R1-Distill-Llama-70B-NVFP4
    --attention-backend flashinfer
    --served-model-name DeepSeek-R1-Distill-Llama-70B-NVFP4
    --quantization compressed-tensors
    --dtype auto
    --kv-cache-dtype fp8
    --max-model-len 32768
    --max-num-seqs 4
    --gpu-memory-utilization 0.80
    --trust-remote-code
    --enable-chunked-prefill
    --enable-prefix-caching
    --port 9007
    --host 127.0.0.1

Recommended sampling (from DeepSeek's original card)

R1-distilled models perform best with:

  • temperature: 0.6
  • top_p: 0.95
  • Avoid system prompts — DeepSeek-R1 family expects user-first conversation flow
  • For reasoning tasks: let the <think>...</think> block grow uncapped; set max_tokens high (4096+). An example request with these settings follows.
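
Putting those settings together against the server above (OpenAI-compatible endpoint; the model name matches --served-model-name; a minimal sketch, not a full client):

# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9007/v1", api_key="none")
resp = client.chat.completions.create(
    model="DeepSeek-R1-Distill-Llama-70B-NVFP4",
    messages=[
        # user-first, no system prompt, per the DeepSeek recommendations
        {"role": "user", "content": "Prove that sqrt(2) is irrational."},
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=4096,  # leave room for the <think>...</think> block
)
print(resp.choices[0].message.content)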

Files in this repository

  • model-NNNNN-of-00008.safetensors — 8 shards, NVFP4-packed weights + scales (~40 GB total)
  • model.safetensors.index.json — weight map (~2,403 keys: 80 layers × 7 quant linears × 4 keys + norms + embed + lm_head + injected input_scale)
  • config.json — Llama config with quantization_config.ignore=["lm_head"] and input_activations.dynamic: true
  • hf_quant_config.json, generation_config.json — auxiliary configs
  • tokenizer.json, tokenizer_config.json — DeepSeek tokenizer (tiktoken BPE; no tokenizer.model file)
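
A quick sanity check on the merged config (a hedged sketch; the key paths assume vLLM's compressed-tensors layout as described above):

import json

with open("config.json") as f:
    qc = json.load(f)["quantization_config"]

assert "lm_head" in qc["ignore"]                      # lm_head stays BF16
group = next(iter(qc["config_groups"].values()))      # e.g. "group_0"
assert group["input_activations"]["dynamic"] is True  # fix 5 below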

Recent fixes baked into the conversion

modelopt 0.43's NVFP4 export has six gotchas that must be worked around before vLLM will serve the output without producing garbage. All are applied automatically by the pipeline; fix 4 is sketched after the list:

  1. Phase-6 1-layer template needs vocab_size=2 (not 1) because modelopt's llm_dummy_forward feeds torch.ones([1, 2]).
  2. Phase-6 template needs pad_token_id, bos_token_id, and eos_token_id set to None; otherwise a pad-eos consistency assertion fires.
  3. Phase-6 must NOT clear _calibrator on quantized modules.
  4. Per-actor exports omit input_scale keys; vLLM produces garbage decoding unless input_scale=1.0 is injected per .weight_scale_2 key.
  5. Merged config.json needs input_activations.dynamic: true (modelopt writes false but emits no static scale).
  6. Merged config must restore num_hidden_layers, vocab_size, pad/bos/eos token IDs from source.

(Plus three N-shard-specific fixes for the 3-shard Behemoth release — not exercised here since DeepSeek is 2-shard.)
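
For anyone reproducing fix 4 by hand, a minimal sketch of the injection step (an illustrative sketch, processing one shard at a time; the real pipeline does this during the merge pass):

import torch
from safetensors.torch import load_file, save_file

def inject_input_scales(shard_path: str) -> None:
    tensors = load_file(shard_path)
    for key in list(tensors):
        if key.endswith(".weight_scale_2"):
            scale_key = key.replace(".weight_scale_2", ".input_scale")
            if scale_key not in tensors:
                # modelopt 0.43 omits these; vLLM's compressed-tensors
                # loader requires them even with dynamic activation quant.
                tensors[scale_key] = torch.tensor(1.0, dtype=torch.float32)
    save_file(tensors, shard_path)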


Acknowledgments

  • DeepSeek-AI for the original R1-Distill-Llama-70B
  • Avarok-Cybersecurity (tbraun96) for the MARLIN-backend NVFP4 GEMM port — drives the ~+22 % decode speedup on Spark
  • entrpi / antirez for the parallel hybrid-quant work on the MoE side of the Spark ecosystem (DeepSeek-V4-Flash) — different recipe, same Spark constraints
  • saricles for setting the bar on GB10-tuned NVFP4 calibration recipes
  • NVIDIA for the DGX Spark / GB10 platform, the NVFP4 format, and modelopt
  • vLLM project for compressed-tensors NVFP4 inference support

License

MIT, inherited from deepseek-ai/DeepSeek-R1-Distill-Llama-70B. Pipeline code under Apache 2.0 at github.com/KaletoAI/distrib-nvfp4.


Status

Single-author release. Issues + feedback welcome — both on the model artifact and on the pipeline that built it.
