DeepSeek-R1-Distill-Llama-70B — NVFP4 (compressed-tensors)

Built with Llama.

NVFP4 (4-bit floating-point, W4A4, group_size=16) quantization of deepseek-ai/DeepSeek-R1-Distill-Llama-70B, produced via a distributed 2-node pipeline on NVIDIA DGX Spark (GB10) hardware.

To my knowledge this is the first publicly available NVFP4 quantization of DeepSeek-R1-Distill-Llama-70B, the most-downloaded non-RP reasoning model in the 70B class (~4.5 M downloads on the original).


Quick facts

Base model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B (Llama-3.3-70B distilled from R1)
Architecture: LlamaForCausalLM, 80 layers, hidden_size=8192, 64 attn heads, 8 KV heads, head_dim=128
Original size: ~132 GB (BF16)
Quantized size: ~40 GB (see Files tab)
Quant format: NVFP4 via nvidia-modelopt 0.43.0
Storage layout: compressed-tensors (vLLM-native)
lm_head: kept BF16 (unquantized), listed in quantization_config.ignore
KV cache: configurable at serve time (FP8 recommended)
Calibration data: 256 samples from cnn_dailymail, lengths 150–1200 tokens
Conversion date: 2026-05-15

Why this exists

DeepSeek-R1-Distill-Llama-70B is the most-downloaded non-RP reasoning model in the 70B-class (4.5 M downloads on the original), and until now had no public NVFP4 quantization despite being a perfect target — Llama-3.3 architecture, 70B fits cleanly on a single 128 GB UMA DGX Spark in NVFP4 with massive KV-cache headroom for long reasoning chains.

This release closes that gap with a production-quality 256-sample calibration run on a 2-Spark Ray cluster, using the same pipeline that produced Anubis-Pro-105B-NVFP4 and Behemoth-X-123B-v2.2-NVFP4 — open at github.com/KaletoAI/distrib-nvfp4 (Apache 2.0).

For 70B-class models the distributed pipeline is honestly overkill (the model fits on one Spark for quantization too), but it's the same toolchain, so reusing it is free. The benefit: identical workflow, identical fix-list, and identical reproducibility to the larger releases.


Quantization Pipeline (short version)

Two Ray actors own 40 layers each. modelopt's mtq.quantize(wrapper, NVFP4_DEFAULT_CFG, forward_loop=None) inserts the W4A4 quantizers in calibration mode without running its own forward; the driver routes hidden states between actors via Ray RPC for each of 256 calibration samples.
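
A minimal sketch of that driver-routed loop, with dummy layers standing in for the 40 quantizer-instrumented Llama decoder layers each actor holds (illustrative names, not the actual distrib-nvfp4 code):

import ray
import torch
import torch.nn as nn

@ray.remote
class ShardActor:
    """Owns a contiguous slice of decoder layers; in the real pipeline,
    mtq.quantize has already inserted W4A4 quantizers in calib mode."""

    def __init__(self, num_layers: int, hidden_size: int):
        self.layers = nn.ModuleList(
            nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Quantizer observers record activation ranges as a side effect
        # of this pass; no loss or backward step is involved.
        for layer in self.layers:
            hidden = layer(hidden)
        return hidden

ray.init()
shard0 = ShardActor.remote(num_layers=2, hidden_size=64)  # layers 0-39 in the real run
shard1 = ShardActor.remote(num_layers=2, hidden_size=64)  # layers 40-79

for _ in range(4):                       # 256 calibration samples in the real run
    h = torch.randn(1, 8, 64)            # stand-in for embedded calibration tokens
    h = ray.get(shard0.forward.remote(h))  # hidden states hop between actors
    h = ray.get(shard1.forward.remote(h))  # via Ray RPC, driver in the middle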

After finalize, each actor evicts its shard to disk (cloudpickle, because modelopt's QuantLinear classes are dynamically generated), then streams a per-layer NVFP4 export via mte.export_hf_checkpoint on a 1-layer template (with use_cache=False). The driver then merges the per-actor shards, renames shard-1 layer indices with a +40 offset, copies the tokenizer (DeepSeek uses a tiktoken-style BPE; there is no tokenizer.model file), patches config.json to keep lm_head BF16, and injects input_scale=1.0 for every weight quantizer (modelopt 0.43 omits these, but vLLM's loader requires them).
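
The +40 rename is mechanical but easy to get wrong; a hedged sketch of the key remap, assuming the standard HF Llama key layout (remap_key is an illustrative name, not the pipeline's API):

import re

OFFSET = 40  # shard1 exports its 40 layers as 0-39; the merged model needs 40-79

def remap_key(key: str, offset: int = OFFSET) -> str:
    m = re.match(r"model\.layers\.(\d+)\.(.+)", key)
    if m is None:
        return key  # norm / lm_head / embed keys pass through unchanged
    return f"model.layers.{int(m.group(1)) + offset}.{m.group(2)}"

assert remap_key("model.layers.3.self_attn.q_proj.weight_packed") \
    == "model.layers.43.self_attn.q_proj.weight_packed"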

Calibration health on the run that produced this artifact:

  • shard0 (layers 0–39 + embed): good=280, zero=0, nan=0
  • shard1 (layers 40–79 + norm + lm_head): good=280, zero=0, nan=0

(NVFP4_DEFAULT_CFG inserts 7 quantizers per layer for Llama arch.)

Total pipeline time: 25 min on 2× DGX Spark (IB-connected at 10.20.0.x). Load 3 min, calibrate ~15 min, eviction 105 s, export 110 s, merge 25 s.


Performance

A stock-vLLM benchmark will follow as a separate update; the pattern should be consistent with the related Anubis-Pro and Behemoth releases:

Anubis-Pro-105B-NVFP4 (for reference):

  • Stock vLLM: ~3.1 tok/s decode at short context
  • MARLIN+FlashInfer: 3.78 tok/s (+22 %)

DeepSeek-R1-Distill-Llama-70B-NVFP4 (this model):

  • Expected to be faster than both Anubis (105B) and Behemoth (123B) due to smaller size
  • Estimated ~4.5–5.5 tok/s decode on the MARLIN+FlashInfer stack
  • Will measure and update once the model is benched on Spark

For reasoning workloads (long chain-of-thought outputs) on a single Spark, this model is the sweet spot — 70B class, fits with ample KV-cache pool, and the NVFP4 quality preservation at W4A4 retains the R1-distilled reasoning behaviour.


Usage

vLLM (direct)

Recommended on GB10 — the tuned Spark stack with MARLIN GEMM + FlashInfer attention:

VLLM_NVFP4_GEMM_BACKEND=marlin \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
vllm serve /path/to/DeepSeek-R1-Distill-Llama-70B-NVFP4 \
  --served-model-name DeepSeek-R1-Distill-Llama-70B-NVFP4 \
  --attention-backend flashinfer \
  --quantization compressed-tensors \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.80 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --port 9007

--gpu-memory-utilization 0.80 with the ~40 GB NVFP4 weights leaves ~62 GB of KV-cache pool on a 128 GB UMA Spark: enough for 32 K context at max-num-seqs 4 with a healthy chain-of-thought reasoning buffer. Bump to 0.85 if you want more concurrency.

llama-swap entry

"DeepSeek-R1-Distill-Llama-70B-NVFP4":
  proxy: "http://127.0.0.1:9007"
  ttl: 0
  checkEndpoint: "/health"
  env:
    - "VLLM_NVFP4_GEMM_BACKEND=marlin"
    - "VLLM_TEST_FORCE_FP8_MARLIN=1"
    - "VLLM_MARLIN_USE_ATOMIC_ADD=1"
  cmd: >-
    /home/<user>/vllm-env/bin/python3 -m vllm.entrypoints.openai.api_server
    --model /home/<user>/models/DeepSeek-R1-Distill-Llama-70B-NVFP4
    --attention-backend flashinfer
    --served-model-name DeepSeek-R1-Distill-Llama-70B-NVFP4
    --quantization compressed-tensors
    --dtype auto
    --kv-cache-dtype fp8
    --max-model-len 32768
    --max-num-seqs 4
    --gpu-memory-utilization 0.80
    --trust-remote-code
    --enable-chunked-prefill
    --enable-prefix-caching
    --port 9007
    --host 127.0.0.1

Recommended sampling (from DeepSeek's original card)

R1-distilled models perform best with:

  • temperature: 0.6
  • top_p: 0.95
  • Avoid system prompts — DeepSeek-R1 family expects user-first conversation flow
  • For reasoning tasks: let the <think>...</think> block grow uncapped; set max_tokens high (4096+). An example request with these settings follows.
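
Putting those settings together against the server above (OpenAI-compatible endpoint; the model name matches --served-model-name; a minimal sketch, not a full client):

# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9007/v1", api_key="none")
resp = client.chat.completions.create(
    model="DeepSeek-R1-Distill-Llama-70B-NVFP4",
    messages=[
        # user-first, no system prompt, per the DeepSeek recommendations
        {"role": "user", "content": "Prove that sqrt(2) is irrational."},
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=4096,  # leave room for the <think>...</think> block
)
print(resp.choices[0].message.content)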

Files in this repository

  • model-NNNNN-of-00008.safetensors — 8 shards, NVFP4-packed weights + scales (~40 GB total)
  • model.safetensors.index.json — weight map (~2,403 keys: 80 layers × 7 quant linears × 4 keys + norms + embed + lm_head + injected input_scale)
  • config.json — Llama config with quantization_config.ignore=["lm_head"] and input_activations.dynamic: true
  • hf_quant_config.json, generation_config.json — auxiliary configs
  • tokenizer.json, tokenizer_config.json — DeepSeek tokenizer (tiktoken BPE; no tokenizer.model file)
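
A quick sanity check on the merged config (a hedged sketch; the key paths assume vLLM's compressed-tensors layout as described above):

import json

with open("config.json") as f:
    qc = json.load(f)["quantization_config"]

assert "lm_head" in qc["ignore"]                      # lm_head stays BF16
group = next(iter(qc["config_groups"].values()))      # e.g. "group_0"
assert group["input_activations"]["dynamic"] is True  # fix 5 below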

Recent fixes baked into the conversion

modelopt 0.43's NVFP4 export has six gotchas that must be worked around before vLLM will serve the output without producing garbage. All are applied automatically by the pipeline; fix 4 is sketched after the list:

  1. Phase-6 1-layer template needs vocab_size=2 (not 1) because modelopt's llm_dummy_forward feeds torch.ones([1, 2]).
  2. Phase-6 template needs pad_token_id, bos_token_id, and eos_token_id set to None; otherwise a pad-eos consistency assertion fires.
  3. Phase-6 must NOT clear _calibrator on quantized modules.
  4. Per-actor exports omit input_scale keys; vLLM produces garbage decoding unless input_scale=1.0 is injected per .weight_scale_2 key.
  5. Merged config.json needs input_activations.dynamic: true (modelopt writes false but emits no static scale).
  6. Merged config must restore num_hidden_layers, vocab_size, pad/bos/eos token IDs from source.

(Plus three N-shard-specific fixes for the 3-shard Behemoth release — not exercised here since DeepSeek is 2-shard.)
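
For anyone reproducing fix 4 by hand, a minimal sketch of the injection step (an illustrative sketch, processing one shard at a time; the real pipeline does this during the merge pass):

import torch
from safetensors.torch import load_file, save_file

def inject_input_scales(shard_path: str) -> None:
    tensors = load_file(shard_path)
    for key in list(tensors):
        if key.endswith(".weight_scale_2"):
            scale_key = key.replace(".weight_scale_2", ".input_scale")
            if scale_key not in tensors:
                # modelopt 0.43 omits these; vLLM's compressed-tensors
                # loader requires them even with dynamic activation quant.
                tensors[scale_key] = torch.tensor(1.0, dtype=torch.float32)
    save_file(tensors, shard_path)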


Acknowledgments

  • DeepSeek-AI for the original R1-Distill-Llama-70B
  • Avarok-Cybersecurity (tbraun96) for the MARLIN-backend NVFP4 GEMM port — drives the ~+22 % decode speedup on Spark
  • entrpi / antirez for the parallel hybrid-quant work on the MoE side of the Spark ecosystem (DeepSeek-V4-Flash) — different recipe, same Spark constraints
  • saricles for setting the bar on GB10-tuned NVFP4 calibration recipes
  • NVIDIA for the DGX Spark / GB10 platform, the NVFP4 format, and modelopt
  • vLLM project for compressed-tensors NVFP4 inference support

License

MIT, inherited from deepseek-ai/DeepSeek-R1-Distill-Llama-70B. Pipeline code under Apache 2.0 at github.com/KaletoAI/distrib-nvfp4.


Status

Single-author release. Issues + feedback welcome — both on the model artifact and on the pipeline that built it.
