DeepSeek-R1-Distill-Llama-70B-NVFP4

NVFP4 quantized version of DeepSeek-R1-Distill-Llama-70B using custom Blackwell NVFP4 GEMM kernels.

140 GB → 40 GB (0.29x) with vision tower excluded.

NVFP4 Quantization Details

Property	Value
Base model	deepseek-ai/DeepSeek-R1-Distill-Llama-70B
Quantization	NVFP4 (W4A4 — weights FP4 E2M1, activations FP4, scales FP8 E4M3)
Format	`compressed-tensors` (native vLLM support)
Tool	vllm-project/llm-compressor v0.10.0.2
Calibration	128 samples, `ultrachat-200k` (train_sft split), max_seq_length 2048
Size	40 GB (single safetensors shard set)
Requires	NVIDIA Blackwell GPU (SM 120), vLLM >= 0.19

Recipe

QuantizationModifier:
  targets: [Linear]
  ignore: [lm_head]
  scheme: NVFP4

Usage

vLLM

vllm serve PiehSoft/DeepSeek-R1-Distill-Llama-70B-NVFP4 \
  --host 0.0.0.0 \
  --port 8081 \
  --max-model-len 8192

Python

from vllm import LLM, SamplingParams

llm = LLM(model="PiehSoft/DeepSeek-R1-Distill-Llama-70B-NVFP4")
output = llm.generate("What is the meaning of life?", SamplingParams(max_tokens=256))
print(output[0].outputs[0].text)

Benchmarks

Tested on RTX PRO 6000 Blackwell 96GB:

Backend	Generation tok/s	Prompt tok/s
vLLM 0.19.0	25.0	176.3
llama.cpp (GGUF variant)	33.6	196.5

GGUF Version

A GGUF version of this model is available at PiehSoft/DeepSeek-R1-Distill-Llama-70B-NVFP4-GGUF for use with llama.cpp.

Credits

Quantized by PiehSoft (William Pieh) on NVIDIA RTX PRO 6000 Blackwell 96GB.

Downloads last month: 19

Safetensors

Model size

41B params

Tensor type

F32

BF16

F8_E4M3

Model tree for PiehSoft/DeepSeek-R1-Distill-Llama-70B-NVFP4

Base model

deepseek-ai/DeepSeek-R1-Distill-Llama-70B

Quantized

(65)

this model