DeepSeek-R1-Distill-Llama-70B-NVFP4

NVFP4 quantized version of DeepSeek-R1-Distill-Llama-70B using custom Blackwell NVFP4 GEMM kernels.

140 GB → 40 GB (0.29x) with vision tower excluded.

NVFP4 Quantization Details

Property Value
Base model deepseek-ai/DeepSeek-R1-Distill-Llama-70B
Quantization NVFP4 (W4A4 — weights FP4 E2M1, activations FP4, scales FP8 E4M3)
Format compressed-tensors (native vLLM support)
Tool vllm-project/llm-compressor v0.10.0.2
Calibration 128 samples, ultrachat-200k (train_sft split), max_seq_length 2048
Size 40 GB (single safetensors shard set)
Requires NVIDIA Blackwell GPU (SM 120), vLLM >= 0.19

Recipe

QuantizationModifier:
  targets: [Linear]
  ignore: [lm_head]
  scheme: NVFP4

Usage

vLLM

vllm serve PiehSoft/DeepSeek-R1-Distill-Llama-70B-NVFP4 \
  --host 0.0.0.0 \
  --port 8081 \
  --max-model-len 8192

Python

from vllm import LLM, SamplingParams

llm = LLM(model="PiehSoft/DeepSeek-R1-Distill-Llama-70B-NVFP4")
output = llm.generate("What is the meaning of life?", SamplingParams(max_tokens=256))
print(output[0].outputs[0].text)

Benchmarks

Tested on RTX PRO 6000 Blackwell 96GB:

Backend Generation tok/s Prompt tok/s
vLLM 0.19.0 25.0 176.3
llama.cpp (GGUF variant) 33.6 196.5

GGUF Version

A GGUF version of this model is available at PiehSoft/DeepSeek-R1-Distill-Llama-70B-NVFP4-GGUF for use with llama.cpp.

Credits

Quantized by PiehSoft (William Pieh) on NVIDIA RTX PRO 6000 Blackwell 96GB.

Downloads last month
19
Safetensors
Model size
41B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for PiehSoft/DeepSeek-R1-Distill-Llama-70B-NVFP4

Quantized
(65)
this model