Serving DeepSeek-v4-Fable on RTX PRO 6000 (SM120): checkpoint is BF16 but config declares fp8; compressor fused_wkv_wgate scale KeyError
#9
by dradra0 - opened
Thanks for releasing DeepSeek-v4-Fable! I'm trying to serve it on 8× RTX PRO 6000 (SM120) and hit a checkpoint-format question I can't resolve.
Observations
- The published safetensors are all BF16 (`merge_info.json`: `output_dtype: torch.bfloat16`), but `config.json` still carries a `quantization_config` (`fp8`, `e4m3`, block `[128,128]`, `scale_fmt: ue8m0`) inherited from the base. Loaders read that and try to load the BF16 weights as FP8 → storage/shape errors (e.g. `setStorage ... out of bounds`).
- Removing the stale `quantization_config` lets it load as BF16, but the SM120 DeepSeek-V4 kernels (vllm-ds4-sm120, b12x) are FP8-only, so the BF16 forward fails (`ColumnParallelLinear has no attribute 'weight_scale_inv'`).
- So I quantized it offline to FP8 block, matching `sgl-project/DeepSeek-V4-Flash-FP8`'s layout: per-expert `experts.N.wN.weight` (F8_E4M3) + `.scale` (F32, `[rows/128, cols/128]`); attn `wkv`/`wq_a`/`wq_b`/`wo_a`/`wo_b` + `indexer.wq_b` quantized; `compressor.*`/`indexer.compressor.*`/`weights_proj` kept BF16.
- With `ununnilium/vllm-ds4-sm120:20260618` + Triton sparse MLA, experts and the main attention now load fine, but it fails at: `KeyError: 'layers.N.attn.compressor.fused_wkv_wgate.weight_scale_inv'`

The model fuses compressor `wkv` + `wgate` into `fused_wkv_wgate` and expects a fused block-FP8 scale. Whether I quantize the compressor or leave it BF16, the param isn't in `params_dict`. Oddly, `sgl-project/DeepSeek-V4-Flash-FP8` also stores `compressor.wkv`/`wgate` as separate BF16 (no scale, no fused tensor), yet is reported to serve, so I'm clearly missing how the CSA compressor is meant to be quantized/registered.
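For anyone who wants to reproduce the first observation without loading any weights, here is a minimal sketch that reads only the JSON header of a `.safetensors` shard and compares the stored dtypes with what `config.json` declares. It relies on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header). The `quant_method` key name is an assumption carried over from earlier DeepSeek configs, and the helper names are hypothetical.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian uint64 giving the header length in bytes.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def dtype_counts(header):
    """Count tensors per dtype string (e.g. 'BF16', 'F8_E4M3', 'F32')."""
    return Counter(entry["dtype"] for entry in header.values())


def config_disagrees_with_weights(config, header):
    """True if the config declares FP8 quantization but no FP8 tensor is stored.

    Assumption: the declaration lives under quantization_config["quant_method"].
    """
    quant = config.get("quantization_config") or {}
    declares_fp8 = quant.get("quant_method") == "fp8"
    has_fp8 = any(entry["dtype"].startswith("F8") for entry in header.values())
    return declares_fp8 and not has_fp8
```

Running `dtype_counts` over each shard and `config_disagrees_with_weights` against the parsed `config.json` should flag the stale `quantization_config` described above; it says nothing about whether a given loader tolerates the mismatch.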
Questions
- Is there an official FP8 (or otherwise SM120-servable) checkpoint of Fable, or the exact conversion/quantization script you used to produce it?
- Specifically, how should the CSA compressor (`wkv`/`wgate`) be quantized and named so the vLLM DeepSeek-V4 loader's `attn.compressor.fused_wkv_wgate.weight_scale_inv` is satisfied?
- Recommended serving command + image for RTX PRO 6000 (SM120)?
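To make the second question concrete, here is a rough sketch of the bookkeeping involved. It lists which scale keys exist next to each compressor `wkv`/`wgate` weight and computes the scale shape a `[128,128]` block layout would imply. The `fused_scale_shape` helper assumes the fusion is a plain concatenation along the output (row) dimension, which nothing in this thread confirms. All function names are hypothetical, and the input is a key-to-shape mapping such as one built from a safetensors header.

```python
import math

BLOCK = 128  # block size taken from the quantization_config ([128, 128])


def block_scale_shape(rows, cols, block=BLOCK):
    """Shape of a per-block scale tensor for a [rows, cols] weight."""
    return (math.ceil(rows / block), math.ceil(cols / block))


def audit_compressor_keys(tensor_shapes):
    """Report which scale keys accompany each compressor wkv/wgate weight.

    tensor_shapes: dict mapping checkpoint key -> shape list.
    """
    report = {}
    for key, shape in tensor_shapes.items():
        if ".compressor." not in key or not key.endswith(".weight"):
            continue
        stem = key[: -len(".weight")]
        # Only the two projections that get fused; skip norms and other params.
        if not stem.endswith((".wkv", ".wgate")):
            continue
        report[stem] = {
            "shape": tuple(shape),
            "expected_scale_shape": block_scale_shape(*shape),
            "has_scale": f"{stem}.scale" in tensor_shapes,
            "has_weight_scale_inv": f"{stem}.weight_scale_inv" in tensor_shapes,
        }
    return report


def fused_scale_shape(wkv_shape, wgate_shape, block=BLOCK):
    """Scale shape if wkv and wgate were concatenated along rows (assumption)."""
    if wkv_shape[1] != wgate_shape[1]:
        raise ValueError("row-wise fusion needs equal input dimensions")
    if wkv_shape[0] % block:
        # Otherwise one scale block would straddle the wkv/wgate seam.
        raise ValueError("wkv rows must be a multiple of the block size")
    return block_scale_shape(wkv_shape[0] + wgate_shape[0], wkv_shape[1], block)
```

With placeholder shapes of `[512, 4096]` for both projections, each separate scale would be `(4, 32)` and a row-fused scale `(8, 32)`. Whether the loader wants the fused tensor in the checkpoint or builds it from the separate ones at load time is exactly the open question.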
Thanks a lot. Everything else (quant pipeline, image, format) is in place; this compressor param is the only blocker.