Instructions to use Chunjiang-Intelligence/DeepSeek-v4-Fable with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Chunjiang-Intelligence/DeepSeek-v4-Fable with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Chunjiang-Intelligence/DeepSeek-v4-Fable")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Chunjiang-Intelligence/DeepSeek-v4-Fable")
model = AutoModelForCausalLM.from_pretrained("Chunjiang-Intelligence/DeepSeek-v4-Fable")
```

vLLM

How to use Chunjiang-Intelligence/DeepSeek-v4-Fable with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Chunjiang-Intelligence/DeepSeek-v4-Fable"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Chunjiang-Intelligence/DeepSeek-v4-Fable",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

Use Docker

```
docker model run hf.co/Chunjiang-Intelligence/DeepSeek-v4-Fable
```

SGLang

How to use Chunjiang-Intelligence/DeepSeek-v4-Fable with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Chunjiang-Intelligence/DeepSeek-v4-Fable" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Chunjiang-Intelligence/DeepSeek-v4-Fable",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

Use Docker images

```
# Start the SGLang server in a container:
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Chunjiang-Intelligence/DeepSeek-v4-Fable" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Chunjiang-Intelligence/DeepSeek-v4-Fable",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```

Docker Model Runner
How to use Chunjiang-Intelligence/DeepSeek-v4-Fable with Docker Model Runner:
```
docker model run hf.co/Chunjiang-Intelligence/DeepSeek-v4-Fable
```

Serving DeepSeek-v4-Fable on RTX PRO 6000 (SM120): checkpoint is BF16 but config declares fp8; compressor fused_wkv_wgate scale KeyError

by dradra0 - opened 3 days ago

Thanks for releasing DeepSeek-v4-Fable! I'm trying to serve it on 8× RTX PRO 6000 (SM120) and hit a checkpoint-format question I can't resolve.

Observations

The published safetensors are all BF16 (merge_info.json: output_dtype: torch.bfloat16), but config.json still carries a quantization_config (fp8, e4m3, block [128,128], scale_fmt: ue8m0) inherited from the base. Loaders read that and try to load the BF16 weights as FP8 → storage/shape errors (e.g. setStorage ... out of bounds).
Removing the stale quantization_config lets it load as BF16, but the SM120 DeepSeek-V4 kernels (vllm-ds4-sm120, b12x) are FP8-only, so the BF16 forward fails (ColumnParallelLinear has no attribute 'weight_scale_inv').
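The dtype mismatch described in the first two observations can be confirmed without loading any weights. What follows is a minimal sketch, assuming the standard safetensors layout (an 8-byte little-endian header length followed by a JSON header) and that config.json parses to a plain dict. `shard_dtypes` and `drop_stale_quant_config` are hypothetical helper names, not part of any library.

```python
import json
import struct
from collections import Counter


def shard_dtypes(path):
    """Count the tensor dtypes declared in one .safetensors shard's header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(v["dtype"] for k, v in header.items() if k != "__metadata__")


def drop_stale_quant_config(config, dtypes):
    """Return a copy of config without quantization_config when no tensor
    is stored as FP8, i.e. when the declaration does not match the weights."""
    has_fp8 = any(name.startswith("F8_") for name in dtypes)
    if has_fp8 or "quantization_config" not in config:
        return dict(config)
    return {k: v for k, v in config.items() if k != "quantization_config"}
```

Summing `shard_dtypes` over every shard and passing the result together with the parsed config.json to `drop_stale_quant_config` yields a config that loads as BF16. As noted above, this only fixes loading; it does not help kernels that are FP8-only.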
So I quantized it offline to FP8 block, matching sgl-project/DeepSeek-V4-Flash-FP8's layout: per-expert experts.N.wN.weight (F8_E4M3) + .scale (F32, [rows/128, cols/128]); attn wkv/wq_a/wq_b/wo_a/wo_b + indexer.wq_b quantized; compressor.* / indexer.compressor.* / weights_proj kept BF16.
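To make the scale layout concrete, here is a sketch of how a per-block scale grid of shape [rows/128, cols/128] can be derived. It is pure Python on nested lists for illustration only; a real conversion would run on tensors and also cast the scaled weights to F8_E4M3. Two assumptions are baked in: 448 is the largest finite E4M3 value, and `scale_fmt: ue8m0` is read as "round each scale up to a power of two". `block_scales` is a hypothetical helper, not the script used for this checkpoint.

```python
import math

E4M3_MAX = 448.0  # largest finite value in float8 e4m3 (fn variant)


def block_scales(weight, block=128, power_of_two=True):
    """Per-block dequantization scales for a 2-D weight given as a list of rows.

    Returns a [ceil(rows/block)][ceil(cols/block)] grid where
    scale = amax(block) / E4M3_MAX, optionally rounded up to a power of two.
    The quantized weight would then be weight / scale, cast to FP8.
    """
    rows, cols = len(weight), len(weight[0])
    grid = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            amax = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            scale = max(amax, 1e-12) / E4M3_MAX  # guard all-zero blocks
            if power_of_two:
                scale = 2.0 ** math.ceil(math.log2(scale))
            row_scales.append(scale)
        grid.append(row_scales)
    return grid
```

Rounding up rather than to nearest keeps every scaled value within the representable range, at the cost of up to one bit of precision in that block.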
With ununnilium/vllm-ds4-sm120:20260618 + Triton sparse MLA, experts and the main attention now load fine, but it fails at:
KeyError: 'layers.N.attn.compressor.fused_wkv_wgate.weight_scale_inv'
The model fuses compressor wkv+wgate into fused_wkv_wgate and expects a fused block-FP8 scale. Whether I quantize the compressor or leave it BF16, the param isn't in params_dict. Oddly sgl-project/DeepSeek-V4-Flash-FP8 also stores compressor.wkv/wgate as separate BF16 (no scale, no fused tensor), yet is reported to serve — so I'm clearly missing how the CSA compressor is meant to be quantized/registered.
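One way to narrow this down is to compare the names the checkpoint actually contains against the name in the KeyError. The sketch below collapses layer and expert indices to `N` and counts each distinct pattern among the compressor tensors. It assumes the tensor names are available as an iterable of strings, for example the keys of `weight_map` in a sharded checkpoint's `model.safetensors.index.json` if the repository ships one. `key_patterns` is a hypothetical helper and says nothing about how the loader registers the fused parameter.

```python
import re
from collections import Counter


def key_patterns(names, needle="compressor"):
    """Collapse numeric path components to 'N' and count each distinct
    pattern among the tensor names that contain `needle`."""
    patterns = Counter()
    for name in names:
        if needle in name:
            # Replace digits that form a whole dot-separated component.
            patterns[re.sub(r"(?<=\.)\d+(?=\.)", "N", name)] += 1
    return patterns
```

Checking whether `layers.N.attn.compressor.fused_wkv_wgate.weight_scale_inv` appears in the result, and running the same helper on the other FP8 checkpoint, shows whether the two layouts really differ only in the missing fused scale.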

Questions

Is there an official FP8 (or otherwise SM120-servable) checkpoint of Fable, or the exact conversion/quantization script you used to produce it?
Specifically, how should the CSA compressor (wkv/wgate) be quantized and named so the vLLM DeepSeek-V4 loader's attn.compressor.fused_wkv_wgate.weight_scale_inv is satisfied?
Recommended serving command + image for RTX PRO 6000 (SM120)?

Thanks a lot — everything else (quant pipeline, image, format) is in place; this compressor param is the only blocker.
