Routed expert tensors are half-width — model fails to load (likely a botched export)
Sharing this after trying to run the model end-to-end, in case it saves someone else the time — and happy to be corrected if I've missed something.
TL;DR — As uploaded, the routed-expert FFN tensors are half their required input width, so they don't form a dimensionally-valid DeepSeek-V4 expert and the model won't load. This also lines up with the 149B-vs-"284B" gap between the weights and the card/citation.
Findings
- All weights are `bf16`. The `quantization_config: {quant_method: fp8}` in `config.json` looks stale/inherited: there are no `weight_scale_inv` (or any fp8/fp4 scale) tensors in the repo.
- Per-expert shapes here are `w1 [2048, 2048]`, `w3 [2048, 2048]`, `w2 [4096, 1024]`.
- For reference, `deepseek-ai/DeepSeek-V4-Flash` stores its NVFP4 experts as `w1.weight_packed: uint8 [2048, 2048]`, which unpacks to `[2048, 4096]` (NVFP4 packs 2 values per byte). So the bf16 experts here carry exactly the packed dimensions. They look like the NVFP4 packed tensors written out as bf16 without being unpacked, leaving the hidden-side input halved (2048 instead of 4096).
- Consequence: gate/up take a 2048-dim input while `hidden_size = 4096`, so the expert block isn't dimensionally consistent. A DeepSeek-V4-aware llama.cpp build rejects it at load:
check_tensor_dims: tensor 'blk.0.ffn_gate_exps.weight' has wrong shape;
expected 4096, 2048, 256, got 2048, 2048, 256
- The parameter count corroborates it: full-width experts ≈ 284B (matching the citation title "…284B Sparse MoE…"), half-width ≈ 149B (matching this repo's reported 149.2B).
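To make the shape argument concrete, here is a rough sketch of the arithmetic, assuming packing runs along the last axis and that the low nibble comes first (both are my assumptions, not something checked against the NVFP4 export code). The helper names are made up for illustration; only the shapes come from the findings above.

```python
def unpacked_shape(packed_shape, values_per_byte=2):
    """Shape after unpacking along the last axis (assumed packing axis)."""
    *lead, last = packed_shape
    return (*lead, last * values_per_byte)


def unpack_nibbles(row: bytes) -> list:
    """Split each byte into two 4-bit codes. Low-nibble-first is an assumption."""
    codes = []
    for b in row:
        codes.append(b & 0x0F)  # low 4 bits
        codes.append(b >> 4)    # high 4 bits
    return codes


def numel(shape):
    """Number of elements in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n


# Per-expert shapes reported for this repo (bf16, one routed expert).
repo_expert = {"w1": (2048, 2048), "w3": (2048, 2048), "w2": (4096, 1024)}

# What the same tensors would look like if treated as packed and unpacked.
full_expert = {name: unpacked_shape(s) for name, s in repo_expert.items()}

half_params = sum(numel(s) for s in repo_expert.values())
full_params = sum(numel(s) for s in full_expert.values())

print(full_expert)                 # w1/w3 become (2048, 4096), w2 (4096, 2048)
print(half_params, full_params)    # per-expert parameter counts
print(half_params / full_params)   # fraction of expert weights present
```

Each routed expert ends up with exactly half its parameters. If everything outside the routed experts is unchanged, the 284B vs 149.2B gap (about 134.8B) would be that missing half, which is roughly what the counts above suggest.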
Also for anyone trying: no tokenizer/modeling files are bundled, so those have to come from the base deepseek-ai/DeepSeek-V4-Flash.
A re-exported, full-width (unpacked) version would likely load fine. Thanks for putting the work out either way.
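For anyone wanting to sanity-check a re-export before attempting a full load, a shard's header can be read without touching the tensor data: a `.safetensors` file starts with an 8-byte little-endian length followed by a JSON header of names, dtypes, and shapes. The sketch below is only illustrative; the `.w1.` / `.w3.` name matching and the `audit_header` helper are hypothetical and may not match the actual tensor names in this repo.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def audit_header(header, hidden_size=4096):
    """Flag gate/up expert tensors whose input width differs from hidden_size.

    The '.w1.' / '.w3.' substrings are assumed naming, not confirmed for this repo.
    """
    narrow = [
        (name, info["shape"])
        for name, info in header.items()
        if (".w1." in name or ".w3." in name) and info["shape"][-1] != hidden_size
    ]
    # Any fp8/fp4 checkpoint would be expected to carry scale tensors.
    scales = [name for name in header if "scale" in name]
    dtypes = sorted({info["dtype"] for info in header.values()})
    return {"narrow": narrow, "scales": scales, "dtypes": dtypes}
```

Calling `audit_header(read_safetensors_header(path))` on a downloaded shard should return an empty `narrow` list for a full-width export. On the current upload I would expect the gate/up experts to show up there with a last dimension of 2048.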
Feedback has been received. Thanks!