DeepSeek V4 Flash 4Expert — Q4_K GGUF
4-bit quantized GGUF of the 4Expert variant of DeepSeek V4 Flash, for use with ds4.
Model Summary
| Property | Value |
|---|---|
| Architecture | DeepSeek V4 Flash (MoE + MLA) |
| top k | 4 |
| Layers | 43 |
| Hidden dim | 4096 |
| Attention heads | 64 (MLA, head_dim=512, kv_head_dim=512) |
| Routed experts | 256 (4 active per token) |
| FFN dim | 2048 |
| Shared experts | 1 |
| Vocab size | 129,280 |
| Max context | 65,536 |
| Quantization | Q4_K (4-bit K-quant) |
| File size | 164 GB (~153 GiB) |
| Source safetensors | cloudyu/DeepSeek-V4-Flash-4Expert |
Independent Evaluation Results
We evaluated this model against the original top_k=6 configuration on HumanEval (code generation).
HumanEval (Pass@1)
| Configuration | Pass@1 | Generation Time |
|---|---|---|
| Top_k=4 (this model) | 95.73% (157/164) | 56.83s |
| Top_k=6 (original) | 95.73% (157/164) | 64.06s |
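The figures in the table can be checked with a few lines of arithmetic. The sketch below recomputes Pass@1 from the raw counts and the relative difference in generation time. It assumes both timings were measured under the same conditions, which the table does not state.

```python
# Recompute the HumanEval figures reported in the table above.
passed, total = 157, 164
time_top4_s, time_top6_s = 56.83, 64.06  # generation times from the table

pass_at_1 = 100.0 * passed / total
time_saved_pct = 100.0 * (time_top6_s - time_top4_s) / time_top6_s

print(f"Pass@1: {pass_at_1:.2f}%")
print(f"top_k=4 generation time is {time_saved_pct:.1f}% lower than top_k=6")
```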
GGUF Evaluation Report — 4Expert Q4_K GGUF BY ds4-eval
Model: cloudyu/DeepSeek-V4-Flash-4Expert-GGUF
Source safetensors: cloudyu/DeepSeek-V4-Flash-4Expert
Date: 2026-06-29
Summary
| Framework | Passed | Total | Pass Rate |
|---|---|---|---|
| AIME 2025 | 20 | 25 | 80% |
| GPQA Diamond | 22 | 25 | 88% |
| SuperGPQA | 22 | 25 | 88% |
| COMPSEC | 16 | 17 | 94.12% |
| TOTAL | 80 | 92 | 87% |
Quantization Strategy
Compiled with deepseek4-quantize using a layer-specific policy:
| Layer type | Quant | Affected tensors |
|---|---|---|
| Routed experts (w1/w2/w3) | Q4_K | blk.*.ffn_{gate,down,up}_exps.weight |
| Attention projections | Q8_0 | attn_q_a, attn_q_b, attn_kv, attn_output_a, attn_output_b |
| Shared expert FFN | Q8_0 | ffn_{gate,up,down}_shexp.weight |
| Output projection | Q8_0 | output.weight |
| Embedding | F16 | token_embd.weight |
| Attention (other) | F16 | compressor, indexer, sinks, norms |
| Dense (other) | F16 | hyper-connections, remaining 2D weights |
| 1D tensors | F32 | layer norms, RMS norms, scales, biases (never quantized) |
How to Use
Requires ds4 built from the 4Expert PR. Upstream ds4 defaults to 6 active experts and cannot load this GGUF. The PR has been submitted upstream; until it is merged, build from the branch:
git clone https://github.com/yuhai-china/ds4
cd ds4
git checkout 4expert
make cpu -j$(nproc) # Linux
make -C gguf-tools -j$(nproc)
Then run:
ln -sfn DeepSeek-V4-Flash-4Expert-Q4K.gguf ds4flash.gguf
./ds4 -p "The weather is great today" -n 100
Reproduce: Convert Safetensors to This GGUF
This GGUF was produced by the following pipeline. Anyone with the source safetensors can reproduce it.
One-Click Script
git clone https://github.com/yuhai-china/ds4
cd ds4
git checkout 4expert
bash test-4expert.sh /path/to/DeepSeek-V4-Flash-4Expert $(nproc)
This runs all 5 steps (clone, build, download, convert, test) in one go.
Manual Steps
For transparency, here is exactly how this GGUF was produced.
Step 1 — Download source safetensors
pip install huggingface_hub
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('cloudyu/DeepSeek-V4-Flash-4Expert', local_dir='./DeepSeek-V4-Flash-4Expert')
"
Step 2 — Build ds4 and gguf-tools
git clone https://github.com/yuhai-china/ds4
cd ds4
git checkout 4expert
make -C gguf-tools -j$(nproc)
make cpu -j$(nproc)
Step 3 — Generate GGUF template from safetensors metadata
python3 gguf-tools/gen_gguf_template.py \
--hf ./DeepSeek-V4-Flash-4Expert \
--out template.gguf
The template (~5.6 MB) contains metadata, tokenizer, and tensor descriptors (names, shapes, types) but no weight data. It describes where each tensor goes in the final GGUF.
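A quick sanity check on the generated template is to read its fixed-size header. The sketch below assumes the template follows the standard GGUF layout (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key/value count, all little-endian). `read_gguf_header` is a hypothetical helper written for this example, not part of gguf-tools.

```python
import struct

def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        header = f.read(24)  # 4-byte magic + uint32 + 2 x uint64
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 KV count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:])
    return version, tensor_count, kv_count
```

Calling `read_gguf_header("template.gguf")` after Step 3 should return a non-zero tensor count. The helper only inspects the header and does not verify that the file contains no weight data.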
Step 4 — Quantize weights into the final GGUF
./gguf-tools/deepseek4-quantize \
--hf ./DeepSeek-V4-Flash-4Expert \
--template template.gguf \
--out DeepSeek-V4-Flash-4Expert-Q4K.gguf \
--experts q4_k \
--attention-proj q8_0 \
--attention f16 \
--shared q8_0 \
--output q8_0 \
--embedding f16 \
--dense f16 \
--threads $(nproc)
The quantizer reads each safetensors tensor, dequantizes from the storage format (F8_E4M3 or packed FP4 with E8M0 scales for experts, BF16/F32 for others), applies the target quantization, and writes to the output GGUF. Output is ~153 GiB.
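As a rough plausibility check on the output size, the routed experts alone can be estimated from the Model Summary values. This is a sketch under stated assumptions: it supposes that all 43 layers carry 256 routed experts with three 4096 x 2048 matrices each, which this document does not confirm, and it uses the ggml Q4_K cost of 4.5 bits per weight.

```python
# Rough size of the routed-expert tensors at Q4_K, using the Model Summary values.
# Assumption: every one of the 43 layers holds 256 routed experts, each with
# three 4096 x 2048 matrices (w1/w2/w3). The source does not confirm this.
LAYERS, EXPERTS, MATRICES = 43, 256, 3
HIDDEN_DIM, FFN_DIM = 4096, 2048
Q4_K_BITS_PER_WEIGHT = 4.5  # 144-byte super-block per 256 weights in ggml

expert_params = LAYERS * EXPERTS * MATRICES * HIDDEN_DIM * FFN_DIM
expert_bytes = expert_params * Q4_K_BITS_PER_WEIGHT / 8
expert_gib = expert_bytes / 2**30

print(f"{expert_params / 1e9:.1f}B expert parameters, ~{expert_gib:.1f} GiB at Q4_K")
```

Under these assumptions the experts would account for most of the ~153 GiB output, with the remainder going to the Q8_0, F16, and F32 tensors.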
Step 5 — Test the GGUF
ln -sfn DeepSeek-V4-Flash-4Expert-Q4K.gguf ds4flash.gguf
./ds4 -p "The weather is great today" -n 100
Expected output: coherent English text continuation at ~26 t/s (CPU, 20 threads).
Technical Notes
Why Q4_K for experts and F32 for norms?
deepseek4-quantize applies the quantization policy selectively by tensor shape:
- 1D tensors (norms, scales, biases): the policy never overrides the template type. They stay F32 regardless of what --dense or --attention say.
- 2D+ tensors: the policy applies the most specific matching flag:
  - Expert tensors (blk.*.ffn_*_exps.weight) → --experts
  - Attention projections (attn_q_a/b, attn_kv, attn_output_a/b) → --attention-proj
  - Shared expert weights → --shared
  - Output head → --output
  - Token embedding → --embedding
  - Other attention/indexer/compressor → --attention
  - Everything else 2D+ → --dense
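The "most specific matching flag" rule can be pictured as an ordered list of patterns where the first match wins. The following is a minimal Python sketch of that idea, not the code in deepseek4-quantize. The glob patterns and the `select_flag` helper are illustrative guesses based on the tensor names listed in this document.

```python
import fnmatch

# Ordered from most to least specific; the first matching pattern wins.
# The glob patterns are illustrative guesses based on the tensor names listed
# in this document, not the tool's real matching table.
POLICY_RULES = [
    ("blk.*.ffn_*_exps.weight", "--experts"),
    ("blk.*.attn_q_[ab].weight", "--attention-proj"),
    ("blk.*.attn_kv.weight", "--attention-proj"),
    ("blk.*.attn_output_[ab].weight", "--attention-proj"),
    ("blk.*.ffn_*_shexp.weight", "--shared"),
    ("output.weight", "--output"),
    ("token_embd.weight", "--embedding"),
    ("blk.*.attn_*", "--attention"),
]

def select_flag(name, n_dims):
    """Return the policy flag for a tensor, or None if it keeps its template type."""
    if n_dims < 2:
        return None  # 1D tensors (norms, scales, biases) are never overridden
    for pattern, flag in POLICY_RULES:
        if fnmatch.fnmatchcase(name, pattern):
            return flag
    return "--dense"  # everything else that is 2D or higher
```

For example, `select_flag("blk.0.ffn_gate_exps.weight", 3)` returns `"--experts"`, while any 1D tensor returns `None` and keeps the F32 type from the template.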
How the template maps HF names to GGUF names
gen_gguf_template.py uses the same layer_map table as deepseek4-quantize.c. For example:
| HF safetensors name | GGUF name |
|---|---|
| layers.0.attn.wq_a.weight | blk.0.attn_q_a.weight |
| layers.0.attn.wkv.weight | blk.0.attn_kv.weight |
| layers.0.ffn.experts.0.w1.weight | blk.0.ffn_gate_exps.weight (all 256 experts stacked) |
| layers.0.ffn.shared_experts.w1.weight | blk.0.ffn_gate_shexp.weight |
| embed.weight | token_embd.weight |
| norm.weight | output_norm.weight |
The script also automatically converts the ffn.gate.tid2eid routing table from I64 to I32, which is the only non-F32/F16 tensor type override in the template.
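To make the mapping concrete, here is a small Python sketch that reproduces only the rows shown in the table. It is not the layer_map from ds4: `hf_to_gguf_name` and the dictionaries are hypothetical names, and only the w1 to gate expert mapping is included because that is the only one the table gives.

```python
import re

# Per-layer suffix pairs taken from the example table above; the real
# layer_map in ds4 covers more tensors than are shown here.
LAYER_MAP = {
    "attn.wq_a.weight": "attn_q_a.weight",
    "attn.wkv.weight": "attn_kv.weight",
    "ffn.shared_experts.w1.weight": "ffn_gate_shexp.weight",
}
GLOBAL_MAP = {
    "embed.weight": "token_embd.weight",
    "norm.weight": "output_norm.weight",
}
# Only w1 -> gate is given in the table, so only that mapping is included.
EXPERT_MAP = {"w1": "gate"}

def hf_to_gguf_name(hf_name):
    """Map an HF safetensors tensor name to its GGUF name (illustrative subset)."""
    if hf_name in GLOBAL_MAP:
        return GLOBAL_MAP[hf_name]
    m = re.fullmatch(r"layers\.(\d+)\.(.+)", hf_name)
    if not m:
        raise KeyError(hf_name)
    layer, rest = m.group(1), m.group(2)
    # All experts of a layer are stacked into one GGUF tensor, so the
    # expert index is dropped from the name.
    e = re.fullmatch(r"ffn\.experts\.\d+\.(w\d)\.weight", rest)
    if e and e.group(1) in EXPERT_MAP:
        return f"blk.{layer}.ffn_{EXPERT_MAP[e.group(1)]}_exps.weight"
    if rest in LAYER_MAP:
        return f"blk.{layer}.{LAYER_MAP[rest]}"
    raise KeyError(hf_name)
```

Note that every expert index in a layer maps to the same GGUF name, which reflects the stacking of all 256 experts into one tensor.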
4Expert vs 6Expert: What Changed in ds4
The upstream ds4 hardcodes 6 active routed experts per token (n_expert_used = 6). For this 4Expert model to work:
- Default changed to 4: DS4_SHAPE_FLASH.n_expert_used and g_ds4_shape.n_expert_used now default to 4.
- Backward compatible: when loading a GGUF with n_expert_used = 6 in its metadata, ds4 preserves 6 at runtime. Old 6-expert GGUF files continue to work.
- Template generator: gen_gguf_template.py handles the full tensor mapping, replacing manual template construction.
Full details: PR #474
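The backward-compatibility behavior described above amounts to "metadata wins, otherwise use the default". ds4 itself is written in C, so the snippet below is only a Python sketch of that rule. The dictionary key `n_expert_used` is used for illustration, since the actual GGUF metadata key name is not given in this document.

```python
DEFAULT_N_EXPERT_USED = 4  # default described above for the 4expert branch

def resolve_n_expert_used(metadata):
    """Prefer the value stored in GGUF metadata; fall back to the default."""
    # "n_expert_used" is used as the key here for illustration; the actual
    # GGUF metadata key name is not given in this document.
    value = metadata.get("n_expert_used")
    return DEFAULT_N_EXPERT_USED if value is None else int(value)
```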
GGUF Evaluation Report — DeepSeek V4 Flash 4Expert Q4_K GGUF BY ds4-eval
Model: cloudyu/DeepSeek-V4-Flash-4Expert-GGUF
Source safetensors: cloudyu/DeepSeek-V4-Flash-4Expert
Date: 2026-06-29
Summary
| Framework | Passed | Total | Pass Rate |
|---|---|---|---|
| AIME 2025 | 20 | 25 | 80% |
| GPQA Diamond | 22 | 25 | 88% |
| SuperGPQA | 22 | 25 | 88% |
| COMPSEC | 16 | 17 | 94.12% |
| TOTAL | 80 | 92 | 87% |
80 of 92 tests passed. The 12 failures are detailed below.
Evaluation Methodology
All tests were run using ds4-eval, the built-in evaluation tool shipped with the ds4 inference engine. Each test case consists of a prompt and a set of valid ground-truth answers (e.g., A, B, C, D for multiple choice; integer answers for AIME; ranges or enumerations for COMPSEC).
The evaluator feeds the prompt to the model, reads the generated completion, and extracts the final answer using framework-specific parsers. A test passes if the extracted answer matches any of the valid ground-truth values.
Evaluation Frameworks
- AIME 2025 (25 tests): American Invitational Mathematics Examination. Integer answers (0–999). Tests mathematical reasoning.
- GPQA Diamond (25 tests): Graduate-level multiple-choice science questions. Options A–D. Tests deep domain knowledge.
- SuperGPQA (25 tests): Expanded graduate-level multiple choice. Options A–J. Broader and harder than GPQA.
- COMPSEC (17 tests): Computer security questions. Answers are integer codes or ranges (e.g., 5, 10-15, or 3,13-15). Tests specialized security knowledge.
Scoring Rules
- AIME: exact integer match.
- GPQA / SuperGPQA: exact option letter match (A–D for GPQA Diamond, A–J for SuperGPQA).
- COMPSEC: answer must fall within one of the accepted integer values or ranges.
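The scoring rules above can be sketched in a few lines of Python. This is not the ds4-eval implementation; the function names are hypothetical. One assumption is made for COMPSEC: when the model gives several values (for example 18,19,20), every value must fall inside an accepted value or range. The document does not say whether one match would be enough, and both readings agree with the results reported here.

```python
def parse_accepted(spec):
    """Parse an accepted-answer spec such as "3,13-15" into inclusive ranges."""
    ranges = []
    for part in spec.split(","):
        lo, _, hi = part.strip().partition("-")
        ranges.append((int(lo), int(hi) if hi else int(lo)))
    return ranges

def compsec_passes(given, accepted_spec):
    """True if every value in `given` lies inside an accepted value or range."""
    ranges = parse_accepted(accepted_spec)
    values = [int(v) for v in given.split(",")]
    return all(any(lo <= v <= hi for lo, hi in ranges) for v in values)

def aime_passes(given, correct):
    """AIME answers are compared as integers."""
    return int(given) == int(correct)

def letter_passes(given, correct):
    """GPQA and SuperGPQA answers are compared as option letters."""
    return given.strip().upper() == correct.strip().upper()
```

For instance, `compsec_passes("20", "17-20")` is true and `compsec_passes("0", "18-19")` is false, matching the PASSED and FAILED outcomes for those two answer pairs in the COMPSEC results.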
Hardware & Build
| Component | Detail |
|---|---|
| Device | Apple M2 Ultra |
| RAM | 192 GiB unified memory |
| Backend | Metal (ds4 GPU backend) |
| Operating system | macOS |
Build Configuration
git clone https://github.com/yuhai-china/ds4
cd ds4
git checkout 4expert
make -j$(sysctl -n hw.ncpu)
No flags passed — standard release build (-O3 -ffast-math -mcpu=native).
Runtime Configuration
GGUF loaded via memory-mapped I/O. Key runtime parameters from ds4-eval output:
ds4: Metal device Apple M2 Ultra, 192.00 GiB RAM
ds4: Metal 4 tensor API disabled for pre-M5/pre-A19 devices
ds4: drift-patch flags hc_stable=on norm_unify=on kv_raw_f32=off rope_exp2_log2=off
ds4-eval: context auto-sized to 16777 tokens
ds4-eval: context buffers 630.30 MiB
ds4-eval: model shape DeepSeek V4 Flash
No environment variables or config overrides were set beyond the default.
Detailed Results
AIME 2025 (20/25, 80.0%)
| # | Test | Given | Correct | Result | Note |
|---|---|---|---|---|---|
| 3 | aime2025-01 | 70 | 70 | PASSED | |
| 6 | aime2025-16 | 468 | 468 | PASSED | |
| 9 | aime2025-02 | 588 | 588 | PASSED | |
| 12 | aime2025-03 | 16 | 16 | PASSED | |
| 15 | aime2025-18 | 82 | 82 | PASSED | |
| 18 | aime2025-04 | 117 | 117 | PASSED | |
| 21 | aime2025-19 | 106 | 106 | PASSED | |
| 24 | aime2025-05 | 279 | 279 | PASSED | |
| 27 | aime2025-06 | 504 | 504 | PASSED | |
| 30 | aime2025-21 | 293 | 293 | PASSED | |
| 33 | aime2025-07 | 5 | 821 | FAILED | gen truncated at 16,000 tok |
| 36 | aime2025-22 | 237 | 237 | PASSED | |
| 39 | aime2025-08 | 77 | 77 | PASSED | |
| 42 | aime2025-09 | 62 | 62 | PASSED | |
| 45 | aime2025-24 | 149 | 149 | PASSED | |
| 48 | aime2025-10 | 59049 | 81 | FAILED | gen truncated at 16,000 tok |
| 51 | aime2025-25 | 907 | 907 | PASSED | |
| 54 | aime2025-26 | 113 | 113 | PASSED | |
| 57 | aime2025-12 | 510 | 510 | PASSED | |
| 60 | aime2025-27 | 19 | 19 | PASSED | |
| 63 | aime2025-13 | 2 | 204 | FAILED | gen truncated at 16,000 tok |
| 66 | aime2025-28 | 3 | 248 | FAILED | gen truncated at 16,000 tok |
| 69 | aime2025-29 | 104 | 104 | PASSED | |
| 72 | aime2025-15 | 0 | 735 | FAILED | gen truncated at 16,000 tok |
| 75 | aime2025-30 | 240 | 240 | PASSED | |
All 5 AIME failures are generation truncation — the model hit the 16,000 token budget before finishing the chain-of-thought and producing a final answer. The budget was auto-sized by ds4-eval as largest_prompt + 16,000.
GPQA Diamond (22/25, 88.0%)
| # | Test | Given | Correct | Result |
|---|---|---|---|---|
| 1 | recNu3MXkvWUzHZr9 | B | B | PASSED |
| 4 | recoiTJPGUmzAkief | C | C | PASSED |
| 7 | rec4UqStf9WUVif1f | B | B | PASSED |
| 10 | recgI6tUQ7RLJRWGx | B | B | PASSED |
| 13 | recDytVnNYZe2HuUU | A | A | PASSED |
| 16 | recNFJjE5PPTqVJGv | D | D | PASSED |
| 19 | rec2UlKqC6RFHdcro | B | B | PASSED |
| 22 | recv7GsQg3f0fvB1f | B | B | PASSED |
| 25 | recrHBEJJoDTV05JR | C | C | PASSED |
| 28 | recb80OwMgNnceA9t | D | D | PASSED |
| 31 | recA1i5ZAh0Uzclxp | C | C | PASSED |
| 34 | recqGD3fxPCI59vPQ | B | B | PASSED |
| 37 | rechKl68Uc6H7vU0N | A | A | PASSED |
| 40 | rec1zl5LvaatzGhFt | B | B | PASSED |
| 43 | recTs7qzfJs6kfLUK | A | A | PASSED |
| 46 | rec32C1ZEapBnCC0E | C | C | PASSED |
| 49 | recZWeueB7lSPR6wN | B | B | PASSED |
| 52 | recVvpD8miVjmmyfe | C | C | PASSED |
| 55 | recAAJoHMW45Lv5je | D | D | PASSED |
| 58 | reckEnrOPFT9Ru7tW | D | C | FAILED |
| 61 | rec8nshandHARTkrg | A | A | PASSED |
| 64 | recFaL6j8UMhutXrc | A | A | PASSED |
| 67 | reczQ4I0VpENdMtIj | A | C | FAILED |
| 70 | recWxGU8Q4YReJ1tb | B | C | FAILED |
| 73 | recMicVBcqy1xM1jq | B | B | PASSED |
SuperGPQA (22/25, 88.0%)
| # | Test | Given | Correct | Result |
|---|---|---|---|---|
| 2 | 001b51d76b4d | C | C | PASSED |
| 5 | b7e20eac9876 | J | J | PASSED |
| 8 | 4a1d1780a93f | E | E | PASSED |
| 11 | 6082513c8dba | A | A | PASSED |
| 14 | bebf1ed45ae1 | J | J | PASSED |
| 17 | 7ca71b863277 | I | I | PASSED |
| 20 | d44b94f77493 | E | E | PASSED |
| 23 | febe406f44d7 | B | B | PASSED |
| 26 | 31950dc80ded | C | C | PASSED |
| 29 | 0f14cd17be17 | C | C | PASSED |
| 32 | cef9bcc08743 | J | J | PASSED |
| 35 | 9f93aa2cfdb5 | I | I | PASSED |
| 38 | 97ad69dda7b2 | E | E | PASSED |
| 41 | e78e4e539d6f | E | H | FAILED |
| 44 | 8483667a25e7 | A | A | PASSED |
| 47 | e5ed76ef9814 | A | A | PASSED |
| 50 | fd7924876c48 | H | H | PASSED |
| 53 | 6bfe7d19299d | I | I | PASSED |
| 56 | e1825d70c584 | J | J | PASSED |
| 59 | ab430ac3f18e | A | A | PASSED |
| 62 | e8c5da5ca406 | F | F | PASSED |
| 65 | 05efdc6fb240 | H | H | PASSED |
| 68 | ba52e06cbe1a | H | H | PASSED |
| 71 | 591a77df2132 | D | F | FAILED |
| 74 | e780f37a5baa | J | H | FAILED |
COMPSEC (16/17, 94.1%)
| # | Test | Given | Correct | Result |
|---|---|---|---|---|
| 76 | compsec-076 | 20 | 17-20 | PASSED |
| 77 | compsec-077 | 18,19,20 | 18-20 | PASSED |
| 78 | compsec-078 | 11 | 11 | PASSED |
| 79 | compsec-079 | 0 | 18-19 | FAILED |
| 80 | compsec-080 | 5 | 5-6 | PASSED |
| 81 | compsec-081 | 10 | 10-15 | PASSED |
| 82 | compsec-082 | 9,10 | 9-10 | PASSED |
| 83 | compsec-083 | 10 | 9-11 | PASSED |
| 84 | compsec-084 | 7 | 6-7 | PASSED |
| 85 | compsec-085 | 5 | 5 | PASSED |
| 86 | compsec-086 | 3 | 3,13-15 | PASSED |
| 87 | compsec-087 | 8 | 8,20-22 | PASSED |
| 88 | compsec-088 | 11 | 11 | PASSED |
| 89 | compsec-089 | 10 | 10 | PASSED |
| 90 | compsec-090 | 12 | 12-13 | PASSED |
| 91 | compsec-091 | 3 | 3 | PASSED |
| 92 | compsec-092 | 10,11 | 10-14 | PASSED |
Failure Analysis (12 failed)
| # | Framework | Test | Answer | Expected | Root cause |
|---|---|---|---|---|---|
| 33 | AIME2025 | aime2025-07 | 5 | 821 | Gen truncated at 16k tok |
| 48 | AIME2025 | aime2025-10 | 59049 | 81 | Gen truncated at 16k tok |
| 63 | AIME2025 | aime2025-13 | 2 | 204 | Gen truncated at 16k tok |
| 66 | AIME2025 | aime2025-28 | 3 | 248 | Gen truncated at 16k tok |
| 72 | AIME2025 | aime2025-15 | 0 | 735 | Gen truncated at 16k tok |
| 41 | SuperGPQA | e78e4e53 | E | H | Wrong answer |
| 71 | SuperGPQA | 591a77df | D | F | Wrong answer |
| 74 | SuperGPQA | e780f37a | J | H | Wrong answer |
| 58 | GPQA Diamond | reckEnrOPF | D | C | Wrong answer |
| 67 | GPQA Diamond | reczQ4I0Vp | A | C | Wrong answer |
| 70 | GPQA Diamond | recWxGU8Q4 | B | C | Wrong answer |
| 79 | COMPSEC | compsec-079 | 0 | 18-19 | Wrong answer |
Of the 12 failures:
- 5 are AIME chain-of-thought truncation (context budget = 16,777 tokens, generation budget = 16,000 tokens). The model needed more tokens to finish reasoning. These would likely pass with a larger context window.
- 7 are genuine incorrect answers (3 GPQA, 3 SuperGPQA, 1 COMPSEC).
Excluding truncation failures, the pass rate is 80/87 = 92.0%.
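The aggregate figures can be reproduced from the per-framework counts. The sketch below tallies the pass rates, including the rate with the five truncated AIME runs removed from the denominator.

```python
# (passed, total, truncated) per framework, copied from the tables above.
results = {
    "AIME 2025": (20, 25, 5),
    "GPQA Diamond": (22, 25, 0),
    "SuperGPQA": (22, 25, 0),
    "COMPSEC": (16, 17, 0),
}

passed = sum(p for p, _, _ in results.values())
total = sum(t for _, t, _ in results.values())
truncated = sum(x for _, _, x in results.values())

for name, (p, t, _) in results.items():
    print(f"{name}: {p}/{t} = {100 * p / t:.1f}%")
print(f"Overall: {passed}/{total} = {100 * passed / total:.1f}%")
print(f"Excluding truncation: {passed}/{total - truncated} = "
      f"{100 * passed / (total - truncated):.1f}%")
```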
Reproduction
git clone https://github.com/yuhai-china/ds4 && cd ds4 && git checkout 4expert
make -j$(sysctl -n hw.ncpu)
pip install huggingface_hub
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('cloudyu/DeepSeek-V4-Flash-4Expert-GGUF', 'DeepSeek-V4-Flash-4Expert-Q4K.gguf', local_dir='.')
"
./ds4-eval -m DeepSeek-V4-Flash-4Expert-Q4K.gguf
On Linux, replace the build step with make cpu -j$(nproc). On CUDA systems, use make cuda-generic -j$(nproc).