DeepSeek V4 Flash 4Expert — Q4_K GGUF

4-bit quantized GGUF of the 4Expert variant of DeepSeek V4 Flash, for use with ds4.

Model Summary

| Property | Value |
|---|---|
| Architecture | DeepSeek V4 Flash (MoE + MLA) |
| Top-k | 4 |
| Layers | 43 |
| Hidden dim | 4096 |
| Attention heads | 64 (MLA, head_dim=512, kv_head_dim=512) |
| Routed experts | 256 (4 active per token) |
| FFN dim | 2048 |
| Shared experts | 1 |
| Vocab size | 129,280 |
| Max context | 65,536 |
| Quantization | Q4_K (4-bit K-quant) |
| File size | 164 GB (~153 GiB) |
| Source safetensors | cloudyu/DeepSeek-V4-Flash-4Expert |

Independent Evaluation Results

We evaluated the model against the original top_k=6 configuration on HumanEval (code generation).

HumanEval (Pass@1)

Eval details

| Configuration | Pass@1 | Generation time |
|---|---|---|
| Top_k=4 (this model) | 95.73% (157/164) | 56.83 s |
| Top_k=6 (original) | 95.73% (157/164) | 64.06 s |
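
The figures in the table can be rechecked with a few lines of arithmetic. The sketch below uses only the numbers reported above. The function names are illustrative and are not part of ds4 or any evaluation harness.

```python
# Figures copied from the HumanEval table above.
PASSED, TOTAL = 157, 164
TIME_TOP_K4_S, TIME_TOP_K6_S = 56.83, 64.06


def pass_at_1(passed: int, total: int) -> float:
    """Pass@1 as a percentage when each problem is attempted once."""
    return 100.0 * passed / total


def time_saved_pct(new_s: float, old_s: float) -> float:
    """Relative reduction in generation time, as a percentage of the old time."""
    return 100.0 * (old_s - new_s) / old_s


if __name__ == "__main__":
    print(f"pass@1: {pass_at_1(PASSED, TOTAL):.2f}%")
    print(f"time saved with top_k=4: {time_saved_pct(TIME_TOP_K4_S, TIME_TOP_K6_S):.1f}%")
```

With these inputs, both configurations score the same Pass@1 while the top_k=4 run takes roughly 11% less generation time.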

GGUF Evaluation Report: 4Expert Q4_K GGUF (by ds4-eval)

Model: cloudyu/DeepSeek-V4-Flash-4Expert-GGUF
Source safetensors: cloudyu/DeepSeek-V4-Flash-4Expert
Date: 2026-06-29

Summary

| Framework | Passed | Total | Pass rate |
|---|---|---|---|
| AIME 2025 | 20 | 25 | 80% |
| GPQA Diamond | 22 | 25 | 88% |
| SuperGPQA | 22 | 25 | 88% |
| COMPSEC | 16 | 17 | 94.12% |
| TOTAL | 80 | 92 | 87% |

Quantization Strategy

Compiled with deepseek4-quantize using a layer-specific policy:

| Layer type | Quant | Affected tensors |
|---|---|---|
| Routed experts (w1/w2/w3) | Q4_K | blk.*.ffn_{gate,down,up}_exps.weight |
| Attention projections | Q8_0 | attn_q_a, attn_q_b, attn_kv, attn_output_a, attn_output_b |
| Shared expert FFN | Q8_0 | ffn_{gate,up,down}_shexp.weight |
| Output projection | Q8_0 | output.weight |
| Embedding | F16 | token_embd.weight |
| Attention (other) | F16 | compressor, indexer, sinks, norms |
| Dense (other) | F16 | hyper-connections, remaining 2D weights |
| 1D tensors | F32 | layer norms, RMS norms, scales, biases (never quantized) |

How to Use

Requires ds4 built from the 4Expert PR. The upstream ds4 defaults to 6 active experts and cannot load this GGUF. The PR is submitted upstream; until merged, use the branch:

git clone https://github.com/yuhai-china/ds4
cd ds4
git checkout 4expert
make cpu -j$(nproc)            # Linux
make -C gguf-tools -j$(nproc)

Then run:

ln -sfn DeepSeek-V4-Flash-4Expert-Q4K.gguf ds4flash.gguf
./ds4 -p "The weather is great today" -n 100

Reproduce: Convert Safetensors to This GGUF

This GGUF was produced by the following pipeline. Anyone with the source safetensors can reproduce it.

One-Click Script

git clone https://github.com/yuhai-china/ds4
cd ds4
git checkout 4expert
bash test-4expert.sh /path/to/DeepSeek-V4-Flash-4Expert $(nproc)

This runs all 5 steps (clone, build, download, convert, test) in one go.

Manual Steps

For transparency, here is exactly how this GGUF was produced.

Step 1 — Download source safetensors

pip install huggingface_hub
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('cloudyu/DeepSeek-V4-Flash-4Expert', local_dir='./DeepSeek-V4-Flash-4Expert')
"

Step 2 — Build ds4 and gguf-tools

git clone https://github.com/yuhai-china/ds4
cd ds4
git checkout 4expert
make -C gguf-tools -j$(nproc)
make cpu -j$(nproc)

Step 3 — Generate GGUF template from safetensors metadata

python3 gguf-tools/gen_gguf_template.py \
  --hf ./DeepSeek-V4-Flash-4Expert \
  --out template.gguf

The template (~5.6 MB) contains metadata, tokenizer, and tensor descriptors (names, shapes, types) but no weight data. It describes where each tensor goes in the final GGUF.
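
One quick way to confirm that a template (or the final file) is a plausible GGUF is to read its fixed-size header. The sketch below assumes the GGUF v2+ layout: a 4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata key/value counts, all little-endian. The function and class names are illustrative and are not part of gguf-tools.

```python
import struct
from typing import NamedTuple

HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor_count, metadata_kv_count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def parse_gguf_header(raw: bytes) -> GGUFHeader:
    """Decode the fixed header from the first bytes of a GGUF file."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("file too short to contain a GGUF header")
    magic, version, n_tensors, n_kv = struct.unpack(HEADER_FORMAT, raw[:HEADER_SIZE])
    if magic != b"GGUF":
        raise ValueError(f"unexpected magic {magic!r}")
    return GGUFHeader(version, n_tensors, n_kv)


def read_gguf_header(path: str) -> GGUFHeader:
    """Read and decode the header of the GGUF file at `path`."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER_SIZE))
```

Calling `read_gguf_header("template.gguf")` should then report the number of tensor descriptors and metadata entries the template carries, without touching any weight data.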

Step 4 — Quantize weights into the final GGUF

./gguf-tools/deepseek4-quantize \
  --hf ./DeepSeek-V4-Flash-4Expert \
  --template template.gguf \
  --out DeepSeek-V4-Flash-4Expert-Q4K.gguf \
  --experts q4_k \
  --attention-proj q8_0 \
  --attention f16 \
  --shared q8_0 \
  --output q8_0 \
  --embedding f16 \
  --dense f16 \
  --threads $(nproc)

The quantizer reads each safetensors tensor, dequantizes from the storage format (F8_E4M3 or packed FP4 with E8M0 scales for experts, BF16/F32 for others), applies the target quantization, and writes to the output GGUF. Output is ~153 GiB.
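
As a rough sanity check on the output size, the routed experts alone can be estimated from the model summary. The sketch assumes that every one of the 43 layers carries 256 routed experts with three 4096 x 2048 matrices each, and that Q4_K costs 4.5 bits per weight (a 144-byte super-block per 256 weights). Both are assumptions for illustration, not values read from the file.

```python
LAYERS = 43
ROUTED_EXPERTS = 256
MATRICES_PER_EXPERT = 3      # w1, w2, w3
HIDDEN_DIM = 4096
FFN_DIM = 2048
Q4_K_BITS_PER_WEIGHT = 4.5   # assumption: 144 bytes per 256-weight super-block


def routed_expert_params() -> int:
    """Total routed-expert weights under the assumptions stated above."""
    return LAYERS * ROUTED_EXPERTS * MATRICES_PER_EXPERT * HIDDEN_DIM * FFN_DIM


def size_gib(params: int, bits_per_weight: float) -> float:
    """Storage in GiB for `params` weights at the given bit width."""
    return params * bits_per_weight / 8 / 2**30


def gib_to_gb(gib: float) -> float:
    """Convert binary gibibytes to decimal gigabytes."""
    return gib * 2**30 / 1e9


if __name__ == "__main__":
    experts_gib = size_gib(routed_expert_params(), Q4_K_BITS_PER_WEIGHT)
    print(f"routed experts at Q4_K: {experts_gib:.1f} GiB")
    print(f"153 GiB in decimal units: {gib_to_gb(153):.0f} GB")
```

Under these assumptions the routed experts account for about 145 GiB, which leaves the remainder of the ~153 GiB for the Q8_0, F16, and F32 tensors.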

Step 5 — Test the GGUF

ln -sfn DeepSeek-V4-Flash-4Expert-Q4K.gguf ds4flash.gguf
./ds4 -p "The weather is great today" -n 100

Expected output: coherent English text continuation at ~26 t/s (CPU, 20 threads).

Technical Notes

Why Q4_K for experts and F32 for norms?

deepseek4-quantize applies the quantization policy selectively, based on each tensor's rank and name:

  • 1D tensors (norms, scales, biases): the policy never overrides the template type. They stay F32 regardless of what --dense or --attention say.
  • 2D+ tensors: the policy applies the most specific matching flag:
    • Expert tensors (blk.*.ffn_*_exps.weight) → --experts
    • Attention projections (attn_q_a/b, attn_kv, attn_output_a/b) → --attention-proj
    • Shared expert weights → --shared
    • Output head → --output
    • Token embedding → --embedding
    • Other attention/indexer/compressor → --attention
    • Everything else 2D+ → --dense
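
The matching order above can be sketched as a small classifier. This is an illustration of the described precedence, not the logic in deepseek4-quantize.c. The substring checks for the "other attention" group are guesses at how those tensors are named.

```python
import fnmatch

ATTN_PROJ = ("attn_q_a", "attn_q_b", "attn_kv", "attn_output_a", "attn_output_b")


def pick_flag(name: str, n_dims: int) -> str:
    """Return the policy flag that would govern a tensor, most specific first."""
    if n_dims < 2:
        return "template"  # 1D tensors keep the template type (F32)
    if fnmatch.fnmatchcase(name, "blk.*.ffn_*_exps.weight"):
        return "--experts"
    if any(f".{proj}." in name for proj in ATTN_PROJ):
        return "--attention-proj"
    if name.endswith("_shexp.weight"):
        return "--shared"
    if name == "output.weight":
        return "--output"
    if name == "token_embd.weight":
        return "--embedding"
    # Assumption: remaining attention-side tensors are recognizable by name.
    if ".attn_" in name or "indexer" in name or "compressor" in name:
        return "--attention"
    return "--dense"


if __name__ == "__main__":
    for tensor, dims in [
        ("blk.0.ffn_gate_exps.weight", 3),
        ("blk.0.attn_q_a.weight", 2),
        ("blk.0.ffn_gate_shexp.weight", 2),
        ("output_norm.weight", 1),
    ]:
        print(f"{tensor}: {pick_flag(tensor, dims)}")
```

Checking the rank first is what keeps norms and biases at F32 even when a name would otherwise match one of the 2D+ rules.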

How the template maps HF names to GGUF names

gen_gguf_template.py uses the same layer_map table as deepseek4-quantize.c. For example:

| HF safetensors name | GGUF name |
|---|---|
| layers.0.attn.wq_a.weight | blk.0.attn_q_a.weight |
| layers.0.attn.wkv.weight | blk.0.attn_kv.weight |
| layers.0.ffn.experts.0.w1.weight | blk.0.ffn_gate_exps.weight (all 256 experts stacked) |
| layers.0.ffn.shared_experts.w1.weight | blk.0.ffn_gate_shexp.weight |
| embed.weight | token_embd.weight |
| norm.weight | output_norm.weight |

The script also automatically converts the ffn.gate.tid2eid routing table from I64 to I32, which is the only non-F32/F16 tensor type override in the template.
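
The name translation can be illustrated with a small lookup covering only the rows shown in the table. This is a sketch, not the actual layer_map. The w2/w3 roles (down/up) are inferred from the tensor order in the quantization table, and the returned expert index is a hypothetical way to express where a slice lands in the stacked tensor.

```python
import re
from typing import Optional, Tuple

GLOBAL_NAMES = {
    "embed.weight": "token_embd.weight",
    "norm.weight": "output_norm.weight",
}
PER_LAYER_NAMES = {
    "attn.wq_a.weight": "attn_q_a.weight",
    "attn.wkv.weight": "attn_kv.weight",
    "ffn.shared_experts.w1.weight": "ffn_gate_shexp.weight",
}
EXPERT_ROLE = {"w1": "gate", "w2": "down", "w3": "up"}  # w2/w3 inferred

EXPERT_RE = re.compile(r"ffn\.experts\.(\d+)\.(w[123])\.weight")


def map_name(hf_name: str) -> Tuple[str, Optional[int]]:
    """Return (gguf_name, expert_index); expert_index is None for unstacked tensors."""
    if hf_name in GLOBAL_NAMES:
        return GLOBAL_NAMES[hf_name], None
    layer_match = re.fullmatch(r"layers\.(\d+)\.(.+)", hf_name)
    if layer_match is None:
        raise KeyError(hf_name)
    layer, rest = layer_match.groups()
    expert_match = EXPERT_RE.fullmatch(rest)
    if expert_match:
        # Every expert of a layer shares one GGUF tensor; the index selects the slice.
        role = EXPERT_ROLE[expert_match.group(2)]
        return f"blk.{layer}.ffn_{role}_exps.weight", int(expert_match.group(1))
    return f"blk.{layer}.{PER_LAYER_NAMES[rest]}", None
```

For example, `map_name("layers.0.ffn.experts.7.w1.weight")` yields `("blk.0.ffn_gate_exps.weight", 7)`, reflecting that 256 per-expert HF tensors collapse into one stacked GGUF tensor.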

4Expert vs 6Expert: What Changed in ds4

The upstream ds4 hardcodes 6 active routed experts per token (n_expert_used = 6). For this 4Expert model to work:

  1. Default changed to 4: DS4_SHAPE_FLASH.n_expert_used and g_ds4_shape.n_expert_used now default to 4.
  2. Backward compatible: when loading a GGUF with n_expert_used = 6 in its metadata, ds4 preserves 6 at runtime, so old 6-expert GGUF files continue to work.
  3. Template generator: gen_gguf_template.py handles the full tensor mapping, replacing manual template construction.
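
The backward-compatibility rule in item 2 amounts to "metadata wins over the default". A minimal sketch of that resolution follows. The plain `n_expert_used` dictionary key is a stand-in for whatever metadata key ds4 actually reads, and the range check is an added assumption.

```python
DEFAULT_N_EXPERT_USED = 4   # new default on the 4expert branch
N_ROUTED_EXPERTS = 256      # from the model summary


def resolve_n_expert_used(metadata: dict) -> int:
    """Prefer the value stored in the GGUF metadata; fall back to the default."""
    value = int(metadata.get("n_expert_used", DEFAULT_N_EXPERT_USED))
    if not 1 <= value <= N_ROUTED_EXPERTS:
        raise ValueError(f"n_expert_used out of range: {value}")
    return value


if __name__ == "__main__":
    print(resolve_n_expert_used({}))                    # no metadata entry
    print(resolve_n_expert_used({"n_expert_used": 6}))  # older 6-expert GGUF
```

Keeping the metadata authoritative is what lets one binary serve both 4-expert and 6-expert files.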

Full details: PR #474

GGUF Evaluation Report: DeepSeek V4 Flash 4Expert Q4_K GGUF (by ds4-eval)

Model: cloudyu/DeepSeek-V4-Flash-4Expert-GGUF
Source safetensors: cloudyu/DeepSeek-V4-Flash-4Expert
Date: 2026-06-29

Summary

| Framework | Passed | Total | Pass rate |
|---|---|---|---|
| AIME 2025 | 20 | 25 | 80% |
| GPQA Diamond | 22 | 25 | 88% |
| SuperGPQA | 22 | 25 | 88% |
| COMPSEC | 16 | 17 | 94.12% |
| TOTAL | 80 | 92 | 87% |

80 of 92 tests passed. The 12 failures are detailed below.

Evaluation Methodology

All tests were run using ds4-eval, the built-in evaluation tool shipped with the ds4 inference engine. Each test case consists of a prompt and a set of valid ground-truth answers (e.g., A, B, C, D for multiple choice; integer answers for AIME; ranges or enumerations for COMPSEC).

The evaluator feeds the prompt to the model, reads the generated completion, and extracts the final answer using framework-specific parsers. A test passes if the extracted answer matches any of the valid ground-truth values.

Evaluation Frameworks

  • AIME 2025 (25 tests): American Invitational Mathematics Examination. Integer answers (0–999). Tests mathematical reasoning.
  • GPQA Diamond (25 tests): Graduate-level multiple-choice science questions. Options A–D. Tests deep domain knowledge.
  • SuperGPQA (25 tests): Expanded graduate-level multiple choice. Options A–J. Broader and harder than GPQA.
  • COMPSEC (17 tests): Computer security questions. Answers are integer codes or ranges (e.g., 5, 10-15, 3,13-15). Tests specialized security knowledge.

Scoring Rules

  • AIME: exact integer match.
  • GPQA / SuperGPQA: exact option letter match (A–D for GPQA Diamond, A–J for SuperGPQA).
  • COMPSEC: answer must fall within one of the accepted integer values or ranges.
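
The COMPSEC rule can be sketched as a range parser plus a membership check. This is an illustration and not the ds4-eval parser. It assumes that when the model gives several values (for example 18,19,20), every one of them must fall inside the accepted set, which is consistent with the results reported here but is not confirmed by the source.

```python
from typing import Set


def parse_accepted(spec: str) -> Set[int]:
    """Expand a spec such as '3,13-15' into the set {3, 13, 14, 15}."""
    accepted: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            accepted.update(range(int(low), int(high) + 1))
        else:
            accepted.add(int(part))
    return accepted


def compsec_passes(given: str, spec: str) -> bool:
    """True when every integer in `given` is an accepted value."""
    try:
        values = {int(v) for v in given.split(",")}
    except ValueError:
        return False  # nothing parseable was extracted
    return values <= parse_accepted(spec)


if __name__ == "__main__":
    print(compsec_passes("20", "17-20"))
    print(compsec_passes("18,19,20", "18-20"))
    print(compsec_passes("0", "18-19"))
```

The AIME and multiple-choice rules reduce to plain equality on the extracted integer or letter, so they need no comparable helper.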

Hardware & Build

| Component | Detail |
|---|---|
| Device | Apple M2 Ultra |
| RAM | 192 GiB unified memory |
| Backend | Metal (ds4 GPU backend) |
| Operating system | macOS |

Build Configuration

git clone https://github.com/yuhai-china/ds4
cd ds4
git checkout 4expert
make -j$(sysctl -n hw.ncpu)

No flags passed — standard release build (-O3 -ffast-math -mcpu=native).

Runtime Configuration

GGUF loaded via memory-mapped I/O. Key runtime parameters from ds4-eval output:

ds4: Metal device Apple M2 Ultra, 192.00 GiB RAM
ds4: Metal 4 tensor API disabled for pre-M5/pre-A19 devices
ds4: drift-patch flags hc_stable=on norm_unify=on kv_raw_f32=off rope_exp2_log2=off
ds4-eval: context auto-sized to 16777 tokens
ds4-eval: context buffers 630.30 MiB
ds4-eval: model shape DeepSeek V4 Flash

No environment variables or config overrides were set beyond the default.

Detailed Results

AIME 2025 (20/25, 80.0%)

| # | Test | Given | Correct | Result | Note |
|---|---|---|---|---|---|
| 3 | aime2025-01 | 70 | 70 | PASSED | |
| 6 | aime2025-16 | 468 | 468 | PASSED | |
| 9 | aime2025-02 | 588 | 588 | PASSED | |
| 12 | aime2025-03 | 16 | 16 | PASSED | |
| 15 | aime2025-18 | 82 | 82 | PASSED | |
| 18 | aime2025-04 | 117 | 117 | PASSED | |
| 21 | aime2025-19 | 106 | 106 | PASSED | |
| 24 | aime2025-05 | 279 | 279 | PASSED | |
| 27 | aime2025-06 | 504 | 504 | PASSED | |
| 30 | aime2025-21 | 293 | 293 | PASSED | |
| 33 | aime2025-07 | 5 | 821 | FAILED | gen truncated at 16,000 tok |
| 36 | aime2025-22 | 237 | 237 | PASSED | |
| 39 | aime2025-08 | 77 | 77 | PASSED | |
| 42 | aime2025-09 | 62 | 62 | PASSED | |
| 45 | aime2025-24 | 149 | 149 | PASSED | |
| 48 | aime2025-10 | 59049 | 81 | FAILED | gen truncated at 16,000 tok |
| 51 | aime2025-25 | 907 | 907 | PASSED | |
| 54 | aime2025-26 | 113 | 113 | PASSED | |
| 57 | aime2025-12 | 510 | 510 | PASSED | |
| 60 | aime2025-27 | 19 | 19 | PASSED | |
| 63 | aime2025-13 | 2 | 204 | FAILED | gen truncated at 16,000 tok |
| 66 | aime2025-28 | 3 | 248 | FAILED | gen truncated at 16,000 tok |
| 69 | aime2025-29 | 104 | 104 | PASSED | |
| 72 | aime2025-15 | 0 | 735 | FAILED | gen truncated at 16,000 tok |
| 75 | aime2025-30 | 240 | 240 | PASSED | |

All 5 AIME failures are generation truncation — the model hit the 16,000 token budget before finishing the chain-of-thought and producing a final answer. The budget was auto-sized by ds4-eval as largest_prompt + 16,000.

GPQA Diamond (22/25, 88.0%)

| # | Test | Given | Correct | Result |
|---|---|---|---|---|
| 1 | recNu3MXkvWUzHZr9 | B | B | PASSED |
| 4 | recoiTJPGUmzAkief | C | C | PASSED |
| 7 | rec4UqStf9WUVif1f | B | B | PASSED |
| 10 | recgI6tUQ7RLJRWGx | B | B | PASSED |
| 13 | recDytVnNYZe2HuUU | A | A | PASSED |
| 16 | recNFJjE5PPTqVJGv | D | D | PASSED |
| 19 | rec2UlKqC6RFHdcro | B | B | PASSED |
| 22 | recv7GsQg3f0fvB1f | B | B | PASSED |
| 25 | recrHBEJJoDTV05JR | C | C | PASSED |
| 28 | recb80OwMgNnceA9t | D | D | PASSED |
| 31 | recA1i5ZAh0Uzclxp | C | C | PASSED |
| 34 | recqGD3fxPCI59vPQ | B | B | PASSED |
| 37 | rechKl68Uc6H7vU0N | A | A | PASSED |
| 40 | rec1zl5LvaatzGhFt | B | B | PASSED |
| 43 | recTs7qzfJs6kfLUK | A | A | PASSED |
| 46 | rec32C1ZEapBnCC0E | C | C | PASSED |
| 49 | recZWeueB7lSPR6wN | B | B | PASSED |
| 52 | recVvpD8miVjmmyfe | C | C | PASSED |
| 55 | recAAJoHMW45Lv5je | D | D | PASSED |
| 58 | reckEnrOPFT9Ru7tW | D | C | FAILED |
| 61 | rec8nshandHARTkrg | A | A | PASSED |
| 64 | recFaL6j8UMhutXrc | A | A | PASSED |
| 67 | reczQ4I0VpENdMtIj | A | C | FAILED |
| 70 | recWxGU8Q4YReJ1tb | B | C | FAILED |
| 73 | recMicVBcqy1xM1jq | B | B | PASSED |

SuperGPQA (22/25, 88.0%)

| # | Test | Given | Correct | Result |
|---|---|---|---|---|
| 2 | 001b51d76b4d | C | C | PASSED |
| 5 | b7e20eac9876 | J | J | PASSED |
| 8 | 4a1d1780a93f | E | E | PASSED |
| 11 | 6082513c8dba | A | A | PASSED |
| 14 | bebf1ed45ae1 | J | J | PASSED |
| 17 | 7ca71b863277 | I | I | PASSED |
| 20 | d44b94f77493 | E | E | PASSED |
| 23 | febe406f44d7 | B | B | PASSED |
| 26 | 31950dc80ded | C | C | PASSED |
| 29 | 0f14cd17be17 | C | C | PASSED |
| 32 | cef9bcc08743 | J | J | PASSED |
| 35 | 9f93aa2cfdb5 | I | I | PASSED |
| 38 | 97ad69dda7b2 | E | E | PASSED |
| 41 | e78e4e539d6f | E | H | FAILED |
| 44 | 8483667a25e7 | A | A | PASSED |
| 47 | e5ed76ef9814 | A | A | PASSED |
| 50 | fd7924876c48 | H | H | PASSED |
| 53 | 6bfe7d19299d | I | I | PASSED |
| 56 | e1825d70c584 | J | J | PASSED |
| 59 | ab430ac3f18e | A | A | PASSED |
| 62 | e8c5da5ca406 | F | F | PASSED |
| 65 | 05efdc6fb240 | H | H | PASSED |
| 68 | ba52e06cbe1a | H | H | PASSED |
| 71 | 591a77df2132 | D | F | FAILED |
| 74 | e780f37a5baa | J | H | FAILED |

COMPSEC (16/17, 94.1%)

| # | Test | Given | Correct | Result |
|---|---|---|---|---|
| 76 | compsec-076 | 20 | 17-20 | PASSED |
| 77 | compsec-077 | 18,19,20 | 18-20 | PASSED |
| 78 | compsec-078 | 11 | 11 | PASSED |
| 79 | compsec-079 | 0 | 18-19 | FAILED |
| 80 | compsec-080 | 5 | 5-6 | PASSED |
| 81 | compsec-081 | 10 | 10-15 | PASSED |
| 82 | compsec-082 | 9,10 | 9-10 | PASSED |
| 83 | compsec-083 | 10 | 9-11 | PASSED |
| 84 | compsec-084 | 7 | 6-7 | PASSED |
| 85 | compsec-085 | 5 | 5 | PASSED |
| 86 | compsec-086 | 3 | 3,13-15 | PASSED |
| 87 | compsec-087 | 8 | 8,20-22 | PASSED |
| 88 | compsec-088 | 11 | 11 | PASSED |
| 89 | compsec-089 | 10 | 10 | PASSED |
| 90 | compsec-090 | 12 | 12-13 | PASSED |
| 91 | compsec-091 | 3 | 3 | PASSED |
| 92 | compsec-092 | 10,11 | 10-14 | PASSED |

Failure Analysis (12 failed)

| # | Framework | Test | Answer | Expected | Root cause |
|---|---|---|---|---|---|
| 33 | AIME2025 | aime2025-07 | 5 | 821 | Gen truncated at 16k tok |
| 48 | AIME2025 | aime2025-10 | 59049 | 81 | Gen truncated at 16k tok |
| 63 | AIME2025 | aime2025-13 | 2 | 204 | Gen truncated at 16k tok |
| 66 | AIME2025 | aime2025-28 | 3 | 248 | Gen truncated at 16k tok |
| 72 | AIME2025 | aime2025-15 | 0 | 735 | Gen truncated at 16k tok |
| 41 | SuperGPQA | e78e4e53 | E | H | Wrong answer |
| 71 | SuperGPQA | 591a77df | D | F | Wrong answer |
| 74 | SuperGPQA | e780f37a | J | H | Wrong answer |
| 58 | GPQA Diamond | reckEnrOPF | D | C | Wrong answer |
| 67 | GPQA Diamond | reczQ4I0Vp | A | C | Wrong answer |
| 70 | GPQA Diamond | recWxGU8Q4 | B | C | Wrong answer |
| 79 | COMPSEC | compsec-079 | 0 | 18-19 | Wrong answer |

Of the 12 failures:

  • 5 are AIME chain-of-thought truncations (context budget = 16,777 tokens, generation budget = 16,000 tokens). The model needed more tokens to finish reasoning, so these would likely pass with a larger generation budget and a correspondingly larger context window.
  • 7 are genuine incorrect answers (3 GPQA, 3 SuperGPQA, 1 COMPSEC).

Excluding truncation failures, the pass rate is 80/87 = 92.0%.
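
The aggregate figures can be rederived from the per-framework counts. The sketch below uses only numbers reported in this document. The variable and function names are illustrative.

```python
# (passed, total) per framework, copied from the summary table.
RESULTS = {
    "AIME 2025": (20, 25),
    "GPQA Diamond": (22, 25),
    "SuperGPQA": (22, 25),
    "COMPSEC": (16, 17),
}
TRUNCATED_FAILURES = 5  # AIME runs cut off at the generation budget


def rate(passed: int, total: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passed / total


total_passed = sum(passed for passed, _ in RESULTS.values())
total_tests = sum(total for _, total in RESULTS.values())

if __name__ == "__main__":
    for name, (passed, total) in RESULTS.items():
        print(f"{name}: {passed}/{total} = {rate(passed, total):.1f}%")
    print(f"TOTAL: {total_passed}/{total_tests} = {rate(total_passed, total_tests):.1f}%")
    answered = total_tests - TRUNCATED_FAILURES
    print(f"excluding truncation: {total_passed}/{answered} = {rate(total_passed, answered):.1f}%")
```

Dropping the five truncated runs from the denominator gives 80/87, which is where the 92.0% figure comes from.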

Reproduction

git clone https://github.com/yuhai-china/ds4 && cd ds4 && git checkout 4expert
make -j$(sysctl -n hw.ncpu)

pip install huggingface_hub
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('cloudyu/DeepSeek-V4-Flash-4Expert-GGUF', 'DeepSeek-V4-Flash-4Expert-Q4K.gguf', local_dir='.')
"

./ds4-eval -m DeepSeek-V4-Flash-4Expert-Q4K.gguf

On Linux, replace the build step with make cpu -j$(nproc). On CUDA systems, use make cuda-generic -j$(nproc).
