DeepSeek V4 Flash 4Expert — Q4_K GGUF

4-bit quantized GGUF of the 4Expert variant of DeepSeek V4 Flash, for use with ds4.

Model Summary

Property	Value
Architecture	DeepSeek V4 Flash (MoE + MLA)
Top-k (active experts)	4
Layers	43
Hidden dim	4096
Attention heads	64 (MLA, head_dim=512, kv_head_dim=512)
Routed experts	256 (4 active per token)
FFN dim	2048
Shared experts	1
Vocab size	129,280
Max context	65,536
Quantization	Q4_K (4-bit K-quant)
File size	164 GB (~153 GiB)
Source safetensors	cloudyu/DeepSeek-V4-Flash-4Expert

Independent Evaluation Results

We evaluated this model against the original top_k=6 configuration on HumanEval (code generation).

HumanEval (Pass@1)

Eval details

Configuration	Pass@1	Generation Time
Top_k=4 (this model)	95.73% (157/164)	56.83s
Top_k=6 (original)	95.73% (157/164)	64.06s
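
As a quick reading of the table above, the relative difference in generation time can be computed directly from the two timings. This is only arithmetic on the reported numbers, not an additional measurement, and the variable names are illustrative.

```python
# Timings copied from the HumanEval table above (seconds).
time_top_k4 = 56.83
time_top_k6 = 64.06

# Fraction of generation time saved by routing to 4 experts instead of 6.
time_saved = (time_top_k6 - time_top_k4) / time_top_k6
# Equivalent ratio: how many times faster the 4-expert run finished.
speedup = time_top_k6 / time_top_k4

print(f"time saved: {time_saved:.1%}, speedup: {speedup:.2f}x")
```

With these inputs the 4-expert configuration finishes in roughly 11% less time at the same Pass@1.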

GGUF Evaluation Report: 4Expert Q4_K GGUF (by ds4-eval)

Model: cloudyu/DeepSeek-V4-Flash-4Expert-GGUF
Source safetensors: cloudyu/DeepSeek-V4-Flash-4Expert
Date: 2026-06-29

Summary

Framework	Passed	Total	Pass Rate
AIME 2025	20	25	80%
GPQA Diamond	22	25	88%
SuperGPQA	22	25	88%
COMPSEC	16	17	94.12%
TOTAL	80	92	87%

Quantization Strategy

Quantized with deepseek4-quantize using a layer-specific policy:

Layer type	Quant	Affected tensors
Routed experts (w1/w2/w3)	Q4_K	`blk.*.ffn_{gate,down,up}_exps.weight`
Attention projections	Q8_0	`attn_q_a`, `attn_q_b`, `attn_kv`, `attn_output_a`, `attn_output_b`
Shared expert FFN	Q8_0	`ffn_{gate,up,down}_shexp.weight`
Output projection	Q8_0	`output.weight`
Embedding	F16	`token_embd.weight`
Attention (other)	F16	compressor, indexer, sinks, norms
Dense (other)	F16	hyper-connections, remaining 2D weights
1D tensors	F32	layer norms, RMS norms, scales, biases (never quantized)

How to Use

Requires ds4 built from the 4Expert PR. Upstream ds4 defaults to 6 active experts and cannot load this GGUF. The PR has been submitted upstream; until it is merged, use the branch:

git clone https://github.com/yuhai-china/ds4
cd ds4
git checkout 4expert
make cpu -j$(nproc)            # Linux
make -C gguf-tools -j$(nproc)

Then run:

ln -sfn DeepSeek-V4-Flash-4Expert-Q4K.gguf ds4flash.gguf
./ds4 -p "The weather is great today" -n 100

Reproduce: Convert Safetensors to This GGUF

This GGUF was produced by the following pipeline. Anyone with the source safetensors can reproduce it.

One-Click Script

git clone https://github.com/yuhai-china/ds4
cd ds4
git checkout 4expert
bash test-4expert.sh /path/to/DeepSeek-V4-Flash-4Expert $(nproc)

This runs all 5 steps (clone, build, download, convert, test) in one go.

Manual Steps

For transparency, here is exactly how this GGUF was produced.

Step 1 — Download source safetensors

pip install huggingface_hub
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('cloudyu/DeepSeek-V4-Flash-4Expert', local_dir='./DeepSeek-V4-Flash-4Expert')
"

Step 2 — Build ds4 and gguf-tools

git clone https://github.com/yuhai-china/ds4
cd ds4
git checkout 4expert
make -C gguf-tools -j$(nproc)
make cpu -j$(nproc)

Step 3 — Generate GGUF template from safetensors metadata

python3 gguf-tools/gen_gguf_template.py \
  --hf ./DeepSeek-V4-Flash-4Expert \
  --out template.gguf

The template (~5.6 MB) contains metadata, tokenizer, and tensor descriptors (names, shapes, types) but no weight data. It describes where each tensor goes in the final GGUF.
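
One way to sanity-check the template is to read its header. The sketch below assumes `template.gguf` follows the standard GGUF v2/v3 header layout (4-byte magic, then little-endian uint32 version, uint64 tensor count, uint64 metadata key/value count). The function name `read_gguf_header` is made up for this example and is not part of gguf-tools.

```python
import struct

def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF header."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        # Assumed GGUF v2+ layout: uint32 version, uint64 tensor count,
        # uint64 metadata KV count, all little-endian (20 bytes total).
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return version, tensor_count, kv_count

# Example (run after Step 3 has produced the file):
# print(read_gguf_header("template.gguf"))
```

The tensor count should match the number of tensor descriptors the template is expected to carry, even though no weight data has been written yet.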

Step 4 — Quantize weights into the final GGUF

./gguf-tools/deepseek4-quantize \
  --hf ./DeepSeek-V4-Flash-4Expert \
  --template template.gguf \
  --out DeepSeek-V4-Flash-4Expert-Q4K.gguf \
  --experts q4_k \
  --attention-proj q8_0 \
  --attention f16 \
  --shared q8_0 \
  --output q8_0 \
  --embedding f16 \
  --dense f16 \
  --threads $(nproc)

The quantizer reads each safetensors tensor, dequantizes from the storage format (F8_E4M3 or packed FP4 with E8M0 scales for experts, BF16/F32 for others), applies the target quantization, and writes to the output GGUF. Output is ~153 GiB.
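
As a rough plausibility check on the output size, the effective bits per weight can be estimated from the approximate file size and the 284B parameter count listed on the model page. This is a back-of-the-envelope sketch: it ignores metadata and tokenizer overhead and treats both inputs as approximate.

```python
GIB = 2**30

file_size_bytes = 153 * GIB  # "~153 GiB" from the step above (approximate)
n_params = 284e9             # 284B parameters, as listed on the model page

# Average storage cost per parameter across all tensors in the file.
bits_per_weight = file_size_bytes * 8 / n_params
print(f"effective bits per weight: {bits_per_weight:.2f}")
```

The result lands a little above 4.5 bits per weight, which is the nominal cost of Q4_K in the standard ggml block layout. That is what one would expect when the routed experts dominate the parameter count and the remaining tensors are stored at Q8_0, F16, or F32.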

Step 5 — Test the GGUF

ln -sfn DeepSeek-V4-Flash-4Expert-Q4K.gguf ds4flash.gguf
./ds4 -p "The weather is great today" -n 100

Expected output: coherent English text continuation at ~26 t/s (CPU, 20 threads).

Technical Notes

Why Q4_K for experts and F32 for norms?

deepseek4-quantize applies the quantization policy selectively by tensor shape:

1D tensors (norms, scales, biases): the policy never overrides the template type. They stay F32 regardless of what --dense or --attention say.
2D+ tensors: the policy applies the most specific matching flag:
- Expert tensors (blk.*.ffn_*_exps.weight) → --experts
- Attention projections (attn_q_a/b, attn_kv, attn_output_a/b) → --attention-proj
- Shared expert weights → --shared
- Output head → --output
- Token embedding → --embedding
- Other attention/indexer/compressor → --attention
- Everything else 2D+ → --dense
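
The selection rules above can be sketched as a first-match-wins lookup. This is an illustration in Python, not the tool's actual code: `POLICY_RULES`, `FLAGS`, and `pick_quant` are hypothetical names, and the regular expressions are inferred from the tensor names in the Quantization Strategy table, so the real matching logic may differ.

```python
import re

# Flag values as passed in Step 4 of the manual pipeline.
FLAGS = {
    "experts": "q4_k", "attention-proj": "q8_0", "attention": "f16",
    "shared": "q8_0", "output": "q8_0", "embedding": "f16", "dense": "f16",
}

# Ordered from most to least specific; the first matching rule wins.
# The patterns are assumptions based on the names shown in this document.
POLICY_RULES = [
    (re.compile(r"^blk\.\d+\.ffn_(gate|down|up)_exps\.weight$"), "experts"),
    (re.compile(r"attn_(q_a|q_b|kv|output_a|output_b)\.weight$"), "attention-proj"),
    (re.compile(r"ffn_(gate|up|down)_shexp\.weight$"), "shared"),
    (re.compile(r"^output\.weight$"), "output"),
    (re.compile(r"^token_embd\.weight$"), "embedding"),
    (re.compile(r"attn|indexer|compressor"), "attention"),
]

def pick_quant(name, n_dims, template_type, flags=FLAGS):
    """Return the target type for one tensor under the policy described above."""
    if n_dims < 2:
        # 1D tensors keep the template type; the policy never overrides it.
        return template_type
    for pattern, flag in POLICY_RULES:
        if pattern.search(name):
            return flags[flag]
    return flags["dense"]  # everything else that is 2D or higher

print(pick_quant("blk.3.ffn_gate_exps.weight", 3, "f16"))  # routed experts
print(pick_quant("output_norm.weight", 1, "f32"))          # 1D tensor
```

Ordering the rules from specific to general keeps the attention projections from being swallowed by the broader attention rule.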

How the template maps HF names to GGUF names

gen_gguf_template.py uses the same layer_map table as deepseek4-quantize.c. For example:

HF safetensors name	GGUF name
`layers.0.attn.wq_a.weight`	`blk.0.attn_q_a.weight`
`layers.0.attn.wkv.weight`	`blk.0.attn_kv.weight`
`layers.0.ffn.experts.0.w1.weight`	`blk.0.ffn_gate_exps.weight` (all 256 experts stacked)
`layers.0.ffn.shared_experts.w1.weight`	`blk.0.ffn_gate_shexp.weight`
`embed.weight`	`token_embd.weight`
`norm.weight`	`output_norm.weight`

The script also automatically converts the ffn.gate.tid2eid routing table from I64 to I32, which is the only non-F32/F16 tensor type override in the template.
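
The name translation can be pictured as a small table of rewrite rules. The sketch below covers only the six rows shown in the table above and generalizes the layer index from `0` to any integer, which is an assumption. `LAYER_RULES` and `hf_to_gguf_name` are hypothetical names, not the actual `layer_map` from the repository.

```python
import re

# Only the rows from the table above; the real layer_map covers every tensor.
LAYER_RULES = [
    (r"^layers\.(\d+)\.attn\.wq_a\.weight$", r"blk.\1.attn_q_a.weight"),
    (r"^layers\.(\d+)\.attn\.wkv\.weight$", r"blk.\1.attn_kv.weight"),
    # Every per-expert tensor maps to the same stacked GGUF tensor.
    (r"^layers\.(\d+)\.ffn\.experts\.\d+\.w1\.weight$", r"blk.\1.ffn_gate_exps.weight"),
    (r"^layers\.(\d+)\.ffn\.shared_experts\.w1\.weight$", r"blk.\1.ffn_gate_shexp.weight"),
    (r"^embed\.weight$", "token_embd.weight"),
    (r"^norm\.weight$", "output_norm.weight"),
]

def hf_to_gguf_name(hf_name):
    """Translate one HF safetensors tensor name to its GGUF name."""
    for pattern, replacement in LAYER_RULES:
        new_name, n_subs = re.subn(pattern, replacement, hf_name)
        if n_subs:
            return new_name
    raise KeyError(f"no mapping rule for {hf_name!r}")

print(hf_to_gguf_name("layers.0.ffn.experts.17.w1.weight"))
```

Note that the expert index disappears in the output name: the per-expert HF tensors are stacked into a single GGUF tensor per layer.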

4Expert vs 6Expert: What Changed in ds4

Upstream ds4 hardcodes 6 active routed experts per token (n_expert_used = 6). To support this 4Expert model, the branch makes the following changes:

Default changed to 4 — DS4_SHAPE_FLASH.n_expert_used and g_ds4_shape.n_expert_used now default to 4.
Backward compatible — When loading a GGUF with n_expert_used = 6 in its metadata, ds4 preserves 6 at runtime. Old 6-expert GGUF files continue to work.
Template generator — gen_gguf_template.py handles the full tensor mapping, replacing manual template construction.
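
The backward-compatibility rule amounts to "metadata wins, otherwise fall back to the new default". The following Python sketch only illustrates that logic; ds4 itself is not written in Python, the `metadata` dict stands in for parsed GGUF key/values, and the key name used here is a placeholder because the real metadata key is not given in this document.

```python
DEFAULT_N_EXPERT_USED = 4  # new default on the 4expert branch

def resolve_n_expert_used(metadata, key="n_expert_used"):
    """Prefer the value stored in the GGUF metadata, else use the default.

    `key` is a placeholder name, not the real GGUF metadata key.
    """
    value = metadata.get(key)
    return DEFAULT_N_EXPERT_USED if value is None else int(value)

print(resolve_n_expert_used({}))                    # no metadata: default
print(resolve_n_expert_used({"n_expert_used": 6}))  # old 6-expert GGUF
```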

Full details: PR #474

GGUF Evaluation Report: DeepSeek V4 Flash 4Expert Q4_K GGUF (by ds4-eval)

Model: cloudyu/DeepSeek-V4-Flash-4Expert-GGUF
Source safetensors: cloudyu/DeepSeek-V4-Flash-4Expert
Date: 2026-06-29

Summary

Framework	Passed	Total	Pass Rate
AIME 2025	20	25	80%
GPQA Diamond	22	25	88%
SuperGPQA	22	25	88%
COMPSEC	16	17	94.12%
TOTAL	80	92	87%

80 of 92 tests passed. The 12 failures are detailed below.

Evaluation Methodology

All tests were run using ds4-eval, the built-in evaluation tool shipped with the ds4 inference engine. Each test case consists of a prompt and a set of valid ground-truth answers (e.g., A, B, C, D for multiple choice; integer answers for AIME; ranges or enumerations for COMPSEC).

The evaluator feeds the prompt to the model, reads the generated completion, and extracts the final answer using framework-specific parsers. A test passes if the extracted answer matches any of the valid ground-truth values.

Evaluation Frameworks

AIME 2025 (25 tests): American Invitational Mathematics Examination. Integer answers (0–999). Tests mathematical reasoning.
GPQA Diamond (25 tests): Graduate-level multiple-choice science questions. Options A–D. Tests deep domain knowledge.
SuperGPQA (25 tests): Expanded graduate-level multiple choice. Options A–J. Broader and harder than GPQA.
COMPSEC (17 tests): Computer security questions. Answers are integer codes or ranges (e.g., 5, 10-15, 3,13-15). Tests specialized security knowledge.

Scoring Rules

AIME: exact integer match.
GPQA / SuperGPQA: exact option letter match (A–D for GPQA Diamond, A–J for SuperGPQA).
COMPSEC: answer must fall within one of the accepted integer values or ranges.
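
The scoring rules above can be sketched as follows. This is not the ds4-eval implementation: `parse_accepted` and `score` are hypothetical helpers, and the treatment of multi-valued answers (a given answer such as `18,19,20` passes only if every value is accepted) is an assumption that happens to agree with the results reported in this document.

```python
def parse_accepted(spec):
    """Expand an answer spec such as "3,13-15" into a set of integers."""
    accepted = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            accepted.update(range(int(low), int(high) + 1))
        else:
            accepted.add(int(part))
    return accepted

def score(framework, given, correct):
    """Return True if the extracted answer passes under the rules above."""
    if framework == "COMPSEC":
        given_values = parse_accepted(given)
        # Assumption: every value in the given answer must be accepted.
        return bool(given_values) and given_values <= parse_accepted(correct)
    if framework == "AIME":
        return int(given) == int(correct)
    # GPQA Diamond and SuperGPQA: exact option letter match.
    return given.strip().upper() == correct.strip().upper()

print(score("COMPSEC", "18,19,20", "18-20"))
print(score("AIME", "59049", "81"))
```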

Hardware & Build

Component	Detail
Device	Apple M2 Ultra
RAM	192 GiB unified memory
Backend	Metal (ds4 GPU backend)
Operating system	macOS

Build Configuration

git clone https://github.com/yuhai-china/ds4
cd ds4
git checkout 4expert
make -j$(sysctl -n hw.ncpu)

No flags passed — standard release build (-O3 -ffast-math -mcpu=native).

Runtime Configuration

GGUF loaded via memory-mapped I/O. Key runtime parameters from ds4-eval output:

ds4: Metal device Apple M2 Ultra, 192.00 GiB RAM
ds4: Metal 4 tensor API disabled for pre-M5/pre-A19 devices
ds4: drift-patch flags hc_stable=on norm_unify=on kv_raw_f32=off rope_exp2_log2=off
ds4-eval: context auto-sized to 16777 tokens
ds4-eval: context buffers 630.30 MiB
ds4-eval: model shape DeepSeek V4 Flash

No environment variables or config overrides were set; all settings were left at their defaults.

Detailed Results

AIME 2025 (20/25, 80.0%)

#	Test	Given	Correct	Result	Note
3	aime2025-01	70	70	PASSED
6	aime2025-16	468	468	PASSED
9	aime2025-02	588	588	PASSED
12	aime2025-03	16	16	PASSED
15	aime2025-18	82	82	PASSED
18	aime2025-04	117	117	PASSED
21	aime2025-19	106	106	PASSED
24	aime2025-05	279	279	PASSED
27	aime2025-06	504	504	PASSED
30	aime2025-21	293	293	PASSED
33	aime2025-07	5	821	FAILED	gen truncated at 16,000 tok
36	aime2025-22	237	237	PASSED
39	aime2025-08	77	77	PASSED
42	aime2025-09	62	62	PASSED
45	aime2025-24	149	149	PASSED
48	aime2025-10	59049	81	FAILED	gen truncated at 16,000 tok
51	aime2025-25	907	907	PASSED
54	aime2025-26	113	113	PASSED
57	aime2025-12	510	510	PASSED
60	aime2025-27	19	19	PASSED
63	aime2025-13	2	204	FAILED	gen truncated at 16,000 tok
66	aime2025-28	3	248	FAILED	gen truncated at 16,000 tok
69	aime2025-29	104	104	PASSED
72	aime2025-15	0	735	FAILED	gen truncated at 16,000 tok
75	aime2025-30	240	240	PASSED

All 5 AIME failures are generation truncation — the model hit the 16,000 token budget before finishing the chain-of-thought and producing a final answer. The budget was auto-sized by ds4-eval as largest_prompt + 16,000.

GPQA Diamond (22/25, 88.0%)

#	Test	Given	Correct	Result
1	recNu3MXkvWUzHZr9	B	B	PASSED
4	recoiTJPGUmzAkief	C	C	PASSED
7	rec4UqStf9WUVif1f	B	B	PASSED
10	recgI6tUQ7RLJRWGx	B	B	PASSED
13	recDytVnNYZe2HuUU	A	A	PASSED
16	recNFJjE5PPTqVJGv	D	D	PASSED
19	rec2UlKqC6RFHdcro	B	B	PASSED
22	recv7GsQg3f0fvB1f	B	B	PASSED
25	recrHBEJJoDTV05JR	C	C	PASSED
28	recb80OwMgNnceA9t	D	D	PASSED
31	recA1i5ZAh0Uzclxp	C	C	PASSED
34	recqGD3fxPCI59vPQ	B	B	PASSED
37	rechKl68Uc6H7vU0N	A	A	PASSED
40	rec1zl5LvaatzGhFt	B	B	PASSED
43	recTs7qzfJs6kfLUK	A	A	PASSED
46	rec32C1ZEapBnCC0E	C	C	PASSED
49	recZWeueB7lSPR6wN	B	B	PASSED
52	recVvpD8miVjmmyfe	C	C	PASSED
55	recAAJoHMW45Lv5je	D	D	PASSED
58	reckEnrOPFT9Ru7tW	D	C	FAILED
61	rec8nshandHARTkrg	A	A	PASSED
64	recFaL6j8UMhutXrc	A	A	PASSED
67	reczQ4I0VpENdMtIj	A	C	FAILED
70	recWxGU8Q4YReJ1tb	B	C	FAILED
73	recMicVBcqy1xM1jq	B	B	PASSED

SuperGPQA (22/25, 88.0%)

#	Test	Given	Correct	Result
2	001b51d76b4d	C	C	PASSED
5	b7e20eac9876	J	J	PASSED
8	4a1d1780a93f	E	E	PASSED
11	6082513c8dba	A	A	PASSED
14	bebf1ed45ae1	J	J	PASSED
17	7ca71b863277	I	I	PASSED
20	d44b94f77493	E	E	PASSED
23	febe406f44d7	B	B	PASSED
26	31950dc80ded	C	C	PASSED
29	0f14cd17be17	C	C	PASSED
32	cef9bcc08743	J	J	PASSED
35	9f93aa2cfdb5	I	I	PASSED
38	97ad69dda7b2	E	E	PASSED
41	e78e4e539d6f	E	H	FAILED
44	8483667a25e7	A	A	PASSED
47	e5ed76ef9814	A	A	PASSED
50	fd7924876c48	H	H	PASSED
53	6bfe7d19299d	I	I	PASSED
56	e1825d70c584	J	J	PASSED
59	ab430ac3f18e	A	A	PASSED
62	e8c5da5ca406	F	F	PASSED
65	05efdc6fb240	H	H	PASSED
68	ba52e06cbe1a	H	H	PASSED
71	591a77df2132	D	F	FAILED
74	e780f37a5baa	J	H	FAILED

COMPSEC (16/17, 94.1%)

#	Test	Given	Correct	Result
76	compsec-076	20	17-20	PASSED
77	compsec-077	18,19,20	18-20	PASSED
78	compsec-078	11	11	PASSED
79	compsec-079	0	18-19	FAILED
80	compsec-080	5	5-6	PASSED
81	compsec-081	10	10-15	PASSED
82	compsec-082	9,10	9-10	PASSED
83	compsec-083	10	9-11	PASSED
84	compsec-084	7	6-7	PASSED
85	compsec-085	5	5	PASSED
86	compsec-086	3	3,13-15	PASSED
87	compsec-087	8	8,20-22	PASSED
88	compsec-088	11	11	PASSED
89	compsec-089	10	10	PASSED
90	compsec-090	12	12-13	PASSED
91	compsec-091	3	3	PASSED
92	compsec-092	10,11	10-14	PASSED

Failure Analysis (12 failed)

#	Framework	Test	Answer	Expected	Root cause
33	AIME2025	aime2025-07	5	821	Gen truncated at 16k tok
48	AIME2025	aime2025-10	59049	81	Gen truncated at 16k tok
63	AIME2025	aime2025-13	2	204	Gen truncated at 16k tok
66	AIME2025	aime2025-28	3	248	Gen truncated at 16k tok
72	AIME2025	aime2025-15	0	735	Gen truncated at 16k tok
41	SuperGPQA	e78e4e539d6f	E	H	Wrong answer
71	SuperGPQA	591a77df2132	D	F	Wrong answer
74	SuperGPQA	e780f37a5baa	J	H	Wrong answer
58	GPQA Diamond	reckEnrOPFT9Ru7tW	D	C	Wrong answer
67	GPQA Diamond	reczQ4I0VpENdMtIj	A	C	Wrong answer
70	GPQA Diamond	recWxGU8Q4YReJ1tb	B	C	Wrong answer
79	COMPSEC	compsec-079	0	18-19	Wrong answer

Of the 12 failures:

5 are AIME chain-of-thought truncations (context budget = 16,777 tokens, generation budget = 16,000 tokens). The model ran out of tokens before finishing its reasoning, so these may pass with a larger generation budget.
7 are genuine incorrect answers (3 GPQA, 3 SuperGPQA, 1 COMPSEC).

Excluding truncation failures, the pass rate is 80/87 = 92.0%.
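
The headline numbers can be rechecked from the per-framework counts. The sketch below uses the pass and total counts taken from the rows of the detailed result tables; the variable names are illustrative.

```python
# (passed, total) per framework, counted from the rows of the detailed tables.
results = {
    "AIME 2025": (20, 25),
    "GPQA Diamond": (22, 25),
    "SuperGPQA": (22, 25),
    "COMPSEC": (16, 17),
}
truncated = 5  # AIME failures caused by the 16,000-token generation budget

passed = sum(p for p, _ in results.values())
total = sum(t for _, t in results.values())

for name, (p, t) in results.items():
    print(f"{name}: {p}/{t} = {p / t:.1%}")
print(f"TOTAL: {passed}/{total} = {passed / total:.1%}")

# Pass rate when the truncated AIME runs are left out of the denominator.
rate_excluding = passed / (total - truncated)
print(f"Excluding truncations: {passed}/{total - truncated} = {rate_excluding:.1%}")
```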

Reproduction

git clone https://github.com/yuhai-china/ds4 && cd ds4 && git checkout 4expert
make -j$(sysctl -n hw.ncpu)

pip install huggingface_hub
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('cloudyu/DeepSeek-V4-Flash-4Expert-GGUF', 'DeepSeek-V4-Flash-4Expert-Q4K.gguf', local_dir='.')
"

./ds4-eval -m DeepSeek-V4-Flash-4Expert-Q4K.gguf

On Linux, replace the build step with make cpu -j$(nproc). On CUDA systems, use make cuda-generic -j$(nproc).

GGUF

Model size

284B params

Architecture

deepseek4