DeepSeek-V4-Flash-4E

A post-processed variant of DeepSeek-V4-Flash with top_k=4 (four activated experts per token instead of six) for more efficient inference. No fine-tuning was performed.

HuggingFace: autotrust/DeepSeek-V4-Flash-4E

Released by AutoTrust AI Lab · Adapted by Hai Yu (cloudyu)

What is DeepSeek-V4-Flash-4E?

DeepSeek-V4-Flash is a 284B-parameter Mixture-of-Experts (MoE) language model with 13B activated parameters, supporting a context length of one million tokens. The original model uses num_experts_per_tok=6 by default.

DeepSeek-V4-Flash-4E is a post-processed variant of the same model with the number of activated experts per token reduced from 6 to 4. Apart from the truncated routing tables described under Key Changes, all weights are identical. This change:

Reduces routed-expert compute per token by ~33% (4 experts instead of 6; activated parameters drop from ~13B to ~11B)
Reduces generation time by ~8–11% in the tests reported below
Maintains or improves aggregate accuracy on both the code generation (HumanEval) and knowledge (MMLU-Pro) benchmarks reported below
Uses the same FP4 + FP8 mixed precision format as the original

Why top_k=4 Instead of 6?

The original num_experts_per_tok=6 is not a power of 2. In practice, this means:

GPU tensor core utilization is suboptimal for certain MoE dispatch shapes
Memory alignment and warp scheduling are less efficient compared to power-of-2 expert counts
The routing decision per token requires computing softmax over 6 logits instead of 4, introducing unnecessary overhead

Setting top_k to 4 (a power of 2) gives the GPU's SIMT architecture a natural alignment for expert dispatch and attention masking, while activating 33% fewer routed experts per token. In the evaluation below this came with no aggregate accuracy loss, and with a measurable accuracy improvement on many reasoning-heavy tasks.

Key Changes from the Original

Configuration	Original (top_k=6)	This Model (top_k=4)
`num_experts_per_tok`	6	4
Activated params	~13B	~11B
Total params	284B	284B
Routing method	`noaux_tc`	`noaux_tc`
All other weights	identical	identical
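
The relative reductions implied by the table can be checked with a little arithmetic. The sketch below uses only the rounded figures from the table (6 and 4 experts, ~13B and ~11B activated parameters), so the percentages are approximate; the function and variable names are illustrative.

```python
# Rounded figures taken from the table above.
experts_original, experts_reduced = 6, 4
active_params_original_b, active_params_reduced_b = 13.0, 11.0  # billions, approximate


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


expert_cut = relative_reduction(experts_original, experts_reduced)
param_cut = relative_reduction(active_params_original_b, active_params_reduced_b)

print(f"Routed experts per token: -{expert_cut:.1%}")
print(f"Activated parameters:     -{param_cut:.1%}")
```

The activated-parameter cut comes out smaller than the expert cut, which is what one would expect if part of the ~13B activated parameters does not belong to routed experts.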

The tid2eid (expert routing) weight tensors have been truncated from [vocab_size, 6] to [vocab_size, 4]: only the first 4 columns are retained, matching the original training distribution order. No additional training or fine-tuning was performed; apart from this truncation, the change is purely an inference-time configuration change.
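
Keeping the first four columns of a [vocab_size, 6] table is a plain column slice. Here is a minimal sketch with a toy table in plain Python; the values and the helper name `truncate_routing_table` are made up for illustration, and this is not the conversion script used for the checkpoint.

```python
from typing import List


def truncate_routing_table(table: List[List[int]], keep: int) -> List[List[int]]:
    """Keep only the first `keep` expert IDs of every row (hypothetical helper)."""
    if any(len(row) < keep for row in table):
        raise ValueError("every row needs at least `keep` columns")
    return [row[:keep] for row in table]


# Toy stand-in for a [vocab_size, 6] table: 3 "tokens", 6 expert IDs each (made-up values).
toy_table = [
    [12, 7, 40, 3, 25, 9],
    [1, 33, 8, 19, 2, 27],
    [5, 14, 22, 0, 31, 6],
]

truncated = truncate_routing_table(toy_table, keep=4)
print(truncated[0])  # the first row keeps its first four expert IDs, in the same order
```

Because the slice preserves column order, the retained expert IDs for each row are exactly the leading entries of the original row.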

Independent Evaluation Results

Test Environment

Item	Value
Model	DeepSeek-V4-Flash (284B MoE, FP4+FP8 mixed precision)
Engine	vLLM 0.23.0
GPU	Single NVIDIA B300 (274 GB)
KV Cache dtype	fp8
Sampling	temperature=0.0, top_p=0.95
Stop token	`<｜end▁of▁sentence｜>`
Chat format	encoding_dsv4.py, chat mode for MMLU-Pro, thinking mode for HumanEval

HumanEval (Pass@1)

Configuration	Pass@1	Generation Time	Time per Sample
Top_k=4 (this model)	95.73% (157/164)	56.83s	0.35s
Top_k=6 (original)	95.73% (157/164)	64.06s	0.39s

Identical accuracy on code generation: both configurations pass 157 of 164 problems.
Generation time is ~11% lower with top_k=4 (56.83s vs 64.06s), equivalent to ~13% higher throughput; top_k=4 uses ~33% fewer activated experts per forward pass.
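
The Pass@1 and time-per-sample columns follow directly from the raw counts and timings. The sketch below recomputes them from the table values, assuming one generated sample per problem (so Pass@1 is simply the fraction of problems passed); the dictionary layout is illustrative.

```python
TOTAL_PROBLEMS = 164  # size of the HumanEval problem set used above

# Passed problems and total generation time, copied from the table above.
runs = {
    "top_k=4": {"passed": 157, "generation_s": 56.83},
    "top_k=6": {"passed": 157, "generation_s": 64.06},
}

summary = {}
for name, run in runs.items():
    summary[name] = {
        "pass_at_1": run["passed"] / TOTAL_PROBLEMS,
        "seconds_per_sample": run["generation_s"] / TOTAL_PROBLEMS,
    }
    print(
        f"{name}: Pass@1 = {summary[name]['pass_at_1']:.2%}, "
        f"{summary[name]['seconds_per_sample']:.2f}s per sample"
    )
```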

Problem-Level Error Analysis

Each configuration fails 7 of the 164 problems. Four failures are shared (has_close_elements, decode_cyclic, is_nested, order_by_points), suggesting these are inherent model capability limitations rather than routing artifacts; the remaining three failures differ between the two configurations.

Group	Count	Problems
Both fail	4	HumanEval/0, /38, /132, /145
top_k=4 only fails	3	HumanEval/50 (`decode_shift`), /94 (`skjkasdkd`), /116 (`sort_array`)
top_k=6 only fails	3	HumanEval/65 (`circular_shift`), /129 (`minPath`), /160 (`do_algebra`)
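
The three groups in the table are the intersection and the two set differences of the per-configuration failure sets. A short sketch using the HumanEval task numbers listed above:

```python
# Failed HumanEval task numbers per configuration, taken from the table above.
failed_top_k4 = {0, 38, 50, 94, 116, 132, 145}
failed_top_k6 = {0, 38, 65, 129, 132, 145, 160}

both_fail = failed_top_k4 & failed_top_k6      # failed by both configurations
only_top_k4_fails = failed_top_k4 - failed_top_k6
only_top_k6_fails = failed_top_k6 - failed_top_k4

print("both fail:         ", sorted(both_fail))
print("top_k=4 only fails:", sorted(only_top_k4_fails))
print("top_k=6 only fails:", sorted(only_top_k6_fails))
```

Each set has 7 members, consistent with the 157/164 pass rate reported for both configurations.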

MMLU-Pro (Accuracy)

Configuration	Accuracy	Generation Time
Top_k=4 (this model)	41.46% (4988/12032)	78.24s
Top_k=6 (original)	37.77% (4545/12032)	85.16s

+3.68 percentage points higher accuracy across 12,032 questions (443 more correct answers)
~8% shorter generation time (78.24s vs 85.16s)

Category Breakdown

Category	top_k=4	top_k=6	Delta (top_k=4 − top_k=6)
biology	68.62% (492/717)	72.66% (521/717)	−4.04pp
business	39.04% (308/789)	21.67% (171/789)	+17.36pp
chemistry	14.58% (165/1132)	7.16% (81/1132)	+7.42pp
computer science	47.80% (196/410)	44.63% (183/410)	+3.17pp
economics	66.35% (560/844)	65.05% (549/844)	+1.30pp
engineering	25.39% (246/969)	13.21% (128/969)	+12.18pp
health	59.54% (487/818)	63.08% (516/818)	−3.55pp
history	50.13% (191/381)	59.58% (227/381)	−9.45pp
law	33.51% (369/1101)	35.88% (395/1101)	−2.36pp
math	28.13% (380/1351)	15.47% (209/1351)	+12.66pp
other	55.09% (509/924)	56.71% (524/924)	−1.62pp
philosophy	53.91% (269/499)	55.71% (278/499)	−1.80pp
physics	20.32% (264/1299)	14.55% (189/1299)	+5.77pp
psychology	69.17% (552/798)	71.93% (574/798)	−2.76pp
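
The Delta column is the difference in accuracy, in percentage points, between top_k=4 and top_k=6. The sketch below recomputes it for three rows copied from the table; the function name `delta_pp` is illustrative.

```python
# (correct with top_k=4, correct with top_k=6, number of questions), copied from the table above.
counts = {
    "business": (308, 171, 789),
    "math": (380, 209, 1351),
    "history": (191, 227, 381),
}


def delta_pp(correct_k4: int, correct_k6: int, total: int) -> float:
    """Accuracy difference (top_k=4 minus top_k=6) in percentage points."""
    return 100.0 * (correct_k4 - correct_k6) / total


deltas = {name: delta_pp(*row) for name, row in counts.items()}

# Largest gain for top_k=4 first, largest loss last.
for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:10s} {delta:+.2f}pp")
```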

Key observations:

top_k=4 leads in STEM and business: business (+17.36pp), math (+12.66pp), engineering (+12.18pp), chemistry (+7.42pp), physics (+5.77pp), computer science (+3.17pp), with a smaller gain in economics (+1.30pp). These categories require precise numerical computation, formula derivation, or logical reasoning; one possible explanation is that activating fewer experts produces more stable outputs.
top_k=6 leads in humanities, life sciences, and the remaining categories: history (+9.45pp), biology (+4.04pp), health (+3.55pp), psychology (+2.76pp), law (+2.36pp), philosophy (+1.80pp), other (+1.62pp). These categories rely more on knowledge recall and semantic understanding.
Net advantage: top_k=4 correctly answers 1040 questions that top_k=6 gets wrong, while top_k=6 correctly answers only 597 questions that top_k=4 misses, a 1.74× ratio in favor of top_k=4.

Confidence Analysis

top_k=4 produces cleaner output on multiple-choice questions: it is more likely to emit a single letter answer (A-J) directly, whereas top_k=6 occasionally generates verbose or malformed responses that fail to match the extraction regex. Part of the accuracy gap therefore reflects answer formatting rather than knowledge.
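
To illustrate how a strict extraction step can turn a verbose response into a miss, here is a minimal sketch. The pattern and the helper `extract_answer` are hypothetical and are not the regex used in the evaluation scripts; they only assume that a valid response is a single option letter from A to J.

```python
import re
from typing import Optional

# Hypothetical strict pattern: the whole response must be one option letter (A-J),
# optionally wrapped in parentheses and optionally followed by a period.
ANSWER_PATTERN = re.compile(r"\s*\(?([A-J])\)?\.?\s*")


def extract_answer(response: str) -> Optional[str]:
    """Return the option letter, or None if the response is not a bare letter."""
    match = ANSWER_PATTERN.fullmatch(response)
    return match.group(1) if match else None


print(extract_answer("C"))
print(extract_answer(" (F). "))
print(extract_answer("I think the answer is probably B, because..."))
```

Under such a pattern, the third response is scored as unanswered even though it names an option, which is the kind of effect described above.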

Error Intersection Map

                  Both correct        top_k=4 ✓, top_k=6 ✗
                      3948                   1040
                  ┌──────────────┐   ┌──────────────┐
                  │              │   │ math:    200  │
                  │              │   │ business:162  │
                  │              │   │ eng:     150  │
                  │              │   │ physics: 104  │
                  │              │   │ chem:    101  │
                  │              │   │ ...           │
                  └──────────────┘   └──────────────┘
                  Both wrong          top_k=6 ✓, top_k=4 ✗
                      6447                    597
                  ┌──────────────┐   ┌──────────────┐
                  │              │   │ other:   75   │
                  │              │   │ law:     63   │
                  │              │   │ health:  63   │
                  │              │   │ econ:    58   │
                  │              │   │ biology: 54   │
                  │              │   │ ...           │
                  └──────────────┘   └──────────────┘
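
The four cells of the map are enough to recover both overall accuracies and the net advantage. A short sketch using the counts shown above:

```python
# Counts copied from the error intersection map above.
both_correct = 3948
only_top_k4_correct = 1040
only_top_k6_correct = 597
both_wrong = 6447

total = both_correct + only_top_k4_correct + only_top_k6_correct + both_wrong
accuracy_top_k4 = (both_correct + only_top_k4_correct) / total
accuracy_top_k6 = (both_correct + only_top_k6_correct) / total
net_gain = only_top_k4_correct - only_top_k6_correct
ratio = only_top_k4_correct / only_top_k6_correct

print(f"questions:        {total}")
print(f"top_k=4 accuracy: {accuracy_top_k4:.2%}")
print(f"top_k=6 accuracy: {accuracy_top_k6:.2%}")
print(f"net gain:         {net_gain} questions ({net_gain / total:.2%} of all questions)")
print(f"ratio:            {ratio:.2f}x")
```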

Speed Analysis

Phase	top_k=4	top_k=6	Delta (top_k=6 − top_k=4)
Model load	23.85s	26.52s	+2.67s
Engine init	173.65s	185.14s	+11.49s
Generation (HumanEval)	56.83s	64.06s	+7.23s (+12.7%)
Generation (MMLU-Pro)	78.24s	85.16s	+6.92s (+8.8%)

top_k=6 activates 50% more experts per token, yet wall-clock generation time increases by only ~9–13%. This suggests that expert computation accounts for only part of the per-token cost and that GPU compute and memory bandwidth are partially overlapped.
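
The percentages depend on which configuration is taken as the baseline. The sketch below computes both views from the generation times in the table: the extra time top_k=6 needs relative to top_k=4, and the time top_k=4 saves relative to top_k=6.

```python
# Generation times in seconds (top_k=4, top_k=6), copied from the table above.
generation_s = {
    "HumanEval": (56.83, 64.06),
    "MMLU-Pro": (78.24, 85.16),
}

extra_experts = 6 / 4 - 1  # top_k=6 activates 50% more experts than top_k=4

extra_time = {}   # additional time for top_k=6, relative to top_k=4
time_saved = {}   # time saved by top_k=4, relative to top_k=6
for benchmark, (t_k4, t_k6) in generation_s.items():
    extra_time[benchmark] = t_k6 / t_k4 - 1
    time_saved[benchmark] = 1 - t_k4 / t_k6
    print(
        f"{benchmark}: +{extra_experts:.0%} experts -> +{extra_time[benchmark]:.1%} time "
        f"(top_k=4 saves {time_saved[benchmark]:.1%})"
    )
```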

Summary

top_k=4 matches or beats top_k=6 on every aggregate metric measured here: equal HumanEval Pass@1, higher overall MMLU-Pro accuracy, and faster generation, with fewer experts activated per token
The improvement is particularly pronounced on math, engineering, business, chemistry, and physics reasoning tasks
The original top_k=6 retains an advantage in humanities and life sciences categories, ranging from 1.80pp (philosophy) to 9.45pp (history)
For production deployment, top_k=4 is the recommended configuration unless the workload is dominated by those categories

Full evaluation reports, scripts, and raw results are available in the eval/ directory of this repository.

Model Downloads

Model	#Total Params	#Activated Params	Context Length	Precision	Download
DeepSeek-V4-Flash (original)	284B	~13B (top_k=6)	1M	FP4 + FP8 Mixed	HuggingFace
DeepSeek-V4-Flash-4E (this)	284B	~11B (top_k=4)	1M	FP4 + FP8 Mixed	HuggingFace

Chat Template

This release does not include a Jinja-format chat template. Instead, the encoding/ folder provides Python scripts and test cases demonstrating how to encode messages in OpenAI-compatible format into input strings for the model, and how to parse the model's text output. Please refer to the encoding/README.md for full documentation.

A brief example:

import transformers

# encode_messages and parse_message_from_completion_text come from the encoding/ folder
from encoding_dsv4 import encode_messages, parse_message_from_completion_text

messages = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "Hello! I am DeepSeek.", "reasoning_content": "thinking..."},
    {"role": "user", "content": "1+1=?"}
]

# messages -> prompt string (thinking mode)
prompt = encode_messages(messages, thinking_mode="thinking")

# prompt string -> token IDs
tokenizer = transformers.AutoTokenizer.from_pretrained("autotrust/DeepSeek-V4-Flash-4E")
tokens = tokenizer.encode(prompt)

Note: This encoding script is only needed when using the model through HuggingFace Transformers or vLLM directly. Inference engines that natively support the DeepSeek-V4 chat format (e.g., ds4) handle prompt construction internally and do not require it.

How to Run Locally

Please refer to the inference/ folder for detailed instructions on running DeepSeek-V4 locally using the official DeepSeek inference code, including model weight conversion and interactive chat demos.

For local deployment, we recommend setting the sampling parameters to temperature = 1.0, top_p = 1.0. For the Think Max reasoning mode, we recommend setting the context window to at least 384K tokens.
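
As an illustration of the recommended sampling parameters, the sketch below builds a request body for an OpenAI-compatible completions endpoint (for example, one exposed by a vLLM server). This is an assumption about the deployment, not part of the official inference code; the helper name `build_completion_request` and the `max_tokens` value are arbitrary, and the prompt is expected to be already encoded as described under Chat Template.

```python
import json


def build_completion_request(prompt: str, max_tokens: int = 512) -> str:
    """Serialize a completions request using the recommended sampling settings."""
    payload = {
        "model": "autotrust/DeepSeek-V4-Flash-4E",
        "prompt": prompt,           # already-encoded prompt string
        "max_tokens": max_tokens,   # arbitrary example value
        "temperature": 1.0,         # recommended for local deployment
        "top_p": 1.0,               # recommended for local deployment
    }
    return json.dumps(payload, ensure_ascii=False)


request_body = build_completion_request("1+1=?")
print(request_body)
```

The resulting JSON string can be sent as the body of a POST request to the server's completions route.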

License

This repository and the model weights are licensed under the MIT License.

Contact

If you have any questions, please raise an issue on HuggingFace.
