DeepSeek-V4-Flash-4E

A post-processed variant of DeepSeek-V4-Flash with top_k=4 (four activated experts per token) for more efficient inference. No fine-tuning was performed.

HuggingFace: autotrust/DeepSeek-V4-Flash-4E

Released by AutoTrust AI Lab · Adapted by Hai Yu (cloudyu)

What is DeepSeek-V4-Flash-4E?

DeepSeek-V4-Flash is a 284B-parameter Mixture-of-Experts (MoE) language model with 13B activated parameters, supporting a context length of one million tokens. The original model uses num_experts_per_tok=6 by default.

DeepSeek-V4-Flash-4E is a post-processed variant of the same model with the number of activated experts per token reduced from 6 → 4, while keeping all other weights identical. This change:

  • Reduces routed-expert compute by ~33% (4 instead of 6 active experts per forward pass)
  • Reduces generation time by ~8–11% in the evaluations below
  • Matches the original on HumanEval and improves overall MMLU-Pro accuracy in the evaluations below
  • Uses the same FP4 + FP8 mixed precision format as the original

Why top_k=4 Instead of 6?

The original num_experts_per_tok=6 is not a power of 2. In practice, this means:

  • GPU tensor core utilization is suboptimal for certain MoE dispatch shapes
  • Memory alignment and warp scheduling are less efficient compared to power-of-2 expert counts
  • The routing decision per token requires computing softmax over 6 logits instead of 4, introducing unnecessary overhead

Setting top_k to 4 (a power of 2) gives the GPU's SIMT architecture a natural alignment for expert dispatch and attention masking, while activating 33% fewer routed experts per token (about 13B → 11B activated parameters). In the evaluations below this caused no aggregate accuracy degradation, and several reasoning-heavy MMLU-Pro categories improved measurably.

Key Changes from the Original

| Configuration | Original (top_k=6) | This Model (top_k=4) |
|---|---|---|
| num_experts_per_tok | 6 | 4 |
| Activated params | ~13B | ~11B |
| Total params | 284B | 284B |
| Routing method | noaux_tc | noaux_tc |
| All other weights | identical | identical |
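
The activated-parameter figures in this table imply a rough split between always-active and routed parameters. The following is a back-of-the-envelope sketch, assuming activated parameters grow linearly with the number of routed experts (an assumption, not something this release states) and using the rounded ~13B and ~11B figures:

```python
# Assumption: activated = always_active + k * per_expert_slot, fitted to the
# rounded figures from the table (~13B at k=6, ~11B at k=4).
def split_activated(params_a, k_a, params_b, k_b):
    """Solve the two-point linear model for (always_active, per_expert_slot)."""
    per_expert_slot = (params_a - params_b) / (k_a - k_b)
    always_active = params_a - k_a * per_expert_slot
    return always_active, per_expert_slot

always_active, per_slot = split_activated(13.0, 6, 11.0, 4)
expert_reduction = 1 - 4 / 6        # fraction of routed experts removed
param_reduction = 1 - 11.0 / 13.0   # fraction of activated parameters removed

print(f"always-active ~ {always_active:.1f}B, per routed expert ~ {per_slot:.1f}B")
print(f"routed experts: -{expert_reduction:.0%}, activated params: -{param_reduction:.0%}")
```

Under this simplified model, the ~33% figure describes routed-expert compute, while total activated parameters fall by roughly 15%.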

The tid2eid (expert routing) tensors have been truncated from [vocab_size, 6] to [vocab_size, 4]: only the first 4 columns are retained, in their original order. No additional training or fine-tuning was performed; this is purely an inference-time configuration change.
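
The truncation can be illustrated with a minimal sketch. It uses plain Python lists and made-up expert IDs so that it runs without the checkpoint; the helper name `truncate_routing_table` is hypothetical and is not part of this repository. With a tensor library, the equivalent step would be a column slice such as `tensor[:, :4]`.

```python
import json

def truncate_routing_table(tid2eid, new_top_k):
    """Keep the first `new_top_k` expert IDs of every row (one row per token ID)."""
    if any(len(row) < new_top_k for row in tid2eid):
        raise ValueError("new_top_k exceeds the existing row width")
    return [row[:new_top_k] for row in tid2eid]

# Toy table: 3 token IDs with 6 expert IDs each (values are illustrative only).
toy_tid2eid = [
    [12, 7, 3, 41, 9, 30],
    [5, 18, 22, 1, 60, 2],
    [33, 4, 11, 27, 8, 19],
]
truncated = truncate_routing_table(toy_tid2eid, 4)

# Matching config change; num_experts_per_tok is the only key taken from the model card.
config = {"num_experts_per_tok": 6}
config["num_experts_per_tok"] = 4

print(truncated[0])
print(json.dumps(config))
```

Because only trailing columns are dropped, the four remaining expert IDs per token are exactly the first four of the original six.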

Independent Evaluation Results

Test Environment

| Item | Value |
|---|---|
| Model | DeepSeek-V4-Flash (284B MoE, FP4+FP8 mixed precision) |
| Engine | vLLM 0.23.0 |
| GPU | Single NVIDIA B300 (274 GB) |
| KV Cache dtype | fp8 |
| Sampling | temperature=0.0, top_p=0.95 |
| Stop token | <|end▁of▁sentence|> |
| Chat format | encoding_dsv4.py, chat mode for MMLU-Pro, thinking mode for HumanEval |

HumanEval (Pass@1)

| Configuration | Pass@1 | Generation Time | Time per Sample |
|---|---|---|---|
| Top_k=4 (this model) | 95.73% (157/164) | 56.83s | 0.35s |
| Top_k=6 (original) | 95.73% (157/164) | 64.06s | 0.39s |

  • Identical accuracy on code generation: both configurations pass 157/164.
  • ~11% less generation time for top_k=4 (equivalently, top_k=6 takes ~13% longer), with ~33% fewer activated experts per forward pass.

Problem-Level Error Analysis

Each configuration fails 7 problems. Four of them are shared (has_close_elements, decode_cyclic, is_nested, order_by_points), suggesting these are inherent model capability limitations rather than routing artifacts. The remaining three failures differ between the two configurations.

| Group | Count | Problems |
|---|---|---|
| Both fail | 4 | HumanEval/0, /38, /132, /145 |
| top_k=4 only fails | 3 | HumanEval/50 (decode_shift), /94 (skjkasdkd), /116 (sort_array) |
| top_k=6 only fails | 3 | HumanEval/65 (circular_shift), /129 (minPath), /160 (do_algebra) |
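
As a quick consistency check, the failure groups above can be combined with set operations to recover the reported Pass@1 figures. This is a sketch over the task numbers listed in the table, not the evaluation script itself:

```python
# HumanEval task numbers that failed under each configuration (from the table above).
both_fail = {0, 38, 132, 145}
only_k4_fail = {50, 94, 116}
only_k6_fail = {65, 129, 160}

TOTAL = 164  # number of HumanEval problems

fails_k4 = both_fail | only_k4_fail
fails_k6 = both_fail | only_k6_fail

pass_k4 = TOTAL - len(fails_k4)
pass_k6 = TOTAL - len(fails_k6)

print(f"top_k=4: {pass_k4}/{TOTAL} = {pass_k4 / TOTAL:.2%}")
print(f"top_k=6: {pass_k6}/{TOTAL} = {pass_k6 / TOTAL:.2%}")
# Problems on which the two configurations disagree.
print("disagreements:", sorted(fails_k4 ^ fails_k6))
```

The identical pass rate therefore hides six problems on which the two configurations disagree.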

MMLU-Pro (Accuracy)

| Configuration | Accuracy | Generation Time |
|---|---|---|
| Top_k=4 (this model) | 41.46% (4988/12032) | 78.24s |
| Top_k=6 (original) | 37.77% (4545/12032) | 85.16s |

  • +3.68 percentage points higher accuracy across 12,032 questions (443 more correct answers)
  • ~8% faster generation

Category Breakdown

| Category | top_k=4 | top_k=6 | Delta |
|---|---|---|---|
| biology | 68.62% (492/717) | 72.66% (521/717) | −4.04pp |
| business | 39.04% (308/789) | 21.67% (171/789) | +17.36pp |
| chemistry | 14.58% (165/1132) | 7.16% (81/1132) | +7.42pp |
| computer science | 47.80% (196/410) | 44.63% (183/410) | +3.17pp |
| economics | 66.35% (560/844) | 65.05% (549/844) | +1.30pp |
| engineering | 25.39% (246/969) | 13.21% (128/969) | +12.18pp |
| health | 59.54% (487/818) | 63.08% (516/818) | −3.55pp |
| history | 50.13% (191/381) | 59.58% (227/381) | −9.45pp |
| law | 33.51% (369/1101) | 35.88% (395/1101) | −2.36pp |
| math | 28.13% (380/1351) | 15.47% (209/1351) | +12.66pp |
| other | 55.09% (509/924) | 56.71% (524/924) | −1.62pp |
| philosophy | 53.91% (269/499) | 55.71% (278/499) | −1.80pp |
| physics | 20.32% (264/1299) | 14.55% (189/1299) | +5.77pp |
| psychology | 69.17% (552/798) | 71.93% (574/798) | −2.76pp |

Key observations:

  • top_k=4 leads in STEM and business: business (+17.36pp), math (+12.66pp), engineering (+12.18pp), chemistry (+7.42pp), physics (+5.77pp), computer science (+3.17pp). These categories require precise numerical computation, formula derivation, or logical reasoning; one possible explanation is that activating fewer experts produces more stable outputs.
  • top_k=6 leads in humanities and life sciences: history (+9.45pp), biology (+4.04pp), health (+3.55pp), psychology (+2.76pp), law (+2.36pp), philosophy (+1.80pp). These categories rely more on knowledge recall and semantic understanding.
  • Net advantage: top_k=4 correctly answers 1040 questions that top_k=6 gets wrong, while top_k=6 correctly answers only 597 questions that top_k=4 misses, a 1.74× ratio in favor of top_k=4.
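
The paired counts reported in this section are enough to recompute the headline accuracies and the net advantage. A small sketch of that arithmetic, using only numbers that appear in this document:

```python
# Paired MMLU-Pro outcomes reported in this section.
both_correct = 3948
only_k4_correct = 1040
only_k6_correct = 597
both_wrong = 6447

total = both_correct + only_k4_correct + only_k6_correct + both_wrong
acc_k4 = (both_correct + only_k4_correct) / total
acc_k6 = (both_correct + only_k6_correct) / total
net_gain = only_k4_correct - only_k6_correct

print(f"total questions: {total}")
print(f"top_k=4 accuracy: {acc_k4:.2%}")
print(f"top_k=6 accuracy: {acc_k6:.2%}")
print(f"net gain: {net_gain} questions ({net_gain / total * 100:.2f}pp)")
print(f"discordant ratio: {only_k4_correct / only_k6_correct:.2f}x")
```

Only the 1637 questions on which the configurations disagree contribute to the accuracy gap; the other 10,395 are answered identically (both right or both wrong).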

Confidence Analysis

top_k=4 more consistently produces clean output on multiple-choice questions: it is more likely to emit a single answer letter (A–J) directly, whereas top_k=6 occasionally generates verbose or malformed responses that fail to match the extraction regex. Part of the accuracy gap therefore reflects answer formatting rather than knowledge.
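
To make the formatting effect concrete, here is a sketch of a strict answer extractor. The regex below is hypothetical, since the actual extraction pattern used in the evaluation is not shown in this document; it only illustrates how a verbose response can be scored as wrong even when it contains a plausible letter.

```python
import re

# Hypothetical extraction rule (an assumption, not the evaluation's regex):
# accept a response only if it is a single option letter A-J, optionally
# wrapped in parentheses and/or followed by a period.
ANSWER_RE = re.compile(r"^\s*\(?([A-J])\)?\.?\s*$")

def extract_answer(text):
    """Return the option letter, or None when the response does not match."""
    match = ANSWER_RE.match(text)
    return match.group(1) if match else None

samples = ["C", " (F). ", "The answer is probably C because ...", "AB"]
for sample in samples:
    print(repr(sample), "->", extract_answer(sample))
```

With a rule this strict, any unmatched response counts as incorrect, so a configuration that emits bare letters more often gains measured accuracy independently of what it knows.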

Error Intersection Map

                  Both correct        top_k=4 ✓, top_k=6 ✗
                      3948                   1040
                  ┌──────────────┐   ┌──────────────┐
                  │              │   │ math:    200  │
                  │              │   │ business:162  │
                  │              │   │ eng:     150  │
                  │              │   │ physics: 104  │
                  │              │   │ chem:    101  │
                  │              │   │ ...           │
                  └──────────────┘   └──────────────┘
                  Both wrong          top_k=6 ✓, top_k=4 ✗
                      6447                    597
                  ┌──────────────┐   ┌──────────────┐
                  │              │   │ other:   75   │
                  │              │   │ law:     63   │
                  │              │   │ health:  63   │
                  │              │   │ econ:    58   │
                  │              │   │ biology: 54   │
                  │              │   │ ...           │
                  └──────────────┘   └──────────────┘

Speed Analysis

| Phase | top_k=4 | top_k=6 | Delta (top_k=6 − top_k=4) |
|---|---|---|---|
| Model load | 23.85s | 26.52s | +2.67s |
| Engine init | 173.65s | 185.14s | +11.49s |
| Generation (HumanEval) | 56.83s | 64.06s | +7.23s (+12.7%) |
| Generation (MMLU-Pro) | 78.24s | 85.16s | +6.92s (+8.8%) |

top_k=6 activates 50% more experts per token than top_k=4, but wall-clock generation time increases by only ~9–13%. This is consistent with routed-expert work making up only part of the per-token cost, and with GPU compute and memory traffic being partially overlapped.
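
The relationship between the 50% increase in active experts and the observed slowdown can be made explicit. The sketch below assumes a simple linear cost model in which some fraction of generation time scales with the number of active experts and the rest is fixed; that model is an assumption for illustration, not a measurement from this evaluation.

```python
# Wall-clock generation times in seconds, from the Speed Analysis table.
times = {
    "HumanEval": {"k4": 56.83, "k6": 64.06},
    "MMLU-Pro": {"k4": 78.24, "k6": 85.16},
}

EXPERT_INCREASE = 6 / 4 - 1  # top_k=6 activates 50% more experts than top_k=4

def expert_bound_fraction(t_k4, t_k6):
    """Share of top_k=4 time that scales with expert count (linear-model assumption)."""
    slowdown = t_k6 / t_k4 - 1
    return slowdown / EXPERT_INCREASE

for name, t in times.items():
    slowdown = t["k6"] / t["k4"] - 1   # extra time top_k=6 needs
    saving = 1 - t["k4"] / t["k6"]     # time top_k=4 saves
    frac = expert_bound_fraction(t["k4"], t["k6"])
    print(f"{name}: top_k=6 is {slowdown:.1%} slower; top_k=4 saves {saving:.1%}; "
          f"expert-bound share ~ {frac:.0%}")
```

Under this model, roughly a fifth to a quarter of generation time scales with the number of active experts. The same numbers also show why the speedup is quoted as both ~8–11% (time saved) and ~9–13% (extra time for top_k=6).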

Summary

  • top_k=4 matches or beats top_k=6 on every aggregate metric reported here: equal HumanEval Pass@1, higher overall MMLU-Pro accuracy, faster generation, and fewer expert weights read per token
  • The improvement is particularly pronounced on math, engineering, business, chemistry, and physics reasoning tasks
  • The original top_k=6 retains an edge of 1.6–9.5pp in humanities and life-science categories, largest in history
  • For production deployment, top_k=4 is the recommended configuration

Full evaluation reports, scripts, and raw results are available in the eval/ directory of this repository.

Model Downloads

| Model | #Total Params | #Activated Params | Context Length | Precision | Download |
|---|---|---|---|---|---|
| DeepSeek-V4-Flash (original) | 284B | ~13B (top_k=6) | 1M | FP4 + FP8 Mixed | HuggingFace |
| DeepSeek-V4-Flash-4E (this) | 284B | ~11B (top_k=4) | 1M | FP4 + FP8 Mixed | HuggingFace |

Chat Template

This release does not include a Jinja-format chat template. Instead, the encoding/ folder provides Python scripts and test cases demonstrating how to encode messages in OpenAI-compatible format into input strings for the model, and how to parse the model's text output. Please refer to the encoding/README.md for full documentation.

A brief example:

from encoding_dsv4 import encode_messages, parse_message_from_completion_text

messages = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "Hello! I am DeepSeek.", "reasoning_content": "thinking..."},
    {"role": "user", "content": "1+1=?"}
]

# messages -> string
prompt = encode_messages(messages, thinking_mode="thinking")

# string -> tokens
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("autotrust/DeepSeek-V4-Flash-4E")
tokens = tokenizer.encode(prompt)

Note: This encoding script is only needed when using the model through HuggingFace Transformers or vLLM directly. Inference engines that natively support the DeepSeek-V4 chat format (e.g., ds4) handle prompt construction internally and do not require it.

How to Run Locally

Please refer to the inference/ folder for detailed instructions on running DeepSeek-V4 locally using the official DeepSeek inference code, including model weight conversion and interactive chat demos.

For local deployment, we recommend setting the sampling parameters to temperature = 1.0, top_p = 1.0. For the Think Max reasoning mode, we recommend setting the context window to at least 384K tokens.

License

This repository and the model weights are licensed under the MIT License.

Contact

If you have any questions, please raise an issue on HuggingFace.
