DeepSeek-V4-Flash-4E
A post-processed variant of DeepSeek-V4-Flash that activates 4 experts per token (top_k=4) instead of 6, for faster inference. No fine-tuning was performed.
HuggingFace: autotrust/DeepSeek-V4-Flash-4E
Released by AutoTrust AI Lab · Adapted by Hai Yu (cloudyu)
What is DeepSeek-V4-Flash-4E?
DeepSeek-V4-Flash is a 284B-parameter Mixture-of-Experts (MoE) language model with 13B activated parameters, supporting a context length of one million tokens. The original model uses num_experts_per_tok=6 by default.
DeepSeek-V4-Flash-4E is a post-processed variant of the same model with the number of activated experts per token reduced from 6 to 4, while keeping all other weights identical. In the evaluation reported below, this change:
- Reduces routed-expert compute per token by ~33% (4 active experts per forward pass instead of 6)
- Shortens wall-clock generation time by ~8–11% (roughly 9–13% higher throughput)
- Matches the original on HumanEval and improves aggregate MMLU-Pro accuracy, although some MMLU-Pro categories regress
- Uses the same FP4 + FP8 mixed precision format as the original
Why top_k=4 Instead of 6?
The original num_experts_per_tok=6 is not a power of 2. In practice, this means:
- GPU tensor core utilization is suboptimal for certain MoE dispatch shapes
- Memory alignment and warp scheduling are less efficient compared to power-of-2 expert counts
- Per-token routing has to select and weight 6 experts instead of 4, which adds a small overhead
Setting top_k to 4 (a power of 2) gives expert dispatch a more natural alignment on the GPU's SIMT architecture, while activating 33% fewer routed experts per token (about 15% fewer activated parameters, ~13B vs. ~11B). In the evaluation below this causes no aggregate accuracy degradation, and on several reasoning-heavy MMLU-Pro categories accuracy improves measurably.
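To make the effect of the setting concrete, here is a minimal sketch of generic top-k expert selection for a single token. It is an illustration only: the function `route_top_k`, the 16-expert toy size, and the random scores are assumptions, and it does not reproduce the model's actual noaux_tc routing.

```python
import random

def route_top_k(scores, k):
    """Return (expert_ids, weights) for the k highest-scoring experts of one token."""
    # Expert indices sorted by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    expert_ids = ranked[:k]
    total = sum(scores[i] for i in expert_ids)
    # Renormalize so the selected experts' weights sum to 1.
    weights = [scores[i] / total for i in expert_ids]
    return expert_ids, weights

random.seed(0)
scores = [random.random() for _ in range(16)]  # 16 hypothetical experts

ids6, w6 = route_top_k(scores, k=6)
ids4, w4 = route_top_k(scores, k=4)

# The experts chosen at k=4 are the 4 best of those chosen at k=6.
print(ids4 == ids6[:4])
# Fraction of routed-expert work that remains after the change.
print(len(ids4) / len(ids6))
```

In this sketch, lowering k drops the two lowest-ranked of the six selected experts and renormalizes the remaining weights, so two thirds of the routed-expert work remains.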
Key Changes from the Original
| Configuration | Original (top_k=6) | This Model (top_k=4) |
|---|---|---|
| num_experts_per_tok | 6 | 4 |
| Activated params | ~13B | ~11B |
| Total params | 284B | 284B |
| Routing method | noaux_tc | noaux_tc |
| All other weights | identical | identical |
The tid2eid (expert routing) weight tensors have been truncated from [vocab_size, 6] to [vocab_size, 4]: only the first 4 columns are retained, in their original order. No additional training or fine-tuning was performed; apart from this truncation, this is purely an inference-time configuration change.
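The truncation described above amounts to keeping the first 4 columns of each row. A minimal sketch on a toy table follows; the helper name `truncate_routing_table` and the expert ids are made up for illustration, and the real tensors live in the checkpoint files rather than in Python lists.

```python
def truncate_routing_table(table, k):
    """Keep the first k expert ids of every row, preserving their order."""
    return [row[:k] for row in table]

# Toy stand-in for a [vocab_size, 6] table: 3 token ids, 6 expert ids each.
# The values are invented for this example.
tid2eid_6 = [
    [12, 3, 7, 40, 9, 21],
    [5, 18, 2, 33, 11, 0],
    [27, 6, 14, 1, 30, 8],
]

tid2eid_4 = truncate_routing_table(tid2eid_6, k=4)

print(tid2eid_4[0])                        # [12, 3, 7, 40]
print(len(tid2eid_4), len(tid2eid_4[0]))   # 3 4
```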
Independent Evaluation Results
Test Environment
| Item | Value |
|---|---|
| Model | DeepSeek-V4-Flash (284B MoE, FP4+FP8 mixed precision) |
| Engine | vLLM 0.23.0 |
| GPU | Single NVIDIA B300 (274 GB) |
| KV Cache dtype | fp8 |
| Sampling | temperature=0.0, top_p=0.95 |
| Stop token | <|end▁of▁sentence|> |
| Chat format | encoding_dsv4.py, chat mode for MMLU-Pro, thinking mode for HumanEval |
HumanEval (Pass@1)
| Configuration | Pass@1 | Generation Time | Time per Sample |
|---|---|---|---|
| Top_k=4 (this model) | 95.73% (157/164) | 56.83s | 0.35s |
| Top_k=6 (original) | 95.73% (157/164) | 64.06s | 0.39s |
- Identical accuracy on code generation: both configurations pass 157/164 problems, although the sets of failed problems differ (see the error analysis below).
- ~11–13% faster generation (top_k=4 uses ~33% fewer activated experts per forward pass).
Problem-Level Error Analysis
The two configurations share 4 failing problems (has_close_elements, decode_cyclic, is_nested, order_by_points), suggesting these reflect inherent model capability limitations rather than routing artifacts. Each configuration additionally fails 3 problems that the other one solves.
| Group | Count | Problems |
|---|---|---|
| Both fail | 4 | HumanEval/0, /38, /132, /145 |
| top_k=4 only fails | 3 | HumanEval/50 (decode_shift), /94 (skjkasdkd), /116 (sort_array) |
| top_k=6 only fails | 3 | HumanEval/65 (circular_shift), /129 (minPath), /160 (do_algebra) |
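The table can be cross-checked against the reported Pass@1 with a few set operations. This sketch only restates the problem IDs listed above; the variable names are illustrative.

```python
TOTAL = 164  # number of HumanEval problems

# Failing problem IDs, copied from the table above.
both_fail = {0, 38, 132, 145}
only_k4_fails = {50, 94, 116}
only_k6_fails = {65, 129, 160}

k4_fail = both_fail | only_k4_fails
k6_fail = both_fail | only_k6_fails

for name, failed in (("top_k=4", k4_fail), ("top_k=6", k6_fail)):
    passed = TOTAL - len(failed)
    print(f"{name}: {passed}/{TOTAL} = {passed / TOTAL:.2%}")

# Problems on which the two configurations disagree.
print(sorted(k4_fail ^ k6_fail))  # [50, 65, 94, 116, 129, 160]
```

Both configurations end up at 157/164 even though they disagree on six problems, so equal Pass@1 does not mean identical behavior.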
MMLU-Pro (Accuracy)
| Configuration | Accuracy | Generation Time |
|---|---|---|
| Top_k=4 (this model) | 41.46% (4988/12032) | 78.24s |
| Top_k=6 (original) | 37.77% (4545/12032) | 85.16s |
- +3.68 percentage points higher accuracy (443 more correct answers out of 12,032 questions)
- ~8% faster generation
Category Breakdown
| Category | top_k=4 | top_k=6 | Delta |
|---|---|---|---|
| biology | 68.62% (492/717) | 72.66% (521/717) | −4.04pp |
| business | 39.04% (308/789) | 21.67% (171/789) | +17.36pp |
| chemistry | 14.58% (165/1132) | 7.16% (81/1132) | +7.42pp |
| computer science | 47.80% (196/410) | 44.63% (183/410) | +3.17pp |
| economics | 66.35% (560/844) | 65.05% (549/844) | +1.30pp |
| engineering | 25.39% (246/969) | 13.21% (128/969) | +12.18pp |
| health | 59.54% (487/818) | 63.08% (516/818) | −3.55pp |
| history | 50.13% (191/381) | 59.58% (227/381) | −9.45pp |
| law | 33.51% (369/1101) | 35.88% (395/1101) | −2.36pp |
| math | 28.13% (380/1351) | 15.47% (209/1351) | +12.66pp |
| other | 55.09% (509/924) | 56.71% (524/924) | −1.62pp |
| philosophy | 53.91% (269/499) | 55.71% (278/499) | −1.80pp |
| physics | 20.32% (264/1299) | 14.55% (189/1299) | +5.77pp |
| psychology | 69.17% (552/798) | 71.93% (574/798) | −2.76pp |
Key observations:
- top_k=4 leads in STEM and business: business (+17.36pp), math (+12.66pp), engineering (+12.18pp), chemistry (+7.42pp), physics (+5.77pp), computer science (+3.17pp), with a small edge in economics (+1.30pp). These categories require precise numerical computation, formula derivation, or logical reasoning; the results are consistent with fewer active experts producing more stable outputs on such tasks.
- top_k=6 leads in humanities and life sciences: history (+9.45pp), biology (+4.04pp), health (+3.55pp), psychology (+2.76pp), law (+2.36pp), philosophy (+1.80pp), and other (+1.62pp). These categories rely more on knowledge recall and semantic understanding.
- Net advantage: top_k=4 correctly answers 1040 questions that top_k=6 gets wrong, while top_k=6 correctly answers only 597 questions that top_k=4 misses, a 1.74× advantage for top_k=4.
Confidence Analysis
top_k=4 consistently produces cleaner output on multiple-choice questions: it is more likely to emit a single letter answer (A-J) directly, whereas top_k=6 occasionally generates verbose or malformed responses that fail to match the extraction regex. Part of the accuracy gap is therefore due to answer formatting rather than to knowledge or reasoning.
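To illustrate how answer formatting can affect the score, the sketch below uses a strict single-letter extractor. The pattern `ANSWER_RE` and the function `extract_answer` are hypothetical and are not taken from the evaluation scripts, whose actual regex is not shown here.

```python
import re

# Hypothetical pattern: the whole response must be one option letter (A-J),
# optionally wrapped in parentheses and optionally followed by a period.
ANSWER_RE = re.compile(r"^\s*\(?([A-J])\)?\.?\s*$")

def extract_answer(response):
    """Return the option letter, or None if the response is not a bare letter."""
    match = ANSWER_RE.match(response)
    return match.group(1) if match else None

print(extract_answer("C"))       # C
print(extract_answer(" (B). "))  # B
# A verbose response is not matched and would be scored as wrong,
# even if the letter it mentions is correct.
print(extract_answer("Let me think. The answer might be C because..."))  # None
```

With an extractor of this kind, a correct but verbose response counts as an error, which is how output style can move measured accuracy.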
Error Intersection Map
| Outcome | Questions | Largest categories |
|---|---|---|
| Both correct | 3948 | |
| top_k=4 ✓, top_k=6 ✗ | 1040 | math: 200, business: 162, engineering: 150, physics: 104, chemistry: 101, ... |
| Both wrong | 6447 | |
| top_k=6 ✓, top_k=4 ✗ | 597 | other: 75, law: 63, health: 63, economics: 58, biology: 54, ... |
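The four counts above are enough to recover the headline MMLU-Pro numbers. The following sketch does only that arithmetic; the variable names are illustrative.

```python
# Counts from the error intersection map above.
both_correct = 3948
only_k4_correct = 1040   # top_k=4 right, top_k=6 wrong
only_k6_correct = 597    # top_k=6 right, top_k=4 wrong
both_wrong = 6447

total = both_correct + only_k4_correct + only_k6_correct + both_wrong
k4_correct = both_correct + only_k4_correct
k6_correct = both_correct + only_k6_correct

print(total)                                                 # 12032
print(f"{k4_correct / total:.2%}")                           # 41.46%
print(f"{k6_correct / total:.2%}")                           # 37.77%
print(f"{(k4_correct - k6_correct) / total * 100:.2f}pp")    # 3.68pp
print(f"{only_k4_correct / only_k6_correct:.2f}x")           # 1.74x
```

The unrounded accuracy gap is 443/12032, or 3.68 percentage points; subtracting the two rounded percentages gives 3.69.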
Speed Analysis
| Phase | top_k=4 | top_k=6 | Delta (top_k=6 − top_k=4) |
|---|---|---|---|
| Model load | 23.85s | 26.52s | +2.67s |
| Engine init | 173.65s | 185.14s | +11.49s |
| Generation (HumanEval) | 56.83s | 64.06s | +7.23s (+12.7%) |
| Generation (MMLU-Pro) | 78.24s | 85.16s | +6.92s (+8.8%) |
top_k=6 activates 50% more experts per token, but wall-clock generation time increases by only ~9–13%. This is consistent with routed-expert compute being only one part of the per-token cost and with GPU compute and memory bandwidth being partially overlapped.
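The percentages quoted in this document depend on which configuration is used as the baseline. The sketch below recomputes both views from the generation times in the table; the helper `compare` is illustrative.

```python
# Generation times in seconds, copied from the table: (top_k=4, top_k=6).
times = {"HumanEval": (56.83, 64.06), "MMLU-Pro": (78.24, 85.16)}

def compare(t4, t6):
    """Return (slowdown of top_k=6 vs top_k=4, time saved by top_k=4 vs top_k=6)."""
    slowdown = (t6 - t4) / t4
    saving = (t6 - t4) / t6
    return slowdown, saving

results = {name: compare(t4, t6) for name, (t4, t6) in times.items()}

for name, (slowdown, saving) in results.items():
    print(f"{name}: top_k=6 is {slowdown:.1%} slower; top_k=4 saves {saving:.1%}")

# Relative increase in routed experts per token when going from 4 to 6.
extra_experts = (6 - 4) / 4
print(f"extra routed experts at top_k=6: {extra_experts:.0%}")
```

Measured against top_k=4, top_k=6 is about 8.8–12.7% slower; measured against top_k=6, top_k=4 saves about 8.1–11.3% of generation time. Both are well below the 50% increase in routed experts.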
Summary
- top_k=4 matches or beats top_k=6 on every aggregate metric measured here: equal HumanEval Pass@1, higher MMLU-Pro accuracy, and faster generation
- The improvement is particularly pronounced on math, engineering, business, chemistry, and physics reasoning tasks
- The original top_k=6 retains an advantage in humanities and life-sciences categories, ranging from 1.62pp (other) to 9.45pp (history)
- For production deployment, top_k=4 is the recommended configuration
Full evaluation reports, scripts, and raw results are available in the eval/ directory of this repository.
Model Downloads
| Model | #Total Params | #Activated Params | Context Length | Precision | Download |
|---|---|---|---|---|---|
| DeepSeek-V4-Flash (original) | 284B | ~13B (top_k=6) | 1M | FP4 + FP8 Mixed | HuggingFace |
| DeepSeek-V4-Flash-4E (this) | 284B | ~11B (top_k=4) | 1M | FP4 + FP8 Mixed | HuggingFace |
Chat Template
This release does not include a Jinja-format chat template. Instead, the encoding/ folder provides Python scripts and test cases demonstrating how to encode messages in OpenAI-compatible format into input strings for the model, and how to parse the model's text output. Please refer to the encoding/README.md for full documentation.
A brief example:
from encoding_dsv4 import encode_messages, parse_message_from_completion_text
messages = [
{"role": "user", "content": "hello"},
{"role": "assistant", "content": "Hello! I am DeepSeek.", "reasoning_content": "thinking..."},
{"role": "user", "content": "1+1=?"}
]
# messages -> string
prompt = encode_messages(messages, thinking_mode="thinking")
# string -> tokens
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("autotrust/DeepSeek-V4-Flash-4E")
tokens = tokenizer.encode(prompt)
Note: This encoding script is only needed when using the model through HuggingFace Transformers or vLLM directly. Inference engines that natively support the DeepSeek-V4 chat format (e.g., ds4) handle prompt construction internally and do not require it.
How to Run Locally
Please refer to the inference/ folder for detailed instructions on running DeepSeek-V4 locally using the official DeepSeek inference code, including model weight conversion and interactive chat demos.
For local deployment, we recommend setting the sampling parameters to temperature = 1.0, top_p = 1.0. For the Think Max reasoning mode, we recommend setting the context window to at least 384K tokens.
License
This repository and the model weights are licensed under the MIT License.
Contact
If you have any questions, please raise an issue on HuggingFace.