I'm confused (v4 Flash vs v3.2)
I'm not sure if this is because of the new math magic under the hood or just post-training (maybe too much training data from GPT-5's fake sycophancy?), but the model feels more toxically positive, lazier, and set on doing things its own way.
v3.2 followed custom reply formats and fairly complex rules pretty well, in about 7.5 out of 10 cases, I would say. It did so without much "interest or passion" in the process, but at least it stayed on point.
v4 behaves like a lazy junior and ignores instructions both in the System Prompt and at the top of the Context. The output feels like a minimal "good enough", done according to the model's own vision. If you catch it and ask about its logic and reasons, it will just reply with an "Oopsy, you are absolutely right! I decided to lean into my own reasoning despite the explicit guidelines. My bad, let's do this correctly." (And then it repeats the incorrect output anyway. Honestly, I would fire such a worker xD)
Despite the massive 1M context, v4 cannot follow 3k of guidelines. And I'm horrified to think how it will behave in larger projects and agentic workflows, where precision and predictability are paramount.
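For anyone who wants to put a number on this instead of going by feel, here is a minimal sketch of a compliance check. The reply format (a `[STATUS]` line first, then a `[REPLY]` line somewhere after it) and the names `follows_format` and `compliance_rate` are made up for illustration, so swap in your own rules. It only scores replies you have already collected; it does not call any model.

```python
import re

# Hypothetical format rule: the reply must start with a "[STATUS] ..." line
# and contain a "[REPLY] ..." line somewhere after it.
STATUS_LINE = re.compile(r"^\[STATUS\] .+")
REPLY_LINE = re.compile(r"^\[REPLY\] .+", re.MULTILINE)


def follows_format(reply: str) -> bool:
    """Return True if the reply obeys the (hypothetical) format rule."""
    text = reply.strip()
    first = STATUS_LINE.match(text)
    if first is None:
        return False
    # Look for the "[REPLY]" line only after the end of the status line.
    return REPLY_LINE.search(text, first.end()) is not None


def compliance_rate(replies: list[str]) -> float:
    """Fraction of replies that follow the format (0.0 for an empty list)."""
    if not replies:
        return 0.0
    return sum(follows_format(r) for r in replies) / len(replies)
```

Running the same prompt N times against each model and comparing the two rates would give a rough, like-for-like version of the "7.5/10" estimate above.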
Another case (Creative Writing):
- v3.2 already had problems with following the System Prompt and Tone Examples/Guidelines, constantly leaning into neutral, sanitized, and positive patterns and bending circumstances to save the protagonist from trouble. At least it was kinda steerable if you placed the tone stuff closer to the end of the context.
- v4, again, has a mind of its own and is absolutely out of control, with tons of guardrails and toxic positivity biases. Same as with the guidelines, it prioritizes its own logic instead of doing what it is told. You need to catch and redirect it CONSTANTLY.
Also, maybe because this is a smaller model, its creativity is kinda lacking.
There was a fun example with a Text Adventure scenario, where 20(!) retries of an ability unlock granted the same "vision/sense" crap in different flavours because the model decided that this would be the most "beneficial" and "balanced" at the moment... despite dozens of instructions and examples telling it to go crazy with abilities. That is something v3.2 and other models do nearly flawlessly.
I tested both v4s, and I can say that I'm tired, boss.
v4 kills all interest in using it. You need to constantly fight it, and then the results are just not worth it.
And I'm not talking about censorship or something illegal here. No, I mean basic stuff that the previous model can do pretty well.
I hope that my little rant will help to pinpoint the problem for future releases.
When you write "v4", you need to specify which version you mean. The v4 Flash you are writing about, which is enabled by default in the deepseek.com interface, has 284B total parameters and only 13B active parameters, fewer than half as many as the previous v3 model. So it is not surprising that a model scaled down this much does not outperform the previous model, which was twice as large.
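To make the size comparison concrete, here is the arithmetic as a small sketch. The v4 Flash numbers are the ones quoted above. The v3 numbers (671B total, 37B active) are an assumption taken from the public DeepSeek-V3 model card, not from this thread, and I am assuming v3.2 is roughly the same size.

```python
# v4 Flash figures as quoted in this thread (billions of parameters).
V4_FLASH_TOTAL_B = 284
V4_FLASH_ACTIVE_B = 13

# ASSUMPTION: v3-generation figures from the public DeepSeek-V3 model card;
# they are not stated in this thread.
V3_TOTAL_B = 671
V3_ACTIVE_B = 37

# How many times larger the older model is, by total and by active parameters.
total_ratio = V3_TOTAL_B / V4_FLASH_TOTAL_B
active_ratio = V3_ACTIVE_B / V4_FLASH_ACTIVE_B

print(f"total parameters:  v3 is {total_ratio:.1f}x v4 Flash")
print(f"active parameters: v3 is {active_ratio:.1f}x v4 Flash")
```

Under those assumed figures both ratios come out above 2, which matches the "more than twice" wording.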