Tokenizer config may cause incorrect detokenization in vLLM/Transformers: `Ġ` and `Ċ` appear in generated text
Hi, I encountered a tokenizer/detokenization issue when serving deepseek-ai/DeepSeek-R1-Distill-Qwen-7B with vLLM.
Problem
When I deploy the model with vLLM and call the OpenAI-compatible /v1/chat/completions endpoint, the returned message.content contains byte-level token artifacts such as Ġ and Ċ.
Example actual output:
ĊĊHello!ĠI'mĠDeepSeek-R1,ĠanĠartificialĠintelligenceĠassistantĠcreatedĠbyĠDeepSeek.ĠForĠcomprehensiveĠdetailsĠaboutĠourĠmodelsĠandĠproducts,ĠweĠinviteĠyouĠtoĠconsultĠourĠofficialĠdocumentation.
Expected output:
Hello! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. For comprehensive details about our models and products, we invite you to consult our official documentation.
The same issue also appears in the reasoning output:
ĊAlright,ĠletĠmeĠtryĠtoĠfigureĠoutĠwhat'sĠgoingĠon...
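For context, Ġ and Ċ are how GPT-2-style byte-level BPE vocabularies spell the space and newline bytes, so seeing them in message.content suggests the byte-level decoding step was skipped. The sketch below rebuilds that byte-to-character table and reverses it on a fragment of the output above. It is only an illustration: `bytes_to_unicode` mirrors the well-known GPT-2 helper, `decode_byte_level` is a hypothetical name, and neither is a vLLM or Transformers API.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map each byte 0..255 to a printable character (GPT-2 byte-level scheme)."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Remaining bytes (space, newline, control bytes, ...) are shifted to U+0100 and up.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def decode_byte_level(text: str) -> str:
    """Turn byte-level token text (with Ġ, Ċ, ...) back into ordinary text.

    Raises KeyError if the input contains a character outside the 256-entry table.
    """
    raw = bytes(CHAR_TO_BYTE[ch] for ch in text)
    return raw.decode("utf-8", errors="replace")


print(repr(BYTE_TO_CHAR[ord(" ")]))   # 'Ġ'
print(repr(BYTE_TO_CHAR[ord("\n")]))  # 'Ċ'
print(repr(decode_byte_level("ĊĊHello!ĠI'mĠDeepSeek-R1")))
```

Running the garbled response through `decode_byte_level` is a diagnostic, not a fix: it only confirms that the tokens themselves are correct and that the problem lies in detokenization.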
Environment
Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
Serving framework: vLLM
Transformers version: 5.7.0
Tokenizers version: 0.22.2
vLLM command:
vllm serve /mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name deepseek-r1-distill-qwen-7b \
  --trust-remote-code \
  --dtype auto \
  --max-model-len 16384 \
  --reasoning-parser deepseek_r1
Request:
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-qwen-7b",
    "messages": [
      {"role": "user", "content": "你好,简单介绍一下你自己"}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 1024
  }'
Actual response contains:
{
  "message": {
    "role": "assistant",
    "content": "ĊĊHello!ĠI'mĠDeepSeek-R1,ĠanĠartificialĠintelligenceĠassistantĠcreatedĠbyĠDeepSeek...",
    "reasoning": "ĊĊ"
  }
}
Investigation
I checked the local model files:
config.json exists
tokenizer_config.json exists
tokenizer.json exists
The model config is:
config.model_type: qwen2
config.architectures: ['Qwen2ForCausalLM']
The tokenizer config contains:
tokenizer_class: LlamaTokenizerFast
model_max_length: 16384
has_chat_template: True
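The fields listed above can be collected with a small script, which makes it easier to compare a local copy against another download. This is a minimal sketch: `summarize_tokenizer_setup` is a hypothetical helper, the "family mismatch" flag is a naive string heuristic rather than anything Transformers or vLLM checks, and the chat template is only looked up in tokenizer_config.json (it may also live in a separate file).

```python
import json
from pathlib import Path


def summarize_tokenizer_setup(model_dir: str) -> dict:
    """Collect the config fields relevant to the tokenizer-class question."""
    root = Path(model_dir)
    config = json.loads((root / "config.json").read_text(encoding="utf-8"))
    tok_config = json.loads(
        (root / "tokenizer_config.json").read_text(encoding="utf-8")
    )

    model_type = config.get("model_type")
    tokenizer_class = tok_config.get("tokenizer_class")
    # Heuristic only: flag the case where the declared tokenizer class does not
    # mention the model family (e.g. "qwen2" vs "LlamaTokenizerFast").
    family_mismatch = bool(
        model_type
        and tokenizer_class
        and model_type.lower() not in tokenizer_class.lower()
    )
    return {
        "model_type": model_type,
        "architectures": config.get("architectures"),
        "tokenizer_class": tokenizer_class,
        "model_max_length": tok_config.get("model_max_length"),
        "has_chat_template": "chat_template" in tok_config,
        "family_mismatch": family_mismatch,
    }
```

For example, `summarize_tokenizer_setup("/mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B")` would return a dictionary with the values shown above plus the heuristic flag.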
When the tokenizer is loaded through AutoTokenizer, it resolves to the LLaMA tokenizer class, and a simple encode/decode round trip loses the spaces and newlines:
from transformers import AutoTokenizer
model_path = "/mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B"
# Let AutoTokenizer choose the tokenizer class for this directory
tok = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True,
    use_fast=True,
)
print("class:", tok.__class__)
print("is_fast:", getattr(tok, "is_fast", None))
# Round-trip a string that contains newlines and spaces
s = "\n\nHello! I'm DeepSeek-R1, an artificial intelligence assistant."
ids = tok.encode(s)
print("tokens:", tok.convert_ids_to_tokens(ids)[:20])
print("decoded:", repr(tok.decode(ids, skip_special_tokens=True)))
Output:
class: <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>
is_fast: True
tokens: ['Hello', '!I', "'m", 'Deep', 'Seek', '-R', '1', ',', 'an', 'art', 'ificial', 'intelligence', 'assistant', '.']
decoded: "Hello!I'mDeepSeek-R1,anartificialintelligenceassistant."
However, if I explicitly load Qwen2TokenizerFast, decoding becomes correct:
from transformers import Qwen2TokenizerFast
model_path = "/mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B"
# Bypass AutoTokenizer and load the Qwen2 tokenizer class explicitly
tok = Qwen2TokenizerFast.from_pretrained(
    model_path,
    trust_remote_code=True,
)
# Same round-trip string as in the AutoTokenizer test
s = "\n\nHello! I'm DeepSeek-R1, an artificial intelligence assistant."
ids = tok.encode(s)
print("class:", tok.__class__)
print("is_fast:", getattr(tok, "is_fast", None))
print("decoded:", repr(tok.decode(ids, skip_special_tokens=True)))
This produces the expected decoded text with spaces and newlines preserved.
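The comparison between the two tokenizer classes can be turned into a reusable round-trip check. The sketch below is an assumption-laden helper, not part of Transformers: `find_round_trip_failures` is a hypothetical name, and it only relies on the tokenizer exposing `encode(text, add_special_tokens=False)` and `decode(ids, skip_special_tokens=True)`, which Hugging Face tokenizers do. Tokenizers that normalize or clean up text on decode can fail the exact-equality check even when nothing is wrong.

```python
from typing import Iterable


def find_round_trip_failures(tokenizer, samples: Iterable[str]) -> list[tuple[str, str]]:
    """Return (original, decoded) pairs where decode(encode(text)) != text."""
    failures = []
    for text in samples:
        # add_special_tokens=False keeps BOS/EOS tokens out of the comparison.
        ids = tokenizer.encode(text, add_special_tokens=False)
        decoded = tokenizer.decode(ids, skip_special_tokens=True)
        if decoded != text:
            failures.append((text, decoded))
    return failures
```

Calling it once with the AutoTokenizer instance and once with the Qwen2TokenizerFast instance, on a few strings containing spaces and newlines, should make the difference between the two loading paths visible at a glance.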
Workaround
Changing the following field in tokenizer_config.json:
"tokenizer_class": "LlamaTokenizerFast"
to:
"tokenizer_class": "Qwen2TokenizerFast"
fixes the detokenization issue in my environment.
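One way to apply this change without editing the downloaded model files is to copy the tokenizer files into a separate directory and patch the copy. The sketch below assumes that approach; `write_patched_tokenizer_dir` is a hypothetical helper, and the list of files to copy is an assumption about what a tokenizer directory typically contains, not a requirement stated by vLLM or Transformers.

```python
import json
import shutil
from pathlib import Path


def write_patched_tokenizer_dir(
    src_dir: str,
    dst_dir: str,
    tokenizer_class: str = "Qwen2TokenizerFast",
) -> Path:
    """Copy tokenizer files to dst_dir and override tokenizer_class in the copy."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    # Copy whichever tokenizer-related files are present; the source stays untouched.
    for name in ("tokenizer.json", "tokenizer_config.json", "special_tokens_map.json"):
        if (src / name).is_file():
            shutil.copy2(src / name, dst / name)

    # Fails with FileNotFoundError if the source had no tokenizer_config.json.
    config_path = dst / "tokenizer_config.json"
    tok_config = json.loads(config_path.read_text(encoding="utf-8"))
    tok_config["tokenizer_class"] = tokenizer_class
    config_path.write_text(
        json.dumps(tok_config, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return dst
```

The returned directory can then be passed to the serving framework as the tokenizer path, while the model weights are still loaded from the original directory.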
After patching the tokenizer config and starting vLLM with the patched tokenizer directory:
vllm serve /mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name deepseek-r1-distill-qwen-7b \
  --trust-remote-code \
  --dtype auto \
  --max-model-len 16384 \
  --tokenizer /mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B-qwen2tok \
  --tokenizer-mode auto \
  --reasoning-parser deepseek_r1
the output no longer contains Ġ or Ċ.
Question
Since this model has:
model_type: qwen2
architectures: ['Qwen2ForCausalLM']
should tokenizer_config.json use:
"tokenizer_class": "Qwen2TokenizerFast"
instead of:
"tokenizer_class": "LlamaTokenizerFast"
?
Or is the current configuration expected and only compatible with specific Transformers/vLLM versions?
Thanks!