Tokenizer config may cause incorrect detokenization in vLLM/Transformers: `Ġ` and `Ċ` appear in generated text

#22

by AidPaike - opened 23 days ago

Hi, I encountered a tokenizer/detokenization issue when serving deepseek-ai/DeepSeek-R1-Distill-Qwen-7B with vLLM.

Problem

When I deploy the model with vLLM and call the OpenAI-compatible /v1/chat/completions endpoint, the returned message.content contains byte-level token artifacts such as Ġ and Ċ.

Example actual output:

ĊĊHello!ĠI'mĠDeepSeek-R1,ĠanĠartificialĠintelligenceĠassistantĠcreatedĠbyĠDeepSeek.ĠForĠcomprehensiveĠdetailsĠaboutĠourĠmodelsĠandĠproducts,ĠweĠinviteĠyouĠtoĠconsultĠourĠofficialĠdocumentation.

Expected output:

Hello! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. For comprehensive details about our models and products, we invite you to consult our official documentation.

The same issue also appears in the reasoning output:

ĊAlright,ĠletĠmeĠtryĠtoĠfigureĠoutĠwhat'sĠgoingĠon...
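For context, Ġ and Ċ are how byte-level BPE vocabularies display the space byte (0x20) and the newline byte (0x0A). The sketch below rebuilds the byte-to-character table, assuming this tokenizer uses the standard GPT-2 style mapping, and reverses it on the output above. The helper name `undo_byte_level` is made up for illustration and is not part of Transformers or vLLM.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Byte -> printable character table used by GPT-2 style byte-level BPE."""
    # Bytes that are already printable map to themselves.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in keep}
    # Every other byte is shifted to the next free code point from U+0100 upward.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def undo_byte_level(text: str) -> str:
    """Hypothetical helper: map byte-level characters back to UTF-8 text."""
    # Raises KeyError if the text contains characters outside the table.
    raw = bytes(CHAR_TO_BYTE[ch] for ch in text)
    return raw.decode("utf-8", errors="replace")


print(repr(BYTE_TO_CHAR[0x20]), repr(BYTE_TO_CHAR[0x0A]))
print(repr(undo_byte_level("ĊĊHello!ĠI'mĠDeepSeek-R1")))
```

If that assumption holds, the artifacts are raw vocabulary tokens that never went through the byte-level decoding step, which is consistent with the wrong tokenizer class being selected.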

Environment

Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
Serving framework: vLLM
Transformers version: 5.7.0
Tokenizers version: 0.22.2

vLLM command:

vllm serve /mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name deepseek-r1-distill-qwen-7b \
  --trust-remote-code \
  --dtype auto \
  --max-model-len 16384 \
  --reasoning-parser deepseek_r1

Request:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-qwen-7b",
    "messages": [
      {"role": "user", "content": "你好，简单介绍一下你自己"}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 1024
  }'

Actual response contains:

{
  "message": {
    "role": "assistant",
    "content": "ĊĊHello!ĠI'mĠDeepSeek-R1,ĠanĠartificialĠintelligenceĠassistantĠcreatedĠbyĠDeepSeek...",
    "reasoning": "ĊĊ"
  }
}

Investigation

I checked the local model files:

config.json exists
tokenizer_config.json exists
tokenizer.json exists

The model config is:

config.model_type: qwen2
config.architectures: ['Qwen2ForCausalLM']

The tokenizer config contains:

tokenizer_class: LlamaTokenizerFast
model_max_length: 16384
has_chat_template: True
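The fields above can be collected with a short script like the following. It is only a sketch: it assumes `config.json` and `tokenizer_config.json` sit directly in the model directory, and the function name `summarize_model_dir` is invented for this example.

```python
import json
from pathlib import Path


def summarize_model_dir(model_dir: str) -> dict:
    """Collect the config fields relevant to tokenizer selection."""
    root = Path(model_dir)
    config = json.loads((root / "config.json").read_text(encoding="utf-8"))
    tok_config = json.loads(
        (root / "tokenizer_config.json").read_text(encoding="utf-8")
    )
    return {
        "model_type": config.get("model_type"),
        "architectures": config.get("architectures"),
        "tokenizer_class": tok_config.get("tokenizer_class"),
        "model_max_length": tok_config.get("model_max_length"),
        # True when the tokenizer config ships a chat template.
        "has_chat_template": "chat_template" in tok_config,
    }
```

Calling `summarize_model_dir("/mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B")` and printing the result makes the mismatch between `model_type` and `tokenizer_class` easy to spot.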

When loading the tokenizer through AutoTokenizer, it appears to take the LLaMA tokenizer path, and detokenization is incorrect (spaces and newlines are lost):

from transformers import AutoTokenizer

model_path = "/mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B"

tok = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True,
    use_fast=True,
)

print("class:", tok.__class__)
print("is_fast:", getattr(tok, "is_fast", None))

s = "\n\nHello! I'm DeepSeek-R1, an artificial intelligence assistant."
ids = tok.encode(s)

print("tokens:", tok.convert_ids_to_tokens(ids)[:20])
print("decoded:", repr(tok.decode(ids, skip_special_tokens=True)))

Output:

class: <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>
is_fast: True
tokens: ['Hello', '!I', "'m", 'Deep', 'Seek', '-R', '1', ',', 'an', 'art', 'ificial', 'intelligence', 'assistant', '.']
decoded: "Hello!I'mDeepSeek-R1,anartificialintelligenceassistant."

However, if I explicitly load Qwen2TokenizerFast, decoding becomes correct:

from transformers import Qwen2TokenizerFast

model_path = "/mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B"

tok = Qwen2TokenizerFast.from_pretrained(
    model_path,
    trust_remote_code=True,
)

s = "\n\nHello! I'm DeepSeek-R1, an artificial intelligence assistant."
ids = tok.encode(s)

print("class:", tok.__class__)
print("is_fast:", getattr(tok, "is_fast", None))
print("decoded:", repr(tok.decode(ids, skip_special_tokens=True)))

This produces the expected decoded text with spaces and newlines preserved.

Workaround

Changing the following field in tokenizer_config.json:

"tokenizer_class": "LlamaTokenizerFast"

to:

"tokenizer_class": "Qwen2TokenizerFast"

fixes the detokenization issue in my environment.
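One way to apply this change without touching the original files is to copy the tokenizer files into a separate directory and patch the copy. The sketch below is an assumption about how that could be done, not necessarily the exact steps used here. The function name and the list of file names are illustrative, and files that do not exist in the source directory are skipped.

```python
import json
import shutil
from pathlib import Path

# Tokenizer-related files to copy if present (illustrative list).
TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "vocab.json",
    "merges.txt",
)


def write_patched_tokenizer_dir(
    src_dir: str,
    dst_dir: str,
    tokenizer_class: str = "Qwen2TokenizerFast",
) -> Path:
    """Copy tokenizer files to dst_dir and override tokenizer_class there."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for name in TOKENIZER_FILES:
        if (src / name).is_file():
            shutil.copy2(src / name, dst / name)

    # Patch only the copy; the source directory stays unchanged.
    cfg_path = dst / "tokenizer_config.json"
    cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
    cfg["tokenizer_class"] = tokenizer_class
    cfg_path.write_text(
        json.dumps(cfg, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return dst
```

The resulting directory can then be passed to vLLM through `--tokenizer`, while the model weights are still loaded from the original path.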

After patching the tokenizer config and starting vLLM with the patched tokenizer directory:

vllm serve /mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name deepseek-r1-distill-qwen-7b \
  --trust-remote-code \
  --dtype auto \
  --max-model-len 16384 \
  --tokenizer /mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B-qwen2tok \
  --tokenizer-mode auto \
  --reasoning-parser deepseek_r1

the output no longer contains Ġ or Ċ.
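A small check along these lines can confirm that on a parsed response. It is a heuristic sketch: it assumes the `message` object has the `content` and `reasoning` fields shown in the response above, and `find_artifacts` is a made-up helper name. Text that legitimately contains Ġ or Ċ (for example Maltese) would also be flagged.

```python
# Characters that byte-level BPE uses to display space and newline.
ARTIFACTS = ("Ġ", "Ċ")


def find_artifacts(message: dict) -> dict[str, list[str]]:
    """Return, per field, which artifact characters appear in the text."""
    found: dict[str, list[str]] = {}
    for field in ("content", "reasoning"):
        text = message.get(field) or ""
        hits = [ch for ch in ARTIFACTS if ch in text]
        if hits:
            found[field] = hits
    return found


broken = {"role": "assistant", "content": "ĊĊHello!ĠI'm", "reasoning": "ĊĊ"}
fixed = {"role": "assistant", "content": "\n\nHello! I'm", "reasoning": "\n\n"}
print(find_artifacts(broken))
print(find_artifacts(fixed))
```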

Question

Since this model has:

model_type: qwen2
architectures: ['Qwen2ForCausalLM']

should tokenizer_config.json use:

"tokenizer_class": "Qwen2TokenizerFast"

instead of:

"tokenizer_class": "LlamaTokenizerFast"

Or is the current configuration expected and only compatible with specific Transformers/vLLM versions?

Thanks!
