Tokenizer config may cause incorrect detokenization in vLLM/Transformers: `Ġ` and `Ċ` appear in generated text

#22
by AidPaike - opened

Hi, I encountered a tokenizer/detokenization issue when serving deepseek-ai/DeepSeek-R1-Distill-Qwen-7B with vLLM.

Problem

When I deploy the model with vLLM and call the OpenAI-compatible /v1/chat/completions endpoint, the returned message.content contains byte-level token artifacts such as Ġ and Ċ.

Example actual output:

ĊĊHello!ĠI'mĠDeepSeek-R1,ĠanĠartificialĠintelligenceĠassistantĠcreatedĠbyĠDeepSeek.ĠForĠcomprehensiveĠdetailsĠaboutĠourĠmodelsĠandĠproducts,ĠweĠinviteĠyouĠtoĠconsultĠourĠofficialĠdocumentation.

Expected output:

Hello! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. For comprehensive details about our models and products, we invite you to consult our official documentation.

The same issue also appears in the reasoning output:

ĊAlright,ĠletĠmeĠtryĠtoĠfigureĠoutĠwhat'sĠgoingĠon...
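
For context, the Ġ and Ċ characters look like the printable stand-ins that GPT-2 style byte-level BPE uses for the space byte (0x20) and the newline byte (0x0A). The sketch below rebuilds that byte-to-character table and reverses it on a string; it assumes this model's vocabulary follows the same GPT-2 mapping, and the helper names (`bytes_to_unicode`, `undo_byte_level`) are only illustrative, not part of vLLM or Transformers.

```python
def bytes_to_unicode():
    """Byte -> printable-character table in the style of GPT-2 byte-level BPE."""
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte (control characters, space, ...) is shifted to 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def undo_byte_level(text):
    """Map byte-level stand-in characters back to raw bytes and decode as UTF-8."""
    raw = bytearray()
    for ch in text:
        if ch in CHAR_TO_BYTE:
            raw.append(CHAR_TO_BYTE[ch])
        else:
            # Not a stand-in character (assumption: pass it through unchanged).
            raw.extend(ch.encode("utf-8"))
    return raw.decode("utf-8", errors="replace")


print(repr(BYTE_TO_CHAR[0x20]), repr(BYTE_TO_CHAR[0x0A]))
print(repr(undo_byte_level("ĊĊHello!ĠI'mĠDeepSeek-R1")))
```

If that assumption holds, the artifacts mean the byte-level decoding step is being skipped: the token strings are concatenated as-is instead of being mapped back to bytes.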

Environment

Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
Serving framework: vLLM
Transformers version: 5.7.0
Tokenizers version: 0.22.2

vLLM command:

vllm serve /mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name deepseek-r1-distill-qwen-7b \
  --trust-remote-code \
  --dtype auto \
  --max-model-len 16384 \
  --reasoning-parser deepseek_r1

Request:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-qwen-7b",
    "messages": [
      {"role": "user", "content": "你好,简单介绍一下你自己"}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 1024
  }'

Actual response contains:

{
  "message": {
    "role": "assistant",
    "content": "ĊĊHello!ĠI'mĠDeepSeek-R1,ĠanĠartificialĠintelligenceĠassistantĠcreatedĠbyĠDeepSeek...",
    "reasoning": "ĊĊ"
  }
}

Investigation

I checked the local model files:

config.json exists
tokenizer_config.json exists
tokenizer.json exists

The model config is:

config.model_type: qwen2
config.architectures: ['Qwen2ForCausalLM']

The tokenizer config contains:

tokenizer_class: LlamaTokenizerFast
model_max_length: 16384
has_chat_template: True
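
The fields above can be collected with a short script like the following. It is a minimal sketch using only the standard library; the function name `summarize_tokenizer_setup` is hypothetical, and it assumes the chat template is stored under a `chat_template` key in tokenizer_config.json, which may not hold for every checkpoint layout.

```python
import json
from pathlib import Path


def summarize_tokenizer_setup(model_dir):
    """Read config.json and tokenizer_config.json and return the fields of interest."""
    model_dir = Path(model_dir)
    config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
    tok_config = json.loads(
        (model_dir / "tokenizer_config.json").read_text(encoding="utf-8")
    )
    return {
        "model_type": config.get("model_type"),
        "architectures": config.get("architectures"),
        "tokenizer_class": tok_config.get("tokenizer_class"),
        "model_max_length": tok_config.get("model_max_length"),
        # Assumption: the template lives in tokenizer_config.json under this key.
        "has_chat_template": "chat_template" in tok_config,
        "tokenizer_json_exists": (model_dir / "tokenizer.json").exists(),
    }
```

Calling `summarize_tokenizer_setup("/mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B")` and printing the result makes the qwen2 model type and the LLaMA tokenizer class visible side by side.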

When I load the tokenizer through AutoTokenizer, it resolves to the LLaMA tokenizer class, and decoding drops the spaces and newlines:

from transformers import AutoTokenizer

model_path = "/mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B"

tok = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True,
    use_fast=True,
)

print("class:", tok.__class__)
print("is_fast:", getattr(tok, "is_fast", None))

s = "\n\nHello! I'm DeepSeek-R1, an artificial intelligence assistant."
ids = tok.encode(s)

print("tokens:", tok.convert_ids_to_tokens(ids)[:20])
print("decoded:", repr(tok.decode(ids, skip_special_tokens=True)))

Output:

class: <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>
is_fast: True
tokens: ['Hello', '!I', "'m", 'Deep', 'Seek', '-R', '1', ',', 'an', 'art', 'ificial', 'intelligence', 'assistant', '.']
decoded: "Hello!I'mDeepSeek-R1,anartificialintelligenceassistant."

However, if I explicitly load Qwen2TokenizerFast, decoding becomes correct:

from transformers import Qwen2TokenizerFast

model_path = "/mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B"

tok = Qwen2TokenizerFast.from_pretrained(
    model_path,
    trust_remote_code=True,
)

s = "\n\nHello! I'm DeepSeek-R1, an artificial intelligence assistant."
ids = tok.encode(s)

print("class:", tok.__class__)
print("is_fast:", getattr(tok, "is_fast", None))
print("decoded:", repr(tok.decode(ids, skip_special_tokens=True)))

This produces the expected decoded text with spaces and newlines preserved.
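
To compare the two loading paths on more than one string, a small round-trip helper can be used. This is a sketch, not a definitive test: `check_roundtrip` is a hypothetical name, it only relies on the tokenizer exposing `encode` and `decode` as in the snippets above, and exact round-trip equality is an assumption that can fail for tokenizers that normalize their input.

```python
def check_roundtrip(tok, samples):
    """Return (original, decoded) pairs for which encode -> decode changes the text."""
    failures = []
    for s in samples:
        # Skip special tokens on both sides so only the text itself is compared.
        ids = tok.encode(s, add_special_tokens=False)
        decoded = tok.decode(ids, skip_special_tokens=True)
        if decoded != s:
            failures.append((s, decoded))
    return failures
```

With the tokenizers loaded earlier, `check_roundtrip(tok, ["\n\nHello! I'm DeepSeek-R1."])` should return an empty list when decoding is correct and a list of mismatches when whitespace is lost.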

Workaround

Changing the following field in tokenizer_config.json:

"tokenizer_class": "LlamaTokenizerFast"

to:

"tokenizer_class": "Qwen2TokenizerFast"

fixes the detokenization issue in my environment.
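
The patch can also be applied to a copy of the tokenizer files so that the original model directory stays untouched. The following is a minimal sketch: `make_patched_tokenizer_dir` is a hypothetical helper, and it copies only the two tokenizer files named in the investigation above (any other tokenizer files in a given checkpoint are not handled).

```python
import json
import shutil
from pathlib import Path

# Files named in the investigation; extend this tuple if a checkpoint ships more.
TOKENIZER_FILES = ("tokenizer.json", "tokenizer_config.json")


def make_patched_tokenizer_dir(src, dst, new_class="Qwen2TokenizerFast"):
    """Copy tokenizer files from src to dst and rewrite tokenizer_class in the copy.

    Returns the previous tokenizer_class value (or None if it was absent).
    """
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    for name in TOKENIZER_FILES:
        if (src / name).exists():
            shutil.copy2(src / name, dst / name)

    cfg_path = dst / "tokenizer_config.json"
    cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
    old_class = cfg.get("tokenizer_class")
    cfg["tokenizer_class"] = new_class
    cfg_path.write_text(
        json.dumps(cfg, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return old_class
```

For example, `make_patched_tokenizer_dir("/mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B", "/mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B-qwen2tok")` would produce a directory that can be passed to vLLM as the tokenizer path.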

After patching the tokenizer config and pointing vLLM at the patched tokenizer directory with --tokenizer:

vllm serve /mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name deepseek-r1-distill-qwen-7b \
  --trust-remote-code \
  --dtype auto \
  --max-model-len 16384 \
  --tokenizer /mnt/sda1/models/DeepSeek-R1-Distill-Qwen-7B-qwen2tok \
  --tokenizer-mode auto \
  --reasoning-parser deepseek_r1

the output no longer contains Ġ or Ċ.

Question

Since this model has:

model_type: qwen2
architectures: ['Qwen2ForCausalLM']

should tokenizer_config.json use:

"tokenizer_class": "Qwen2TokenizerFast"

instead of:

"tokenizer_class": "LlamaTokenizerFast"

?

Or is the current configuration expected and only compatible with specific Transformers/vLLM versions?

Thanks!
