Bug when tokenizing "<|endoftext|>"

by YeungNLP - opened Aug 4, 2023

Qwen org Aug 4, 2023

When tokenizing "<|endoftext|>", the tokenizer splits it into multiple tokens instead of producing the single token 151643.

Script to reproduce:

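from transformers import AutoTokenizer
# Load the Qwen-7B tokenizer (its custom tokenizer code requires trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)
print('encode <|endoftext|>: {}'.format(tokenizer.encode('<|endoftext|>')))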

Tokenization result:

encode <|endoftext|>: [27, 91, 8691, 723, 427, 91, 29]

I hope the Qwen team can fix this.

Aug 4, 2023

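Hello, this logic is there to prevent injection attacks, so the behavior is as expected. See https://github.com/QwenLM/Qwen-7B/issues/24 .
If needed, you can manually assemble the token_ids and feed them to the model for training. Thank you for your interest.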
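
As a rough illustration of assembling token_ids by hand, here is a minimal sketch. It assumes the ID 151643 reported above for <|endoftext|>. The names `build_training_ids` and `toy_encode` are hypothetical and not part of any Qwen or transformers API. `toy_encode` is only a stand-in so the snippet runs without downloading a model.

```python
from typing import Callable, List

EOS_ID = 151643  # ID of <|endoftext|> as reported in this thread (assumption)


def build_training_ids(
    documents: List[str],
    encode: Callable[[str], List[int]],
    eos_id: int = EOS_ID,
) -> List[int]:
    """Encode each document as ordinary text, then append the EOS ID by hand."""
    ids: List[int] = []
    for doc in documents:
        ids.extend(encode(doc))  # document text goes through the encoder
        ids.append(eos_id)       # separator is added as a raw ID, not as text
    return ids


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in encoder: one ID per UTF-8 byte."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    print(build_training_ids(["hi", "ok"], toy_encode))
```

With a real tokenizer, one would presumably pass its `encode` method in place of `toy_encode`. Whether that method adds special tokens of its own should be checked first.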

jklj077

Qwen org Aug 8, 2023

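Thanks for raising this issue! Although that behavior was expected and is safer by default, we have updated the code. The default behavior has been changed to the approach commonly used in the community, for ease of use. We still recommend enabling the injection attack protections. For more information, see the documentation on GitHub.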
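
To make the trade-off concrete, the sketch below contrasts the two policies with a toy byte-level encoder. It is not the Qwen tokenizer. The function `toy_encode` and its `parse_special` flag are hypothetical names used only for illustration, and the actual switch for the protection is described in the Qwen documentation on GitHub.

```python
from typing import List

SPECIAL = {"<|endoftext|>": 151643}  # ID taken from this thread (assumption)


def toy_encode(text: str, parse_special: bool) -> List[int]:
    """Toy encoder: one ID per UTF-8 byte, optionally parsing special tokens."""
    ids: List[int] = []
    i = 0
    while i < len(text):
        matched = False
        if parse_special:
            # Convenient policy: the surface form becomes the control token.
            for surface, token_id in SPECIAL.items():
                if text.startswith(surface, i):
                    ids.append(token_id)
                    i += len(surface)
                    matched = True
                    break
        if not matched:
            # Protective policy (or ordinary text): emit plain byte IDs.
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids


if __name__ == "__main__":
    user_text = "a<|endoftext|>b"
    print(toy_encode(user_text, parse_special=True))
    print(toy_encode(user_text, parse_special=False))
```

With `parse_special=True`, untrusted text containing the literal string can inject the end-of-text ID into the sequence. With `parse_special=False`, the same string stays ordinary text.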

jklj077 changed discussion status to closed Aug 8, 2023
