Tokenization

Note: The term "tokenization" has no agreed-upon equivalent in Chinese, so this document uses the English term for clarity.

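Qwen-7B uses BPE tokenization on UTF-8 bytes and relies on the efficient tiktoken package to perform it. There are two kinds of tokens in Qwen-7B: regular tokens of type bytes, which come from BPE, and special tokens of type str, which are designated explicitly.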

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True)

Regular tokens

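Regular tokens come from BPE and are learned on the UTF-8 encoded byte sequences of text. Working on byte sequences guarantees that any text can be tokenized and that there are no unknown tokens, but rare text may fall back to byte-level encoding, where a single token can carry only part of a character. Because errors is set to replace when byte sequences are decoded to text, decoding an incomplete token sequence can hit UTF-8 decoding errors, which show up as the replacement character (�) in the generated output. This can be suppressed by setting errors to ignore. For a one-off change, pass errors to the tokenizer's decode function; for a persistent change, pass it when initializing the tokenizer. Note that the setting passed to decode takes precedence. For the accepted values of errors, see the Python documentation.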

>>> tokenizer.decode([51461])
' �'

>>> tokenizer.convert_ids_to_tokens([51461])
[b' \xe6\xa0']

>>> b' \xe6\xa0'.decode("utf-8", errors='replace')
' �'

>>> tokenizer.decode([51461, 117])
' 根'

>>> tokenizer.convert_ids_to_tokens([51461, 117])
[b' \xe6\xa0', b'\xb9']

>>> b' \xe6\xa0\xb9'.decode("utf-8", errors='replace')
' 根'
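The effect of the errors setting can be reproduced without loading the tokenizer. The following is a minimal sketch that uses only Python's built-in bytes.decode on the two byte tokens from the session above; decode_tokens is a hypothetical helper written for illustration, not part of the Qwen tokenizer.

```python
# Byte-level tokens taken from the session above: together they spell ' 根'.
tokens = [b' \xe6\xa0', b'\xb9']


def decode_tokens(byte_tokens, errors="replace"):
    """Join byte tokens and decode them as UTF-8 with the given error policy.

    Hypothetical helper for illustration; the real tokenizer exposes the
    same choice through the errors argument described above.
    """
    return b"".join(byte_tokens).decode("utf-8", errors=errors)


# Only the first token: the character is cut off after two of its three bytes.
incomplete_replace = decode_tokens(tokens[:1], errors="replace")
incomplete_ignore = decode_tokens(tokens[:1], errors="ignore")

# Both tokens: the byte sequence is complete, so the policy no longer matters.
complete = decode_tokens(tokens)

print(repr(incomplete_replace))
print(repr(incomplete_ignore))
print(repr(complete))
```

With ignore, the dangling bytes are dropped silently instead of being shown as a replacement character, so the choice depends on whether incomplete output should be visible to the reader.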

The mapping from regular tokens (of type bytes) to ids can be retrieved with tokenizer.get_vocab(). Adding regular tokens to the tokenizer is neither supported nor recommended.

Special tokens

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True, pad_token='<|endoftext|>')

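Note: For the provided trained models, setting tokens such as bos, eos, and unk is meaningless, because the models do not need these concepts. If such tokens are set without fine-tuning the model so that it learns their meaning, undefined behavior may result. In particular, <|endoftext|> should not be confused with eos unless the two really mean the same thing in your application, that is, the end of a sentence is also the end of the text.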

Injection attack prevention

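Because special tokens and regular tokens are conceptually different, how should input text that contains the literal form of a special token be handled? Take the following text as an example: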

print("<|endoftext|>")

Its correct tokenization is

ids: [1350, 9639, 91, 8691, 723, 427, 91, 82598]
tokens: [b'print', b'("<', b'|', b'endo', b'ft', b'ext', b'|', b'>")']

and not

ids: [1350, 445, 151643, 899]
tokens: [b'print', b'("', '<|endoftext|>', b'")']

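The default behavior used to be the correct one: every character in the input text was treated as regular tokens, and special tokens had to be handled manually by the developer after tokenization. However, this differed from common practice in the community and forced developers to add an extra adaptation step when reusing code.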

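The default behavior has since been changed to parse the literal forms of special tokens from the input text. To enable injection attack prevention, pass allowed_special=set():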

>>> tokenizer('print("<|endoftext|>")', allowed_special=set())
{'input_ids': [1350, 9639, 91, 8691, 723, 427, 91, 82598], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

This behavior can be controlled at a finer grain by setting allowed_special to a set of str:

>>> tokenizer('print("<|extra_0|>")<|endoftext|>', allowed_special={'<|endoftext|>'})
{'input_ids': [1350, 9639, 91, 15460, 62, 15, 91, 82598, 151643], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

If you want a more explicit signal when the literal form of a special token appears in the input, configure disallowed_special so that the tokenizer raises an exception:

>>> tokenizer('print("<|extra_0|>")<|endoftext|>', allowed_special={'<|endoftext|>'}, disallowed_special=('<|extra_0|>', ))
...
ValueError: Encountered text corresponding to disallowed special token '<|extra_0|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|extra_0|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|extra_0|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.

For more information on allowed_special and disallowed_special, see the tiktoken code.

The new default behavior is equivalent to the following setting:

>>> tokenizer('print("<|endoftext|>")', allowed_special="all", disallowed_special=())
{'input_ids': [1350, 445, 151643, 899], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
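To make the interplay of allowed_special and disallowed_special easier to follow, here is a toy model of the splitting step written in plain Python. It is a sketch under stated assumptions, not the tiktoken or Qwen implementation: split_special and SPECIAL_TOKENS are hypothetical names, only the two special-token literals used in the examples above are included, and the function returns labeled text segments rather than token ids. Its defaults mirror the new default behavior described above.

```python
import re

# Special-token literals taken from the examples above; the real tokenizer
# defines more. This table is an assumption made for illustration only.
SPECIAL_TOKENS = {"<|endoftext|>", "<|extra_0|>"}


def split_special(text, allowed_special="all", disallowed_special=()):
    """Split text into ("special", literal) and ("text", chunk) segments.

    Toy model of the allowed/disallowed semantics, not the real implementation.
    """
    if allowed_special == "all":
        allowed_special = SPECIAL_TOKENS
    if disallowed_special == "all":
        disallowed_special = SPECIAL_TOKENS - set(allowed_special)

    # A disallowed literal anywhere in the input is an error.
    for literal in disallowed_special:
        if literal in text:
            raise ValueError(
                f"Encountered text corresponding to disallowed special token {literal!r}."
            )

    # With nothing allowed, the whole input is ordinary text.
    if not allowed_special:
        return [("text", text)] if text else []

    # Split on the allowed literals; the capturing group keeps them in the result.
    pattern = "(" + "|".join(re.escape(tok) for tok in sorted(allowed_special)) + ")"
    segments = []
    for piece in re.split(pattern, text):
        if not piece:
            continue
        kind = "special" if piece in allowed_special else "text"
        segments.append((kind, piece))
    return segments


# Injection prevention: the literal stays ordinary text.
safe = split_special('print("<|endoftext|>")', allowed_special=set())

# Default behavior: the literal is parsed as a special token.
parsed = split_special('print("<|endoftext|>")')

# Finer control: only <|endoftext|> is parsed, <|extra_0|> stays text.
mixed = split_special('print("<|extra_0|>")<|endoftext|>',
                      allowed_special={"<|endoftext|>"})

print(safe)
print(parsed)
print(mixed)
```

In this sketch, segments labeled "text" would go on to ordinary BPE encoding, while segments labeled "special" would map directly to a single special-token id.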