Calling QWenTokenizer.convert_tokens_to_string() fails with a missing byte_decoder error

#2
by twang2218 - opened

When I call

tokenizer.convert_tokens_to_string([k])

the following error is raised and the call fails:

convert_tokens_to_string(b'ictionary') failed: 'QWenTokenizer' object has no attribute 'byte_decoder'

Looking through the code, I found:

https://huggingface.co/Qwen/Qwen-7B-Chat/blob/main/tokenization_qwen.py#L197-L206

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        """
        Converts a sequence of tokens in a single string. The most simple way to do it is `" ".join(tokens)` but we
        often want to remove sub-word tokenization artifacts at the same time.
        """
        text = "".join(tokens)
        text = bytearray([self.byte_decoder[c] for c in text]).decode(
            "utf-8", errors=self.errors
        )
        return text

It does indeed use self.byte_decoder[c], but neither QWenTokenizer nor PreTrainedTokenizer defines this attribute.
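For illustration, here is a minimal sketch of how such a method could work without any byte_decoder lookup, assuming the tokens are raw bytes objects (as the b'ictionary' in the error message suggests), possibly mixed with plain str tokens. The function name convert_tokens_to_string_sketch and the default errors="replace" are hypothetical, and this is not necessarily the fix that was applied upstream.

```python
from typing import List, Union


def convert_tokens_to_string_sketch(
    tokens: List[Union[bytes, str]], errors: str = "replace"
) -> str:
    """Join byte-level tokens into a string without a byte_decoder table.

    Hypothetical standalone helper, not the actual QWenTokenizer method.
    """
    text = ""
    pending = b""  # bytes collected so far but not yet decoded
    for token in tokens:
        if isinstance(token, bytes):
            # Buffer bytes so a multi-byte UTF-8 character that is split
            # across adjacent tokens is decoded as a whole.
            pending += token
        elif isinstance(token, str):
            # A str token (assumed already decoded text) flushes the buffer.
            if pending:
                text += pending.decode("utf-8", errors=errors)
                pending = b""
            text += token
        else:
            raise TypeError(f"token must be bytes or str, got {type(token).__name__}")
    if pending:
        text += pending.decode("utf-8", errors=errors)
    return text


print(convert_tokens_to_string_sketch([b"ictionary"]))
```

Decoding the buffered bytes in one go, rather than token by token, matters when a single character spans two tokens: decoding each half separately would produce replacement characters.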

Qwen org

Thank you for raising this issue. This has been fixed; please try:

>>> tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True, force_download=True)
>>> tokenizer.convert_tokens_to_string([b'ictionary'])
'ictionary'

I'll close this for now. If there are other problems, please open a new one.

jklj077 changed discussion status to closed
