Qwen/Qwen-7B-Chat · Calling QWenTokenizer.convert_tokens_to_string() fails with missing byte_decoder

Aug 3, 2023

Calling

tokenizer.convert_tokens_to_string([k])

raises the following error, so the call cannot complete:

convert_tokens_to_string(b'ictionary') failed: 'QWenTokenizer' object has no attribute 'byte_decoder'

After reading through the code, I found:

https://huggingface.co/Qwen/Qwen-7B-Chat/blob/main/tokenization_qwen.py#L197-L206

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        """
        Converts a sequence of tokens in a single string. The most simple way to do it is `" ".join(tokens)` but we
        often want to remove sub-word tokenization artifacts at the same time.
        """
        text = "".join(tokens)
        text = bytearray([self.byte_decoder[c] for c in text]).decode(
            "utf-8", errors=self.errors
        )
        return text

The method does use self.byte_decoder[c], but neither QWenTokenizer nor PreTrainedTokenizer defines that attribute.
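For illustration, here is a minimal sketch of a conversion that does not need a byte_decoder table at all, assuming the tokens are already raw bytes (as the b'ictionary' in the error message suggests). The function name tokens_to_string, the handling of str tokens, and the default errors="replace" are my own assumptions, not the code in tokenization_qwen.py.

```python
from typing import List, Union


def tokens_to_string(tokens: List[Union[bytes, str]], errors: str = "replace") -> str:
    """Hypothetical helper: concatenate byte-level tokens, then decode as UTF-8."""
    buffer = bytearray()
    for token in tokens:
        if isinstance(token, str):
            # Assumption: str tokens (for example special tokens) are plain text.
            buffer.extend(token.encode("utf-8"))
        else:
            buffer.extend(token)
    # Decode once at the end so a multi-byte character split across
    # two tokens is still reassembled correctly.
    return buffer.decode("utf-8", errors=errors)


print(tokens_to_string([b"ictionary"]))
```

Decoding only after all tokens are joined matters because a single UTF-8 character can be split across token boundaries.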

jklj077

Qwen org Aug 8, 2023

Thank you for raising this issue. This has been fixed. Please try:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B', trust_remote_code=True, force_download=True)
>>> tokenizer.convert_tokens_to_string([b'ictionary'])
'ictionary'

I'll close this for now. If there are other problems, please open a new one.

jklj077 changed discussion status to closed Aug 8, 2023
