THUDM/glm-4-9b-chat · UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

Jul 22, 2024

•

edited Jul 22, 2024

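When I run glm-4-9b-chat locally with the Transformers loader in text-generation-webui, generation is interrupted by "UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data" whenever the output reaches certain characters. None of the other local models I run have this problem.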

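From my testing, at least the characters “嚎”, “啕”, “讪”, “匕”, “奂”, “嗔”, “蝼”, “缨”, “嚓”, “猾”, “狰”, “狞”, “囔”, “饴”, “啬”, “岌”, and special characters such as “╒”, reliably interrupt the output. There are many more that I have not identified yet. The emoji 💰 also interrupts the output, but the error shown in the backend is slightly different: “UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: unexpected end of data”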

For example, in the Default tab of text-generation-webui, with Prompt set to none, I type the following directly into the input box:

请重复一次以下语句：“你好，我是GLM-4-9b。美轮美奂，叹为观止。”

In raw format, the output I get is:

请重复一次以下语句：“你好，我是GLM-4-9b。美轮美奂，叹为观止。”你好，我是GLM-4-9b。美轮美

Just before the character “奂” would be output, the backend shows “UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data” and generation stops.
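A small sketch of where the two different position ranges may come from, assuming the failing token holds only the leading bytes of a multi-byte character. That assumption is not confirmed by the logs above, and the helper name `truncated_decode_error` is made up for this illustration:

```python
def truncated_decode_error(char: str, keep: int) -> str:
    """Decode only the first `keep` UTF-8 bytes of `char` and return the error text."""
    fragment = char.encode("utf-8")[:keep]
    try:
        fragment.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)
    return ""  # the fragment happened to be valid UTF-8 on its own


# "奂" takes 3 bytes in UTF-8; keeping 2 of them gives the "position 0-1" message.
print(truncated_decode_error("奂", 2))
# "💰" takes 4 bytes in UTF-8; keeping 3 of them gives the "position 0-2" message.
print(truncated_decode_error("💰", 3))
```

The range in the message covers the bytes of the incomplete sequence, which would explain why a 4-byte emoji reports one more position than a 3-byte Chinese character.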

zRzRzRzRzRzRzR

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org Jul 23, 2024

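Could it be that tiktoken cannot print these characters correctly when decoding? It looks like a problem in the model implementation. We will look into it.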

Tyler-f

Jul 24, 2024

OK. It seems to be the less frequently used characters that fail to decode.

zRzRzRzRzRzRzR changed discussion status to closed Aug 11, 2024

Tyler-f

Aug 17, 2024

•

edited Aug 17, 2024

I see the status is now closed. Does that mean the problem has been solved?

zRzRzRzRzRzRzR

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org Aug 19, 2024

•

edited Aug 19, 2024

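There is currently no way to fix this, but it is not a bug. The model uses tiktoken, which is a BPE scheme, so it is normal that some Chinese characters cannot be printed: what a single token holds is not necessarily valid UTF-8 on its own, and the current model implementation simply cannot print it. This is expected behavior.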

Tyler-f

Aug 20, 2024

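OK, understood. Is there an alternative workaround, for example outputting an English character in place of any character that cannot be printed?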

zRzRzRzRzRzRzR

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org Aug 20, 2024

The current approach is simply to encode and then decode again.

Tyler-f

Aug 20, 2024

Could you explain a bit more about how to do that?

GralchemOz

Aug 30, 2024

•

edited Aug 30, 2024

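This is actually a problem in tgw (text-generation-webui). With a BPE scheme, a single generation step may output only part of a Chinese character. Take the character '簌': its UTF-8 encoding is b'\xe7\xb0\x8c', but a single step may output only b'\xe7\xb0', and the complete encoding is available only after the next step. tgw decodes after every step, so it raises an error. The only way to solve this is to change the logic in the tgw source code.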
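To make the failure mode concrete, here is a minimal sketch using only the standard library. The split point between the two steps is an assumption for illustration, and `decode_each_step` and `decode_accumulated` are hypothetical helpers, not tgw functions:

```python
full = "簌".encode("utf-8")      # three bytes for one character
chunks = [full[:2], full[2:]]    # assumed split across two generation steps


def decode_each_step(parts):
    """Decode every chunk on its own, the way a naive streaming loop would."""
    results = []
    for part in parts:
        try:
            results.append(part.decode("utf-8"))
        except UnicodeDecodeError:
            results.append(None)  # this chunk alone is not valid UTF-8
    return results


def decode_accumulated(parts):
    """Join all chunks first, then decode once."""
    return b"".join(parts).decode("utf-8")


print(decode_each_step(chunks))    # neither chunk decodes on its own
print(decode_accumulated(chunks))  # the joined bytes decode to the character
```

Neither fragment is decodable in isolation, while the concatenation is, which is why decoding a single token's bytes on every step can fail.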

The change is easy. In the tgw directory, in modules\text_generation.py at line 270, change the get_reply_from_output_ids function to the following:

def get_reply_from_output_ids(output_ids, state=None, starting_from=0):
    reply = decode(output_ids[starting_from:], state['skip_special_tokens'] if state else True)

    # Handle tokenizers that do not add the leading space for the first token
    if (hasattr(shared.tokenizer, 'convert_ids_to_tokens') and len(output_ids) > starting_from) and not reply.startswith(' '):
        first_token = shared.tokenizer.convert_ids_to_tokens(int(output_ids[starting_from]))
        if isinstance(first_token, (bytes,)):
            # The token may hold only part of a multi-byte UTF-8 character
            try:
                first_token = first_token.decode('utf8')
            # On UnicodeDecodeError, treat this step's first token as an empty string
            except UnicodeDecodeError:
                first_token = ''

        if first_token.startswith('▁'):
            reply = ' ' + reply

    return reply

In other words, just add exception handling: when a UnicodeDecodeError occurs, that step's output is empty.
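For comparison, a general way to stream bytes that may split a character is Python's incremental decoder, which holds back incomplete bytes until the rest arrives. This is only a sketch of that technique, not tgw's code and not the patch above; `stream_decode` is a hypothetical helper and the chunk boundary is chosen for illustration:

```python
import codecs


def stream_decode(byte_chunks):
    """Yield text for each step, buffering bytes of an unfinished character."""
    # errors="replace" turns genuinely invalid bytes into U+FFFD instead of raising
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in byte_chunks:
        yield decoder.decode(chunk)
    # Flush whatever is still buffered when the stream ends
    yield decoder.decode(b"", final=True)


data = "美轮美奂".encode("utf-8")   # 4 characters, 12 bytes
chunks = [data[:11], data[11:]]     # cut inside the last character
pieces = list(stream_decode(chunks))
print(pieces)
print("".join(pieces))
```

The first step yields only the complete characters, and the held-back bytes are emitted once the next chunk completes them, so no step raises.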

Tyler-f

Aug 31, 2024

Now that I understand the cause, it is easy to deal with. Thanks for the explanation.