The model supports a maximum text length of 512 tokens. How many English letters or Chinese characters does one token correspond to?

by kyonyan - opened Jan 23

Discussion

kyonyan

Jan 23

As the title says. Thanks.

Shitao

Beijing Academy of Artificial Intelligence org Jan 23

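Hello, a single token can correspond to several letters or Chinese characters, and there is no fixed ratio.
You can compute the length of a text after tokenization as follows: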

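from transformers import AutoTokenizer

# Load the tokenizer that ships with the model
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
# Number of token ids for the text; by default this count includes the special tokens the tokenizer adds
length = len(tokenizer("hello world")['input_ids'])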
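To see why the ratio is not constant, here is a minimal sketch of greedy longest-match subword splitting, the general idea behind WordPiece-style tokenizers. It assumes the model's tokenizer behaves roughly like this. `TOY_VOCAB`, `split_words`, `wordpiece`, and `toy_tokenize` are hypothetical names invented for illustration, and the vocabulary is made up, not the real one from BAAI/bge-large-zh.

```python
# Toy vocabulary, invented for illustration; NOT the real bge-large-zh vocabulary.
TOY_VOCAB = {"hello", "world", "token", "##izer", "##s",
             "你", "好", "世", "界", "[UNK]"}


def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def split_words(text):
    """Split on whitespace and treat every CJK character as its own word."""
    words, current = [], ""
    for ch in text:
        if ch.isspace() or is_cjk(ch):
            if current:
                words.append(current)
                current = ""
            if is_cjk(ch):
                words.append(ch)
        else:
            current += ch
    if current:
        words.append(current)
    return words


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(text):
    """Tokenize text with the toy vocabulary (no special tokens added)."""
    tokens = []
    for word in split_words(text.lower()):
        tokens.extend(wordpiece(word, TOY_VOCAB))
    return tokens


for sample in ["hello world", "tokenizers", "你好世界"]:
    tokens = toy_tokenize(sample)
    print(f"{sample!r}: {len(sample)} characters, {len(tokens)} tokens {tokens}")
```

With this toy vocabulary, "hello world" (11 characters) becomes 2 tokens, "tokenizers" (10 characters) becomes 3, and "你好世界" (4 characters) becomes 4, so the characters-per-token ratio depends on the text and on the vocabulary. For real inputs, measure with the model's own tokenizer as shown above.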
