Tokenizer overflow caused by problematic model_max_length

#16
by wanghaofan - opened

To reproduce the issue, run the following lines:

from transformers import BertTokenizer

pretrained_model_name_or_path = "IDEA-CCNL/Taiyi-CLIP-Roberta-large-326M-Chinese"
tokenizer = BertTokenizer.from_pretrained(pretrained_model_name_or_path)

captions = ["一只猫", "测试"]

# padding="max_length" pads every sequence up to tokenizer.model_max_length,
# which triggers an overflow here because that attribute defaults to a huge value
inputs = tokenizer(captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True)

When padding is set to "max_length", this raises an overflow error. If you print tokenizer.model_max_length, it is an abnormally large number, which looks like a misconfiguration rather than an intended value. As a workaround, set tokenizer.model_max_length to a small number such as 77.
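A minimal sketch of that workaround (77 is an assumption here, taken from CLIP's usual text context length; any value at or below 512 should also avoid the overflow):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-CLIP-Roberta-large-326M-Chinese")

# Workaround: override the abnormally large default before padding to max_length
tokenizer.model_max_length = 77

captions = ["一只猫", "测试"]
inputs = tokenizer(captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True)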

Fengshenbang-LM org

We did not set model_max_length in the tokenizer config of "IDEA-CCNL/Taiyi-CLIP-Roberta-large-326M-Chinese", so it falls back to a very large default.
We fixed this when training Stable Diffusion by manually setting it to 512. You can try this:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-Stable-Diffusion-1B-Chinese-v0.1", subfolder="tokenizer")
tokenizer.model_max_length  # 512
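With that tokenizer, the original snippet should pad to 512 instead of overflowing; a quick check (a sketch reusing the captions from above):

captions = ["一只猫", "测试"]
inputs = tokenizer(captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True)
print(len(inputs["input_ids"][0]))  # expected: 512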
wuxiaojun changed discussion status to closed
