laion/CLIP-ViT-H-14-laion2B-s32B-b79K · `model_max_length` might be missing from the `tokenizer_config.json`

Nov 18, 2022

edited Nov 18, 2022

I'm new to HF and a total beginner with language models, so I'm not sure whether this is a feature or a bug, but I hope this helps.
I ran into an issue where calling

from transformers import AutoTokenizer, CLIPModel

clip_model_name: str = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
clip_model = CLIPModel.from_pretrained(clip_model_name, local_files_only=True)
clip_tokenizer = AutoTokenizer.from_pretrained(clip_model_name)

often leads to tensor shape mismatch errors when calling clip_tokenizer. This isn't an issue for the other OpenAI CLIP models, as they all have model_max_length: 77 in their tokenizer_config.json.
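To check whether a given tokenizer_config.json actually carries the key, something like the following should work. It is only a rough sketch: `read_model_max_length` is a helper name I made up, and the path in the usage comment is a placeholder for wherever the file sits on your machine.

```python
import json
from pathlib import Path


def read_model_max_length(config_path):
    """Return the model_max_length value from a tokenizer_config.json file.

    Returns None when the key is absent, which is the situation
    described above for this model.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return config.get("model_max_length")


# Usage (placeholder path, point it at your own local copy of the file):
# print(read_model_max_length("path/to/tokenizer_config.json"))
```

A result of None would mean the key is missing from that file.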

My temporary workaround:
token_features = clip_tokenizer([something], return_tensors="pt", truncation=True, max_length=77)
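A variant of the same workaround is to set the limit once on the tokenizer object, so that max_length does not have to be passed on every call. This is a minimal sketch under a few assumptions: `ensure_model_max_length` and `UNSET_THRESHOLD` are names I invented, and the threshold relies on my understanding that transformers falls back to a very large integer for model_max_length when the config does not define it.

```python
# Assumption: any value above this threshold means "no real limit was configured".
UNSET_THRESHOLD = 1_000_000
# CLIP text context length, as used by the OpenAI CLIP tokenizer configs.
CLIP_CONTEXT_LENGTH = 77


def ensure_model_max_length(tokenizer, default=CLIP_CONTEXT_LENGTH):
    """Set tokenizer.model_max_length to `default` if it looks unset.

    A tokenizer that already has a sensible limit is left unchanged.
    """
    current = getattr(tokenizer, "model_max_length", None)
    if current is None or current > UNSET_THRESHOLD:
        tokenizer.model_max_length = default
    return tokenizer


# Usage with the tokenizer from the snippet above:
# clip_tokenizer = ensure_model_max_length(clip_tokenizer)
# token_features = clip_tokenizer([something], return_tensors="pt", truncation=True)
```

With the limit set on the tokenizer, truncation=True should cut inputs to 77 tokens without an explicit max_length argument.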

rom1504

LAION eV org Nov 18, 2022

@rwightman would you know how to fix this?

fischcheng changed discussion title from max_length might be missing from the `tokenizer_config.json` to `model_max_length` might be missing from the `tokenizer_config.json` Nov 18, 2022

rwightman

LAION eV org Nov 24, 2022

The tokenizer config was taken from https://huggingface.co/openai/clip-vit-base-patch32/blob/main/tokenizer_config.json ... so I'm surprised that one works and this one doesn't. Perhaps that specific OpenAI CLIP model was missing the max length... will look closer