Add projection dim to text and vision model configs for CLIPVisionModelWithProjection and CLIPTextModelWithProjection support


The default projection_dim in the text and vision configs is 512, which throws an error when loading weights for:

from transformers import CLIPVisionModelWithProjection


from transformers import CLIPTextModelWithProjection

Loading CLIPModel does not throw an error because it uses the projection_dim at the top level of the config.

from transformers import CLIPModel
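
To illustrate the mismatch described above, here is a minimal sketch that models the config as plain dictionaries instead of calling transformers. The helper names `effective_projection_dim` and `sync_projection_dim` are hypothetical, and the value 1024 is only an illustrative top-level projection_dim; the sketch assumes a sub-config without its own projection_dim falls back to the default of 512.

```python
import copy
from typing import Optional

DEFAULT_PROJECTION_DIM = 512  # default stated in this PR description


def effective_projection_dim(config: dict, sub_config: Optional[str] = None) -> int:
    """Return the projection_dim a model class would see (simplified model).

    sub_config=None models CLIPModel, which reads the top-level value.
    sub_config="text_config" or "vision_config" models the *WithProjection
    classes, which read the value from their own sub-config.
    """
    source = config if sub_config is None else config.get(sub_config, {})
    return source.get("projection_dim", DEFAULT_PROJECTION_DIM)


def sync_projection_dim(config: dict) -> dict:
    """Return a copy with the top-level projection_dim copied into both sub-configs."""
    patched = copy.deepcopy(config)  # leave the caller's config untouched
    top_level = patched.get("projection_dim", DEFAULT_PROJECTION_DIM)
    for key in ("text_config", "vision_config"):
        patched.setdefault(key, {})["projection_dim"] = top_level
    return patched


# Illustrative config: only the top level sets projection_dim (assumed value).
example_config = {"projection_dim": 1024, "text_config": {}, "vision_config": {}}

before = effective_projection_dim(example_config, "text_config")
patched_config = sync_projection_dim(example_config)
after = effective_projection_dim(patched_config, "text_config")

print("top level:", effective_projection_dim(example_config))
print("text_config before:", before)
print("text_config after:", after)
```

In this simplified model, the sub-config value differs from the top-level one until it is copied down, which is the kind of change this PR makes to the text and vision configs.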

Testing the PR:

from transformers import CLIPVisionModelWithProjection, CLIPTextModelWithProjection, CLIPModel

CLIPVisionModelWithProjection.from_pretrained('laion/CLIP-ViT-H-14-laion2B-s32B-b79K', revision="refs/pr/6")

CLIPTextModelWithProjection.from_pretrained('laion/CLIP-ViT-H-14-laion2B-s32B-b79K', revision="refs/pr/6")

CLIPModel.from_pretrained('laion/CLIP-ViT-H-14-laion2B-s32B-b79K', revision="refs/pr/6")
rwightman changed pull request status to merged