Add projection_dim to the text and vision model configs for CLIPVisionModelWithProjection and CLIPTextModelWithProjection support

#6

The default projection_dim is 512, which does not match this checkpoint's projection weights, so loading fails with a size-mismatch error for

from transformers import CLIPVisionModelWithProjection
CLIPVisionModelWithProjection.from_pretrained('laion/CLIP-ViT-H-14-laion2B-s32B-b79K')

or

from transformers import CLIPTextModelWithProjection
CLIPTextModelWithProjection.from_pretrained('laion/CLIP-ViT-H-14-laion2B-s32B-b79K')
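
Until the fix is merged, a minimal workaround sketch (assuming this checkpoint's actual projection dimension is 1024) is to override the nested projection_dim via a from_pretrained kwarg, which transformers forwards to the model config:

from transformers import CLIPVisionModelWithProjection

# Hypothetical workaround: override the nested projection_dim so the
# visual_projection weight shape matches the checkpoint (assumed 1024 here).
model = CLIPVisionModelWithProjection.from_pretrained(
    'laion/CLIP-ViT-H-14-laion2B-s32B-b79K',
    projection_dim=1024,
)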

Loading CLIPModel does not throw an error because it reads projection_dim from the top level of the config.

from transformers import CLIPModel
CLIPModel.from_pretrained('laion/CLIP-ViT-H-14-laion2B-s32B-b79K')
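
The difference is visible directly in the config: CLIPConfig carries projection_dim at the top level, while the nested text and vision configs fall back to the 512 default when the field is absent. A quick inspection sketch (no assumptions beyond the model id above):

from transformers import CLIPConfig

config = CLIPConfig.from_pretrained('laion/CLIP-ViT-H-14-laion2B-s32B-b79K')
print(config.projection_dim)                # top-level value, used by CLIPModel
print(config.vision_config.projection_dim)  # nested value; defaults to 512 without this PR
print(config.text_config.projection_dim)    # same for the text tower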

Testing the PR:

from transformers import CLIPVisionModelWithProjection, CLIPTextModelWithProjection, CLIPModel

CLIPVisionModelWithProjection.from_pretrained('laion/CLIP-ViT-H-14-laion2B-s32B-b79K', revision="refs/pr/6")

CLIPTextModelWithProjection.from_pretrained('laion/CLIP-ViT-H-14-laion2B-s32B-b79K', revision="refs/pr/6")

CLIPModel.from_pretrained('laion/CLIP-ViT-H-14-laion2B-s32B-b79K', revision="refs/pr/6")
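
Beyond checking that loading succeeds, a sanity-check sketch is to run a dummy forward pass and inspect the embedding shape (assuming the usual 224x224 input resolution for this checkpoint):

import torch
from transformers import CLIPVisionModelWithProjection

model = CLIPVisionModelWithProjection.from_pretrained(
    'laion/CLIP-ViT-H-14-laion2B-s32B-b79K', revision="refs/pr/6")
pixel_values = torch.randn(1, 3, 224, 224)  # dummy image batch
with torch.no_grad():
    out = model(pixel_values=pixel_values)
print(out.image_embeds.shape)  # expected: (1, config.projection_dim)
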
rwightman changed pull request status to merged
