Extracting `text_encoder` from `ViT-H-14` using `open_clip_torch`?

by Chanuhf

I've loaded the pre-trained CLIP model variant ViT-H-14 using open_clip_torch. While I can get the tokenizer with open_clip.get_tokenizer('ViT-H-14'), I'm unsure how to extract the text_encoder.

Can anyone guide me on obtaining the text_encoder from this model?

For example:

!pip install transformers
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained('openai/clip-vit-large-patch14')
text_encoder = CLIPTextModel.from_pretrained('openai/clip-vit-large-patch14')
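
For context, this is roughly how I use that text_encoder (a minimal sketch; the prompt string is just an illustrative placeholder):

import torch

inputs = tokenizer(["a photo of a cat"], padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = text_encoder(**inputs)
token_embeddings = outputs.last_hidden_state  # (batch, seq_len, hidden)
pooled_embedding = outputs.pooler_output      # (batch, hidden)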

Expecting:

!pip install open_clip_torch

import open_clip

model, train_transform, eval_transform = open_clip.create_model_and_transforms('ViT-H-14', pretrained='laion2b_s32b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-H-14')
text_encoder = ________________________________
LAION e.V. org

@Chanuhf there is no method that creates the text or image encoder by itself, but it's easy enough to encode just text (or images), or to extract either tower. To extract the text tower, set the custom-text flag so that all of the text components are pushed into their own sub-module:

import open_clip
import torch.nn.functional as F

model, train_transform, eval_transform = open_clip.create_model_and_transforms('ViT-H-14', pretrained='laion2b_s32b_b79k', force_custom_text=True)
tokenizer = open_clip.get_tokenizer('ViT-H-14')
text_encoder = model.text  # the text tower as a standalone sub-module
del model  # drop the rest of the model if only the text tower is needed

text_tokens = tokenizer(["a photo of a cat"])  # illustrative prompt
x = text_encoder(text_tokens)
x = F.normalize(x, dim=-1)  # if normalized output desired
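
As a quick sanity check (a hedged sketch; run it before the `del model` line above, and the prompt is just a placeholder): with force_custom_text=True, encode_text on the full model dispatches to the same model.text module, so the extracted tower should reproduce the full model's text features.

import torch

model.eval()  # deterministic eval mode
tokens = tokenizer(["a photo of a dog"])        # placeholder prompt
with torch.no_grad():
    full_out = model.encode_text(tokens)        # full-model path
    tower_out = text_encoder(tokens)            # extracted-tower path
print(torch.allclose(full_out, tower_out))      # expected: True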
Alternatively, if you don't need a separate module, keep the full model and just call encode_text:

import open_clip

model, train_transform, eval_transform = open_clip.create_model_and_transforms('ViT-H-14', pretrained='laion2b_s32b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-H-14')
text_inputs = tokenizer(["a photo of a cat"])  # illustrative prompt
text_features = model.encode_text(text_inputs)
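
The image side works the same way (a hedged sketch continuing from the snippet above; the filename is just a placeholder): preprocess with the returned transform, then use encode_image on the full model, or grab the visual tower directly as model.visual.

import torch
import torch.nn.functional as F
from PIL import Image

image = eval_transform(Image.open('example.jpg')).unsqueeze(0)  # placeholder image file
with torch.no_grad():
    image_features = model.encode_image(image)   # full-model path
    # image_features = model.visual(image)       # or: extracted visual tower
image_features = F.normalize(image_features, dim=-1)  # if normalized output desired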
