Image embeddings differ from the official OpenAI CLIP model

#1
by eugeneware - opened

The normalized image embeddings generated by this Hugging Face version of the CLIP model differ from those produced by the official OpenAI implementation.

I downloaded the following image: https://thumbs.dreamstime.com/b/lovely-cat-as-domestic-animal-view-pictures-182393057.jpg

I generated image embeddings using this model with the following code:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

_model = CLIPModel.from_pretrained('openai/clip-vit-large-patch14')
_processor = CLIPProcessor.from_pretrained('openai/clip-vit-large-patch14')

img = Image.open('lovely-cat-as-domestic-animal-view-pictures-182393057.jpg').convert('RGB')
inputs = _processor(images=img, return_tensors='pt', padding=True)

with torch.no_grad():
    vision_outputs = _model.vision_model(**inputs)
    image_embeds = vision_outputs[1]                       # pooled output of the vision tower
    image_embeds = _model.visual_projection(image_embeds)  # project into the joint embedding space
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)  # L2-normalize
print(image_embeds[0, :10])

I get:

tensor([-0.0262,  0.0541,  0.0122,  0.0053,  0.0453,  0.0138,  0.0141,  0.0035,
         0.0202, -0.0173])
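
As an aside, the same projected embedding can also be obtained with CLIPModel.get_image_features, which (as far as I can tell from the modeling code) runs the vision tower and the visual projection in one call; the normalization still has to be done by hand. A quick sketch, reusing _model and inputs from the snippet above:

# Convenience equivalent of the manual vision_model + visual_projection steps
image_embeds_alt = _model.get_image_features(**inputs)
image_embeds_alt = image_embeds_alt / image_embeds_alt.norm(dim=-1, keepdim=True)
print(torch.allclose(image_embeds_alt, image_embeds))  # expected: True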

When I use the official implementation with this code:

import clip
import torch
from PIL import Image

__model, __preprocess = clip.load("ViT-L/14", device='cpu')
__image = __preprocess(Image.open('lovely-cat-as-domestic-animal-view-pictures-182393057.jpg')).unsqueeze(0)
with torch.no_grad():
    __image_features = __model.encode_image(__image)
    __image_features /= __image_features.norm(dim=-1, keepdim=True)  # L2-normalize
print(__image_features[0, :10])

I get:

tensor([-0.0192,  0.0559,  0.0147,  0.0041,  0.0461,  0.0098,  0.0115,  0.0014,
         0.0174, -0.0151])

You can see that the values are similar, but they're off by a bit.

If I calculate the cosine similarity (which, since both embeddings are normalized, is just the dot product), I get:

image_embeds @ __image_features.t()
# tensor([[0.9971]])

I get the same result when I load the official OpenAI weights with the open_clip implementation as well.
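
For anyone who wants to reproduce the open_clip check, this is roughly what it looks like (a sketch, assuming the open_clip_torch package, where the 'openai' pretrained tag loads the original weights):

import open_clip
import torch
from PIL import Image

oc_model, _, oc_preprocess = open_clip.create_model_and_transforms('ViT-L-14', pretrained='openai')
oc_image = oc_preprocess(Image.open('lovely-cat-as-domestic-animal-view-pictures-182393057.jpg')).unsqueeze(0)
with torch.no_grad():
    oc_features = oc_model.encode_image(oc_image)
    oc_features /= oc_features.norm(dim=-1, keepdim=True)
print(oc_features[0, :10])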

So, there's some subtle difference here.

I'm running transformers 4.20.0

Actually, I worked it out. The preprocessing in the Hugging Face CLIPProcessor differs from the default clip implementation, so the model was getting a slightly different version of the image.

From what I can tell so far, the center cropping is implemented differently, which changes the pixel values slightly.
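
If you want to see the difference directly, you can compare the pixel tensors the two pipelines produce for the same image (a quick sketch, reusing img, _processor and __preprocess from the snippets above):

hf_pixels = _processor(images=img, return_tensors='pt').pixel_values  # Hugging Face preprocessing
oa_pixels = __preprocess(img).unsqueeze(0)                            # official OpenAI preprocessing
print((hf_pixels - oa_pixels).abs().max())  # non-zero, so the model sees a slightly different image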

TL;DR: if you need exactly the same input for a given image, use the OpenAI input processing pipeline like this:

import torch
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
from PIL import Image

# Mirror of the preprocessing pipeline returned by clip.load()
image_processor = Compose([
    Resize(size=224, interpolation=Image.BICUBIC),
    CenterCrop(size=(224, 224)),
    lambda image: image.convert('RGB'),
    ToTensor(),
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
])

inputs = dict(pixel_values=image_processor(img).unsqueeze(0))
with torch.no_grad():
    vision_outputs = _model.vision_model(**inputs)
    image_embeds = vision_outputs[1]
    image_embeds = _model.visual_projection(image_embeds)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
print(image_embeds[0, :10])

This now gives:

tensor([-0.0192,  0.0559,  0.0147,  0.0041,  0.0461,  0.0098,  0.0115,  0.0014,
         0.0174, -0.0151])

which matches the official output above.
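
As an extra sanity check, you can dot this re-computed embedding against the official one from the clip snippet earlier (a quick sketch; it assumes __image_features is still in scope):

print(image_embeds @ __image_features.t())  # should now come out at ~1.0000 rather than 0.9971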

cc @valhalla in case you hadn't seen this!

I found that the text embeddings differ quite a lot. Does this make sense?
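
For anyone who wants to check this, here is roughly how the text embeddings from the two implementations can be compared (a sketch; "a photo of a cat" is just an arbitrary example prompt, and it reuses _model, _processor and __model from the snippets above):

text = ["a photo of a cat"]

# Hugging Face
hf_inputs = _processor(text=text, return_tensors='pt', padding=True)
with torch.no_grad():
    hf_text = _model.get_text_features(**hf_inputs)
    hf_text = hf_text / hf_text.norm(dim=-1, keepdim=True)

# Official OpenAI clip
with torch.no_grad():
    oa_text = __model.encode_text(clip.tokenize(text))
    oa_text = oa_text / oa_text.norm(dim=-1, keepdim=True)

print(hf_text @ oa_text.t())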
