Image embeddings differ from the official OpenAI CLIP model
The normalized image embeddings produced by this Hugging Face version of CLIP differ from those produced by the official OpenAI implementation.
I downloaded the following image: https://thumbs.dreamstime.com/b/lovely-cat-as-domestic-animal-view-pictures-182393057.jpg
I generated image embeddings using this model with the following code:
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

_model = CLIPModel.from_pretrained('openai/clip-vit-large-patch14')
_processor = CLIPProcessor.from_pretrained('openai/clip-vit-large-patch14')

img = Image.open('lovely-cat-as-domestic-animal-view-pictures-182393057.jpg').convert('RGB')
inputs = _processor(images=img, return_tensors='pt', padding=True)

with torch.no_grad():
    vision_outputs = _model.vision_model(**inputs)
    image_embeds = vision_outputs[1]  # pooled output
    image_embeds = _model.visual_projection(image_embeds)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

print(image_embeds[0, :10])
I get:
tensor([-0.0262, 0.0541, 0.0122, 0.0053, 0.0453, 0.0138, 0.0141, 0.0035,
0.0202, -0.0173])
When I use the official implementation with this code:
import clip

__model, __preprocess = clip.load("ViT-L/14", device='cpu')
__image = __preprocess(img).unsqueeze(0)  # same PIL image as above

with torch.no_grad():
    __image_features = __model.encode_image(__image)
    __image_features /= __image_features.norm(dim=-1, keepdim=True)

print(__image_features[0, :10])
I get:
tensor([-0.0192, 0.0559, 0.0147, 0.0041, 0.0461, 0.0098, 0.0115, 0.0014,
0.0174, -0.0151])
You can see that the values are similar, but off by a bit.
If I calculate the cosine similarity / dot product I get:
image_embeds @ __image_features.t()
# tensor([[0.9971]])
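For reference, here's a minimal sketch (reusing the two tensors computed above) that quantifies the per-dimension gap rather than just the overall similarity:

print((image_embeds - __image_features).abs().max())  # largest elementwise deviation between the two normalized embeddings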
I get the same result when I load the official OpenAI weights with the open_clip implementation as well.
So, there's some subtle difference here.
I'm running transformers 4.20.0
Actually, I worked it out. The preprocessing done by the Hugging Face CLIPProcessor differs from the preprocessing in the official clip implementation, so the model was seeing a slightly different version of the image.
From what I can tell so far, the two center-cropping implementations differ, which changes some pixel values.
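You can confirm this by comparing the pixel tensors produced by the two pipelines directly; here's a minimal sketch reusing _processor, __preprocess, and img from the snippets above:

hf_pixels = _processor(images=img, return_tensors='pt')['pixel_values']
oa_pixels = __preprocess(img).unsqueeze(0)
print(hf_pixels.shape, oa_pixels.shape)     # both torch.Size([1, 3, 224, 224])
print((hf_pixels - oa_pixels).abs().max())  # non-zero, i.e. the model inputs differ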
TL;DR: if you need exactly the same input for a given image, use the OpenAI input-processing pipeline, like this:
from PIL import Image
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize

image_processor = Compose([
    Resize(size=224, interpolation=Image.BICUBIC),
    CenterCrop(size=(224, 224)),
    lambda img: img.convert('RGB'),
    ToTensor(),
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
              std=(0.26862954, 0.26130258, 0.27577711)),
])

inputs = dict(pixel_values=image_processor(img).unsqueeze(0))
with torch.no_grad():
    vision_outputs = _model.vision_model(**inputs)
    image_embeds = vision_outputs[1]
    image_embeds = _model.visual_projection(image_embeds)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

print(image_embeds[0, :10])
tensor([-0.0192, 0.0559, 0.0147, 0.0041, 0.0461, 0.0098, 0.0115, 0.0014,
0.0174, -0.0151])
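Which matches the official output above. As a quick sanity check (assuming __image_features from the official implementation is still in scope), the dot product should now come out at ~1.0:

print(image_embeds @ __image_features.t())  # expect roughly tensor([[1.0000]])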
I found that the text embeddings differ quite a lot. Does that make sense?