Model Card: clip-imageclef

Model Details

OpenAI CLIP model fine-tuned on image-caption pairs from the Caption Prediction dataset provided for the ImageCLEF 2017 competition. The model was evaluated before and after fine-tuning; MRR@10 was 0.57 before and 0.88 after.

Model Date

September 6, 2021

Model Type

The base model is the OpenAI CLIP model. It uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
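As a rough illustration of the contrastive objective described above, the following sketch computes a symmetric cross-entropy over a small matrix of image-text similarity scores, where entry (i, j) scores image i against text j and the matching pairs sit on the diagonal. It is written in pure Python for readability and is not the training code used for this model; the function names and the toy score matrices are hypothetical, and the scores are assumed to be already scaled by the temperature.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the target for row i is column i."""
    n = len(logits)
    return -sum(log_softmax(logits[i])[i] for i in range(n)) / n


def clip_contrastive_loss(logits_per_image):
    """Average of the image-to-text and text-to-image cross-entropies."""
    # Transposing the matrix gives the text-to-image scores.
    logits_per_text = [list(col) for col in zip(*logits_per_image)]
    return 0.5 * (cross_entropy_diagonal(logits_per_image)
                  + cross_entropy_diagonal(logits_per_text))


# Hypothetical similarity scores for a batch of 3 (image, text) pairs.
aligned = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
uniform = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

loss_aligned = clip_contrastive_loss(aligned)
loss_uniform = clip_contrastive_loss(uniform)
print(loss_aligned, loss_uniform)
```

The loss is lower when each image scores highest against its own caption (the `aligned` case) than when all pairs score equally (the `uniform` case), which is the behaviour the training objective rewards.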


The fine-tuning can be reproduced using code from the GitHub repository elsevierlabs-os/clip-image-search.


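from transformers import CLIPModel, CLIPProcessor

# Fine-tuned weights; the processor comes from the base CLIP checkpoint
model = CLIPModel.from_pretrained("sujitpal/clip-imageclef")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# captions: list of strings; images: list of PIL images (supplied by the caller)
inputs = processor(text=captions, images=images,
                   return_tensors="pt", padding=True)
output = model(**inputs)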


| Model | MRR@1 | MRR@3 | MRR@5 | MRR@10 | MRR@20 |
|---|---|---|---|---|---|
| zero-shot CLIP (baseline) | 0.426 | 0.534 | 0.558 | 0.573 | 0.578 |
| clip-imageclef (this model) | 0.802 | 0.872 | 0.877 | 0.879 | 0.880 |
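
For readers unfamiliar with the metric, MRR@k (mean reciprocal rank at k) can be sketched as below. The exact evaluation protocol behind the reported numbers is not described in this card, so this is only a minimal illustration that assumes one relevant item per query; the function names and the toy rankings are hypothetical.

```python
def reciprocal_rank(ranked_ids, relevant_id, k):
    """Return 1/rank of relevant_id within the top k results, or 0.0 if absent."""
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item == relevant_id:
            return 1.0 / rank
    return 0.0


def mrr_at_k(rankings, relevant_ids, k):
    """Average the reciprocal rank over all queries."""
    scores = [reciprocal_rank(r, rel, k)
              for r, rel in zip(rankings, relevant_ids)]
    return sum(scores) / len(scores)


# Hypothetical toy data: three queries, each with exactly one correct item.
rankings = [
    ["img_a", "img_b", "img_c"],  # correct item at rank 1
    ["img_c", "img_b", "img_a"],  # correct item at rank 2
    ["img_a", "img_c", "img_b"],  # correct item at rank 3
]
relevant_ids = ["img_a", "img_b", "img_b"]

mrr_1 = mrr_at_k(rankings, relevant_ids, k=1)
mrr_3 = mrr_at_k(rankings, relevant_ids, k=3)
print(mrr_1, mrr_3)
```

Because a correct item that falls outside the top k contributes zero, MRR@k can only stay the same or grow as k increases, which matches the pattern across the columns reported for both models.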