Model Card: clip-imageclef

Model Details

OpenAI CLIP model fine-tuned on image-caption pairs from the Caption Prediction dataset provided for the ImageCLEF 2017 competition. The model was evaluated before and after fine-tuning: MRR@10 was 0.57 before and 0.88 after.

Model Date

September 6, 2021

Model Type

The base model is the OpenAI CLIP model. It uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
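
To make the contrastive objective concrete, here is a minimal, dependency-free sketch of a symmetric cross-entropy loss over a batch of (image, text) embedding pairs. It is an illustration under assumptions, not the training code used for this model: the function names are hypothetical, the embeddings are plain Python lists, and the temperature is a fixed illustrative value, whereas CLIP learns this scale during training.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N cosine-similarity matrix.

    Assumes image_embs[i] and text_embs[i] form a matching pair.
    The fixed temperature is an illustrative assumption.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]

    # Image-to-text direction: row i should put its probability mass on column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: the same computation over columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n

    return (loss_i2t + loss_t2i) / 2

# Toy 2-D embeddings: correctly paired vs. swapped captions.
aligned = clip_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = clip_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

With these toy inputs, the correctly paired batch yields a much lower loss than the swapped one, which is the behaviour the objective rewards.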

Fine-tuning

The fine-tuning can be reproduced using code from the GitHub repository elsevierlabs-os/clip-image-search.

Usage

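from transformers import CLIPModel, CLIPProcessor

# Fine-tuned weights; the processor (tokenizer + image preprocessing) comes from the base checkpoint.
model = CLIPModel.from_pretrained("sujitpal/clip-imageclef")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# captions: list of strings; images: list of PIL images (both supplied by the caller).
inputs = processor(text=captions, images=images,
                   return_tensors="pt", padding=True)
output = model(**inputs)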
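
One way to use the result is to rank the candidate captions for a single image by their image-text logits. The sketch below assumes `output.logits_per_image` holds one row of scores per image, as the Transformers CLIP output does. The helper name `rank_captions`, the example logits, and the example captions are all made up for illustration.

```python
import math

def rank_captions(logits_row, captions, k=3):
    """Return the top-k (caption, probability) pairs for one image.

    logits_row: image-text similarity logits for a single image,
    one value per caption.
    """
    # Softmax over the captions, shifted by the max for numerical stability.
    m = max(logits_row)
    exps = [math.exp(x - m) for x in logits_row]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(captions, probs), key=lambda cp: cp[1], reverse=True)
    return ranked[:k]

# With the model output above, one row could be obtained as (not run here):
#   logits_row = output.logits_per_image[0].tolist()
example_logits = [18.2, 24.9, 21.0]  # made-up values for illustration
example_captions = ["chest x-ray", "ct scan of the abdomen", "mri of the brain"]
top = rank_captions(example_logits, example_captions, k=2)
```

Here `top` is a list of `(caption, probability)` tuples, highest probability first.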

Performance

MRR@k before and after fine-tuning:

Model-name                     k=1    k=3    k=5    k=10   k=20
zero-shot CLIP (baseline)      0.426  0.534  0.558  0.573  0.578
clip-imageclef (this model)    0.802  0.872  0.877  0.879  0.880
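
The k=10 column matches the MRR@10 figures quoted under Model Details, so the table reads as mean reciprocal rank at cutoff k. As a rough sketch of how such a metric is computed (not the evaluation script used for this model), the hypothetical helper below assumes each query has exactly one relevant item and takes a ranked list of candidate ids per query. The toy ids are invented.

```python
def mrr_at_k(ranked_ids_per_query, relevant_ids, k):
    """Mean reciprocal rank with cutoff k.

    ranked_ids_per_query: one ranked list of candidate ids per query.
    relevant_ids: the single correct id for each query (assumption).
    A query whose correct id is not in the top k contributes 0.
    """
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids):
        for rank, item in enumerate(ranked[:k], start=1):
            if item == relevant:
                total += 1.0 / rank
                break
    return total / len(relevant_ids)

# Toy example: the correct id "a" is ranked 1st, 2nd, and 3rd across three queries.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
relevant = ["a", "a", "a"]
mrr1 = mrr_at_k(rankings, relevant, k=1)  # only the first query scores
mrr3 = mrr_at_k(rankings, relevant, k=3)  # (1 + 1/2 + 1/3) / 3
```

Because a larger cutoff can only add reciprocal-rank terms, MRR@k is non-decreasing in k, which is consistent with the rows of the table.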