Model Card: clip-imageclef
OpenAI CLIP model fine-tuned on image-caption pairs from the Caption Prediction dataset provided for the ImageCLEF 2017 competition. The model was evaluated before and after fine-tuning; MRR@10 improved from 0.57 to 0.88.
September 6, 2021
The base model is the OpenAI CLIP model. It uses a ViT-B/32 Transformer as the image encoder and a masked self-attention Transformer as the text encoder. The two encoders are trained to maximize the similarity of matching (image, text) pairs via a contrastive loss.
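For intuition, the contrastive objective scores every image in a batch against every text and pushes the matching pairs (the diagonal of the similarity matrix) above the mismatched ones. A minimal sketch of this symmetric InfoNCE-style loss, assuming pre-computed embeddings and a fixed temperature (CLIP actually learns the temperature):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, text) pairs.

    image_embeds, text_embeds: (batch, dim) tensors where matching
    pairs share the same row index. temperature is illustrative here.
    """
    # L2-normalize so dot products are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image->text and text->image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```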
The fine-tuning can be reproduced using code from the GitHub repository elsevierlabs-os/clip-image-search.
```python
from transformers import CLIPModel, CLIPProcessor

# Fine-tuned model weights; the processor (tokenizer + image transforms)
# is unchanged from the base checkpoint.
model = CLIPModel.from_pretrained("sujitpal/clip-imageclef")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
output = model(**inputs)
```
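For retrieval-style use, the returned output exposes `logits_per_image`, a (num_images, num_captions) similarity matrix that can be ranked directly. A small sketch continuing from the snippet above (`captions` and `images` are the same placeholder inputs):

```python
# Softmax over captions turns each row into a distribution over candidates.
probs = output.logits_per_image.softmax(dim=-1)  # shape: (num_images, num_captions)
best = probs.argmax(dim=-1)

for i, j in enumerate(best.tolist()):
    print(f"image {i}: best caption = {captions[j]!r} (p={probs[i, j]:.3f})")
```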
| Model | MRR@1 | MRR@3 | MRR@5 | MRR@10 | MRR@20 |
|-------|-------|-------|-------|--------|--------|
| zero-shot CLIP (baseline) | 0.426 | 0.534 | 0.558 | 0.573 | 0.578 |
| clip-imageclef (this model) | 0.802 | 0.872 | 0.877 | 0.879 | 0.880 |
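MRR@k is the reciprocal rank of the first relevant result within the top k retrieved items, averaged over queries. A minimal sketch of the metric (function names are illustrative, not taken from the evaluation code):

```python
def mrr_at_k(ranked_ids, relevant_id, k=10):
    """Reciprocal rank of the relevant item if it appears in the top k, else 0."""
    ranked = list(ranked_ids)[:k]
    return 1.0 / (ranked.index(relevant_id) + 1) if relevant_id in ranked else 0.0

def mean_mrr_at_k(results, k=10):
    """results: iterable of (ranked_ids, relevant_id) pairs, one per query."""
    results = list(results)
    return sum(mrr_at_k(r, rel, k) for r, rel in results) / len(results)

# Example: relevant item at rank 2 within the top 10 -> reciprocal rank 0.5
assert mrr_at_k(["b", "a", "c"], "a", k=10) == 0.5
```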