---
language:
- en
tags:
- multimodal
- language
- vision
- image-search
- pytorch
license:
- mit
metrics:
- MRR
---
### Model Card: clip-imageclef
### Model Details
The [OpenAI CLIP model](https://openai.com/blog/clip/), fine-tuned on image-caption pairs from the [Caption Prediction dataset](https://www.imageclef.org/2017/caption) provided for the ImageCLEF 2017 competition. The model was evaluated before and after fine-tuning; MRR@10 was 0.57 and 0.88, respectively.
### Model Date
September 6, 2021
### Model Type
The base model is the OpenAI CLIP model. It uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
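
To make the contrastive objective concrete, here is a minimal, dependency-free sketch of a symmetric contrastive loss over a batch of matched (image, text) embeddings. It is an illustration under stated assumptions, not the training code used for this model: the function names are hypothetical, the embeddings are toy vectors, and the fixed `temperature=0.07` is only an illustrative value (in CLIP the temperature is a learned parameter).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image i, text i)."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # sim[i][j]: cosine similarity of image i and text j, divided by temperature
    sim = [[dot(img, txt) / temperature for txt in texts] for img in images]
    # Image -> text: each image should score its own caption highest
    loss_i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # Text -> image: each caption should score its own image highest
    loss_t2i = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2


# Toy batch (assumed data): matched pairs point in the same direction
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[2.0, 0.0], [0.0, 3.0]]
aligned = clip_contrastive_loss(images, texts)
shuffled = clip_contrastive_loss(images, texts[::-1])
```

With correctly paired embeddings the loss is close to zero, while swapping the captions makes it large, which is the behavior the objective rewards during training.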
### Fine-tuning
The fine-tuning can be reproduced using code from the GitHub repository [elsevierlabs-os/clip-image-search](https://github.com/elsevierlabs-os/clip-image-search#fine-tuning).
### Usage
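```python
from transformers import CLIPModel, CLIPProcessor

# Fine-tuned weights; the processor comes from the base CLIP checkpoint
model = CLIPModel.from_pretrained("sujitpal/clip-imageclef")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# captions: list of str; images: list of PIL images (supplied by the caller)
inputs = processor(text=captions, images=images,
                   return_tensors="pt", padding=True)
output = model(**inputs)
```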
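
For image search, the scores in `output` still need to be turned into a ranking. The sketch below shows one way to do that for a single caption, assuming its scores have been pulled out as a plain list (for example a row of `output.logits_per_text` converted with `.detach().tolist()`). The helper name `top_k_indices` and the score values are hypothetical.

```python
def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical similarity scores of one caption against four images
scores = [18.2, 24.7, 21.1, 9.5]
best = top_k_indices(scores, k=3)
```

Each returned index refers to a position in the `images` list passed to the processor.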
### Performance
| Model                            | MRR@1 | MRR@3 | MRR@5 | MRR@10 | MRR@20 |
| -------------------------------- | ----- | ----- | ----- | ----- | ----- |
| zero-shot CLIP (baseline) | 0.426 | 0.534 | 0.558 | 0.573 | 0.578 |
| clip-imageclef (this model) | 0.802 | 0.872 | 0.877 | 0.879 | 0.880 |
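
As a rough guide to reading these numbers, the following sketch computes MRR@k under the common definition: for each query, take the reciprocal of the rank at which the correct item appears within the top k results (0 if it does not appear), then average over queries. This is an assumption about the metric, not the evaluation script used for this model; the function name and toy data are hypothetical.

```python
def mrr_at_k(ranked_results, relevant, k):
    """Mean reciprocal rank, counting only hits within the top k."""
    total = 0.0
    for ranking, target in zip(ranked_results, relevant):
        top = ranking[:k]
        if target in top:
            total += 1.0 / (top.index(target) + 1)
    return total / len(ranked_results)


# Toy data (assumed): three queries whose correct result is always "a"
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
targets = ["a", "a", "a"]
mrr1 = mrr_at_k(rankings, targets, k=1)  # only the first query hits at rank 1
mrr3 = mrr_at_k(rankings, targets, k=3)  # hits at ranks 1, 2 and 3
```

Under this definition MRR@k can only grow or stay flat as k increases, which matches the pattern in each row of the table.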