---
language:
- en
tags:
- multimodal
- language
- vision
- image-search
- pytorch
license:
- mit
metrics:
- MRR
---

### Model Card: clip-imageclef

### Model Details

[OpenAI CLIP model](https://openai.com/blog/clip/) fine-tuned using image-caption pairs from the [Caption Prediction dataset](https://www.imageclef.org/2017/caption) provided for the ImageCLEF 2017 competition. The model was evaluated before and after fine-tuning; MRR@10 improved from 0.57 to 0.88.

### Model Date

September 6, 2021

### Model Type

The base model is the OpenAI CLIP model. It uses a ViT-B/32 Transformer as the image encoder and a masked self-attention Transformer as the text encoder. The two encoders are trained to maximize the similarity of matching (image, text) pairs via a contrastive loss.
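
As a rough illustration, the symmetric contrastive objective behind CLIP can be sketched in PyTorch as follows (a minimal sketch; the function name and temperature value are illustrative, not taken from the actual training code):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so the dot product becomes cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Pairwise similarities for the batch, scaled by a temperature.
    logits = image_embeds @ text_embeds.t() / temperature
    # Matching (image, text) pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```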

### Fine-tuning

The fine-tuning can be reproduced using code from the Github repository [elsevierlabs-os/clip-image-search](https://github.com/elsevierlabs-os/clip-image-search#fine-tuning).
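
The repository linked above contains the full training script. For orientation only, a single fine-tuning step with the Hugging Face API might look like the sketch below (`batch_captions` and `batch_images` are placeholder variables for one aligned batch, and the learning rate is illustrative):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

# One aligned batch: batch_captions is a list of strings,
# batch_images a list of PIL images of the same length.
inputs = processor(text=batch_captions, images=batch_images,
                   return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs, return_loss=True)  # built-in CLIP contrastive loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```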

### Usage

```python
from transformers import CLIPModel, CLIPProcessor

# Fine-tuned weights, with the original CLIP processor (tokenizer + image transforms).
model = CLIPModel.from_pretrained("sujitpal/clip-imageclef")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# captions: list of strings, images: list of PIL images
inputs = processor(text=captions, images=images,
                   return_tensors="pt", padding=True)
output = model(**inputs)
```
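
The returned output contains pairwise image-text similarity scores, which can be used directly for retrieval. For example, building on the snippet above:

```python
# logits_per_image has shape (num_images, num_captions);
# a higher score means a closer image-caption match.
probs = output.logits_per_image.softmax(dim=-1)
best_caption_per_image = probs.argmax(dim=-1)
```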

### Performance

| Model name                       | MRR@1 | MRR@3 | MRR@5 | MRR@10 | MRR@20 |
| -------------------------------- | ----- | ----- | ----- | ------ | ------ |
| zero-shot CLIP (baseline)        | 0.426 | 0.534 | 0.558 | 0.573 | 0.578 |
| clip-imageclef (this model)      | 0.802 | 0.872 | 0.877 | 0.879 | 0.880 |
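
Here MRR@k (Mean Reciprocal Rank at k) averages, over all queries, the reciprocal rank of the first correct match, counting only matches that appear in the top k results. A small illustrative sketch (the function name and inputs are hypothetical):

```python
def mrr_at_k(rankings, relevant_ids, k=10):
    """Mean reciprocal rank, counting only relevant items ranked within the top k."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_ids):
        top_k = ranked[:k]
        if relevant in top_k:
            total += 1.0 / (top_k.index(relevant) + 1)
    return total / len(rankings)
```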