---
language:
- en
tags:
- multimodal
- language
- vision
- image-search
- pytorch
license:
- mit
metrics:
- MRR
---
### Model Card: clip-imageclef
### Model Details
[OpenAI CLIP model](https://openai.com/blog/clip/) fine-tuned using image-caption pairs from the [Caption Prediction dataset](https://www.imageclef.org/2017/caption) provided for the ImageCLEF 2017 competition. The model was evaluated before and after fine-tuning; MRR@10 improved from 0.57 to 0.88.
### Model Date
September 6, 2021
### Model Type
The base model is the OpenAI CLIP model. It uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
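As an illustrative sketch (not part of the original card), the symmetric contrastive objective described above can be expressed as follows: matching (image, text) pairs lie on the diagonal of the similarity matrix, and cross-entropy over both axes pulls them together. The function name and `temperature` value are assumptions chosen for illustration.
```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize embeddings so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # (N, N) similarity matrix; matching pairs are on the diagonal.
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image-to-text and text-to-image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```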
### Fine-tuning
The fine-tuning can be reproduced using code from the Github repository [elsevierlabs-os/clip-image-search](https://github.com/elsevierlabs-os/clip-image-search#fine-tuning).
### Usage
```python
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("sujitpal/clip-imageclef")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# captions: list of strings, images: list of PIL images
inputs = processor(text=captions, images=images,
                   return_tensors="pt", padding=True)
output = model(**inputs)
```
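As a hedged follow-up (not from the original card), the model output exposes pairwise similarity logits that can be used to rank images against captions; `captions` and `images` are assumed to be a list of strings and a list of PIL images, as above.
```python
import torch

with torch.no_grad():
    output = model(**inputs)

# logits_per_text has shape (num_captions, num_images); higher means more similar.
probs = output.logits_per_text.softmax(dim=-1)
best_image_per_caption = probs.argmax(dim=-1)
```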
### Performance
| Model name                       | MRR@1 | MRR@3 | MRR@5 | MRR@10 | MRR@20 |
| -------------------------------- | ----- | ----- | ----- | ------ | ------ |
| zero-shot CLIP (baseline)        | 0.426 | 0.534 | 0.558 | 0.573  | 0.578  |
| clip-imageclef (this model)      | 0.802 | 0.872 | 0.877 | 0.879  | 0.880  |
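For reference, below is a minimal sketch (not from the original card) of how MRR@k is typically computed per query; the names `ranked_ids` and `relevant_id` are hypothetical.
```python
def mrr_at_k(ranked_ids, relevant_id, k=10):
    """Reciprocal rank of the relevant item if it appears in the top k, else 0."""
    top_k = list(ranked_ids)[:k]
    if relevant_id in top_k:
        return 1.0 / (top_k.index(relevant_id) + 1)
    return 0.0

# MRR@k for the benchmark is the mean of mrr_at_k over all queries.
```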