	---
	tags:
	- vision
	---

	# Model Card: clip-rsicd

	## Model Details

	This model is a fine-tuned version of [CLIP by OpenAI](https://huggingface.co/openai/clip-vit-base-patch32). It is designed to improve zero-shot image classification, text-to-image retrieval, and image-to-image retrieval, specifically on remote sensing images.

	### Model Date

	July 2021

	### Model Type

	The base model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
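
As a rough illustration of the contrastive objective described above, the following is a minimal pure-Python sketch of a symmetric cross-entropy loss over an image-text similarity matrix. It is not the training code used for this model: the function names are hypothetical, and the fixed `temperature=0.07` is a simplification (CLIP learns this value during training).

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is assumed to be a matching pair."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # image -> text: each row should put its mass on the diagonal entry
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # text -> image: same idea over the columns
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

The loss is small when each image embedding is closest to its own caption embedding and grows when pairs are mismatched.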

	### Model Version

	We release several checkpoints for the `clip-rsicd` model. Refer to [our GitHub repo](https://github.com/arampacha/CLIP-rsicd#evaluation-results) for the zero-shot classification performance of each checkpoint.

	### Training

	To reproduce the fine-tuning procedure, use the released [script](https://github.com/arampacha/CLIP-rsicd/blob/master/run_clip_flax_tv.py).
	The model was trained with a batch size of 1024, using the Adafactor optimizer with linear warmup and decay and a peak learning rate of 1e-4, on one TPU-v3-8.
	The full log of the training run can be found on [WandB](https://wandb.ai/wandb/hf-flax-clip-rsicd/runs/2dj1exsw).
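
The shape of a linear warmup and decay schedule can be sketched in a few lines of plain Python. Only the peak learning rate of 1e-4 comes from this card; `warmup_steps` and `total_steps` are placeholder values chosen for illustration, and the function is a hypothetical stand-in for whatever schedule the training script actually builds.

```python
def linear_warmup_decay(step, peak_lr=1e-4, warmup_steps=100, total_steps=1000):
    """Learning rate at `step`: linear ramp up to peak_lr, then linear decay to 0.

    warmup_steps and total_steps are illustrative assumptions, not the values
    used to train this model.
    """
    if step < warmup_steps:
        # Warmup phase: 0 -> peak_lr
        return peak_lr * step / warmup_steps
    # Decay phase: peak_lr -> 0 at total_steps (clamped at 0 afterwards)
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

With these placeholder values the rate rises linearly for the first 100 steps, peaks at 1e-4, and falls linearly to zero at step 1000.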

	### Demo

	Try the model's text-to-image and image-to-image retrieval capabilities in [this demo](https://huggingface.co/spaces/sujitpal/clip-rsicd-demo).


	### Documents

	- [Fine-tuning CLIP on RSICD with HuggingFace and flax/jax on colab using TPU](https://colab.research.google.com/github/arampacha/CLIP-rsicd/blob/master/nbs/Fine_tuning_CLIP_with_HF_on_TPU.ipynb)


	### Use with Transformers

	```python3
	from PIL import Image
	import requests

	from transformers import CLIPProcessor, CLIPModel

	model = CLIPModel.from_pretrained("flax-community/clip-rsicd-v2")
	processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd-v2")

	url = "https://raw.githubusercontent.com/arampacha/CLIP-rsicd/master/data/stadium_1.jpg"
	image = Image.open(requests.get(url, stream=True).raw)

	labels = ["residential area", "playground", "stadium", "forest", "airport"]
	inputs = processor(text=[f"a photo of a {l}" for l in labels], images=image, return_tensors="pt", padding=True)

	outputs = model(**inputs)
	logits_per_image = outputs.logits_per_image # this is the image-text similarity score
	probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
	# print each label next to its predicted probability
	for label, prob in zip(labels, probs[0]):
	    print(f"{label:<16} {prob:.4f}")
	```
	[Try it on Colab](https://colab.research.google.com/github/arampacha/CLIP-rsicd/blob/master/nbs/clip_rsicd_zero_shot.ipynb)


	## Model Use

	### Intended Use

	The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification.

	In addition, we can imagine applications in defense and law enforcement, climate change and global warming, and even some consumer applications. A partial list of applications can be found [here](https://github.com/arampacha/CLIP-rsicd#applications). In general we think such models can be useful as digital assistants for humans engaged in searching through large collections of images.

	We also hope it can be used for interdisciplinary studies of the potential impact of such models. The CLIP paper includes a discussion of potential downstream impacts that serves as an example of this sort of analysis.


	#### Primary intended uses

	The primary intended users of these models are AI researchers.

	We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models.



	## Data

	The model was trained on publicly available remote sensing image captioning datasets, namely [RSICD](https://github.com/201528014227051/RSICD_optimal), [UCM](https://mega.nz/folder/wCpSzSoS#RXzIlrv--TDt3ENZdKN8JA), and [Sydney](https://mega.nz/folder/pG4yTYYA#4c4buNFLibryZnlujsrwEQ). More information on the datasets can be found on [our project page](https://github.com/arampacha/CLIP-rsicd#dataset).



	## Performance and Limitations

	### Performance

	| Model name                       | k=1   | k=3   | k=5   | k=10  |
	| -------------------------------- | ----- | ----- | ----- | ----- |
	| original CLIP                    | 0.572 | 0.745 | 0.837 | 0.939 |
	| clip-rsicd-v2 (this model)       | 0.883 | 0.968 | 0.982 | 0.998 |
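
Assuming the `k` columns report top-k accuracy on zero-shot classification (a prediction counts as correct when the true label is among the `k` highest-scoring labels), the metric can be sketched as follows. The function name and the toy scores are hypothetical; this is not the evaluation code from the project repo.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of samples whose true label index is among the k highest scores.

    scores: list of per-sample score lists (one score per candidate label)
    true_labels: list of true label indices, one per sample
    """
    hits = 0
    for row, true_idx in zip(scores, true_labels):
        # Indices of the k best-scoring labels for this sample
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy example: 3 images scored against 3 candidate labels
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
true_labels = [1, 1, 0]
for k in (1, 2, 3):
    print(f"k={k}: {top_k_accuracy(scores, true_labels, k):.3f}")
```

In practice, each row of `scores` would be the `logits_per_image` output for one image, computed against all class prompts.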

	### Limitations

	The model is fine-tuned on remote sensing image (RSI) data, but it may still carry some of the biases and limitations of the original CLIP model. Refer to the [CLIP model card](https://huggingface.co/openai/clip-vit-base-patch32#limitations) for details.