---
tags:
- vision
---

# Model Card: clip-rsicd

## Model Details

This model is a fine-tuned [CLIP by OpenAI](https://huggingface.co/openai/clip-vit-base-patch32). It is designed to improve zero-shot image classification, text-to-image and image-to-image retrieval, specifically on remote sensing images.

### Model Date

July 2021

### Model Type

The base model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
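
For intuition, here is a minimal PyTorch sketch of that contrastive objective (the actual training code in our repo is written in Flax/JAX; this is an illustrative simplification, not the training implementation):

```python3
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    """Symmetric cross-entropy over the image-text similarity matrix.

    image_emb, text_emb: [batch, dim] outputs of the two encoders for matched pairs.
    logit_scale: temperature factor (a learnable scalar in CLIP).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()            # [batch, batch] similarity scores
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)             # each image should match its own text
    loss_texts = F.cross_entropy(logits.t(), targets)          # each text should match its own image
    return (loss_images + loss_texts) / 2
```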

### Model Version

We release several checkpoints for the `clip-rsicd` model. Refer to [our github repo](https://github.com/arampacha/CLIP-rsicd) for zero-shot classification results for each of them.

### Training

To reproduce the fine-tuning procedure you can use the released [script](https://github.com/arampacha/CLIP-rsicd/blob/master/run_clip_flax_tv.py).
The model was trained with a batch size of 1024 and the Adafactor optimizer with linear warmup and decay, with a peak learning rate of 1e-4, on a single TPU v3-8.
The full log of the training run that produced this model can be found on [WandB](https://wandb.ai/wandb/hf-flax-clip-rsicd/runs/2dj1exsw).
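
As a reference point, the optimizer and schedule described above could be set up with `optax` roughly as follows (a sketch only; the step counts are placeholders, and the released script remains the authoritative recipe):

```python3
import optax

# Placeholder step counts; the actual values depend on dataset size and number of epochs.
total_steps = 10_000
warmup_steps = 1_000
peak_lr = 1e-4

# Linear warmup to the peak learning rate, followed by linear decay to zero.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=peak_lr, transition_steps=warmup_steps),
        optax.linear_schedule(init_value=peak_lr, end_value=0.0, transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

# Adafactor driven by the schedule.
optimizer = optax.adafactor(learning_rate=schedule)
```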

### Demo

Check out the model's text-to-image and image-to-image retrieval capabilities using [this demo](https://huggingface.co/spaces/sujitpal/clip-rsicd-demo).

### Documents

- [Fine-tuning CLIP on RSICD with HuggingFace and flax/jax on colab using TPU](https://colab.research.google.com/github/arampacha/CLIP-rsicd/blob/master/nbs/Fine_tuning_CLIP_with_HF_on_TPU.ipynb)

### Use with Transformers

```python3
from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("flax-community/clip-rsicd-v2")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd-v2")

url = "https://raw.githubusercontent.com/arampacha/CLIP-rsicd/master/data/stadium_1.jpg"
image = Image.open(requests.get(url, stream=True).raw)

labels = ["residential area", "playground", "stadium", "forest", "airport"]
inputs = processor(text=[f"a photo of a {l}" for l in labels], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)      # we can take the softmax to get the label probabilities
for l, p in zip(labels, probs[0]):
    print(f"{l:<16} {p:.4f}")
```
[Try it on colab](https://colab.research.google.com/github/arampacha/CLIP-rsicd/blob/master/nbs/clip_rsicd_zero_shot.ipynb)
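
Building on the snippet above, here is a small sketch of text-to-image retrieval with the same model and processor (the candidate image list is a placeholder; substitute your own images):

```python3
import torch

# Placeholder candidates; replace with your own image URLs or file paths.
candidate_urls = [
    "https://raw.githubusercontent.com/arampacha/CLIP-rsicd/master/data/stadium_1.jpg",
]
candidates = [Image.open(requests.get(u, stream=True).raw) for u in candidate_urls]

with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=candidates, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=["a photo of a stadium"], return_tensors="pt", padding=True))

# Rank candidates by cosine similarity to the text query.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)
for url, score in sorted(zip(candidate_urls, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {url}")
```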

## Model Use

### Intended Use

The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis.

#### Primary intended uses

The primary intended users of these models are AI researchers.

We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models.

## Data

The model was trained on publicly available remote sensing image captioning datasets, namely [RSICD](https://github.com/201528014227051/RSICD_optimal), [UCM](https://mega.nz/folder/wCpSzSoS#RXzIlrv--TDt3ENZdKN8JA) and [Sydney](https://mega.nz/folder/pG4yTYYA#4c4buNFLibryZnlujsrwEQ).

## Performance and Limitations

### Performance

| Model name                 | k=1       | k=3       | k=5       | k=10      |
| -------------------------- | --------- | --------- | --------- | --------- |
| original CLIP              | 0.572     | 0.745     | 0.837     | 0.939     |
| clip-rsicd-v2 (this model) | **0.883** | **0.968** | **0.982** | **0.998** |
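
Assuming the k=1/3/5/10 columns are top-k zero-shot classification accuracies (the fraction of test images whose true label is among the model's k highest-scoring labels; see the github repo for the exact evaluation), such a metric can be computed along these lines:

```python3
import torch

def top_k_accuracy(logits_per_image, true_label_ids, k):
    """Fraction of images whose true label appears among the top-k predictions.

    logits_per_image: [num_images, num_labels] similarity scores (e.g. outputs.logits_per_image).
    true_label_ids:   [num_images] index of the correct label for each image.
    """
    topk = logits_per_image.topk(k, dim=-1).indices               # [num_images, k]
    hits = (topk == true_label_ids.unsqueeze(-1)).any(dim=-1)     # [num_images]
    return hits.float().mean().item()
```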

## Limitations

The model is fine-tuned on remote sensing (RSI) data but may retain some of the biases and limitations of the original CLIP model. Refer to the [CLIP model card](https://huggingface.co/openai/clip-vit-base-patch32#limitations) for details on those.