---
license: cc-by-4.0
---


**Model Details**

VisMin-CLIP is a fine-tuned version of the pretrained CLIP model, designed to enhance fine-grained and compositional abilities beyond the base model. Fine-tuning was conducted using the [OpenCLIP](https://github.com/mlfoundations/open_clip) library, an open-source implementation of OpenAI’s CLIP.


**Model Summary**

- Model Date: July 2024
- Model type: Vision-language Foundation Model (image+text)
- Parent Model: [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)

**Usage**

Like any other OpenCLIP model, VisMin-CLIP can be loaded directly from its checkpoint:

```python
import open_clip
import torch

# Pick a device for inference.
device = "cuda" if torch.cuda.is_available() else "cpu"

model_cls_name = "ViT-L-14"
checkpoint_path = "path/to/checkpoint"
model, _, preprocess = open_clip.create_model_and_transforms(
    model_name=model_cls_name, pretrained=checkpoint_path, device=device
)
tokenizer = open_clip.get_tokenizer(model_cls_name)

model = model.to(device).eval()
```
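
If the fine-tuned weights are also published on the Hugging Face Hub with an OpenCLIP config, OpenCLIP's `hf-hub:` prefix offers an alternative way to load them. The sketch below assumes such a repository exists; `username/vismin-clip` is a hypothetical placeholder id, not a confirmed location for this model:

```python
# A sketch, assuming the weights are hosted on the Hugging Face Hub with an
# OpenCLIP config; "username/vismin-clip" is a hypothetical placeholder repo id.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:username/vismin-clip")
tokenizer = open_clip.get_tokenizer("hf-hub:username/vismin-clip")
model = model.eval()
```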

Once loaded, you can encode images and text to perform zero-shot image classification:

```python
import requests
import torch
from PIL import Image

# Download and preprocess an example image, then tokenize candidate labels.
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0).to(device)
text = tokenizer(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
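
Because VisMin-CLIP is tuned for fine-grained, minimal-change understanding, a natural use is scoring one image against a pair of nearly identical captions. The snippet below is a minimal sketch of that idea, reusing `model`, `tokenizer`, `image`, and `device` from the blocks above; the captions are illustrative examples, not items from the VisMin benchmark.

```python
# A minimal sketch: rank two minimally different captions for a single image.
# The captions are hypothetical examples, not taken from the VisMin benchmark.
captions = [
    "two cats lying on a couch",
    "two cats standing on a couch",
]
text = tokenizer(captions).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(0)

print("Caption scores:", scores.tolist())
print("Best match:", captions[scores.argmax().item()])
```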


**BibTeX**

If you use VisMin-CLIP in your work, please cite it as follows:

```
@article{vismin2024,
  title={VisMin: Visual Minimal-Change Understanding},
  author={Awal, Rabiul and Ahmadi, Saba and Zhang, Le and Agrawal, Aishwarya},
  year={2024}
}
```