rabiulawal committed on
Commit e7c9abf · verified · 1 Parent(s): be0b9d7

Added model usage code snippet

Files changed (1):
  README.md +69 -3
README.md CHANGED
@@ -1,3 +1,69 @@
- ---
- license: cc-by-4.0
- ---
+ ---
+ license: cc-by-4.0
+ ---
+
+
+ **Model Details**
+
+ VisMin-CLIP is a fine-tuned version of the pretrained CLIP model, designed to improve fine-grained and compositional understanding beyond the base model. Fine-tuning was conducted with the [OpenCLIP](https://github.com/mlfoundations/open_clip) library, an open-source implementation of OpenAI’s CLIP.
+
+
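+ For context, OpenCLIP fine-tuning is contrastive: matched image and text embeddings are pulled together under a CLIP loss. The sketch below shows only that generic setup, not the actual VisMin-CLIP recipe; the training data, hyperparameters, and any hard-negative construction are not described in this card, and `dataloader` is a hypothetical iterator over (images, captions) batches.
+
+ ```python
+ import torch
+ import open_clip
+ from open_clip import ClipLoss
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # Start from the OpenAI ViT-L-14 weights, per the model summary below.
+ model, _, preprocess = open_clip.create_model_and_transforms(
+     "ViT-L-14", pretrained="openai", device=device
+ )
+ tokenizer = open_clip.get_tokenizer("ViT-L-14")
+
+ loss_fn = ClipLoss()
+ optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=0.1)
+
+ model.train()
+ for images, captions in dataloader:  # hypothetical (image tensor batch, list of strings)
+     images, texts = images.to(device), tokenizer(captions).to(device)
+     image_features, text_features, logit_scale = model(images, texts)
+     loss = loss_fn(image_features, text_features, logit_scale)
+     optimizer.zero_grad()
+     loss.backward()
+     optimizer.step()
+ ```
+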
+ **Model Summary**
+
+ - Model date: July 2024
+ - Model type: Vision-language foundation model (image+text)
+ - Parent model: [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)
+
+ **Usage**
+
+ Like any OpenCLIP model, VisMin-CLIP can be loaded directly from a checkpoint:
+
+
+ ```python
+ import torch
+ import open_clip
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ model_cls_name = "ViT-L-14"
+ checkpoint_path = "path/to/checkpoint"
+ model, _, preprocess = open_clip.create_model_and_transforms(
+     model_name=model_cls_name, pretrained=checkpoint_path, device=device
+ )
+ tokenizer = open_clip.get_tokenizer(model_cls_name)
+
+ model = model.to(device).eval()
+ ```
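+
+ If the weights are hosted on the Hugging Face Hub, you can resolve a local checkpoint path first; the repository and file names below are placeholders, not confirmed by this card:
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ # Hypothetical repo_id/filename; substitute the actual VisMin-CLIP repository.
+ checkpoint_path = hf_hub_download(
+     repo_id="rabiulawal/vismin-clip",
+     filename="open_clip_pytorch_model.bin",
+ )
+ ```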
+
+ Once loaded, you can encode images and text for zero-shot image classification:
+
+ ```python
+ import requests
+ import torch
+ from PIL import Image
+
+ url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
+ image = Image.open(requests.get(url, stream=True).raw)
+ image = preprocess(image).unsqueeze(0).to(device)
+ text = tokenizer(["a diagram", "a dog", "a cat"]).to(device)
+
+ with torch.no_grad(), torch.cuda.amp.autocast():
+     image_features = model.encode_image(image)
+     text_features = model.encode_text(text)
+     image_features /= image_features.norm(dim=-1, keepdim=True)
+     text_features /= text_features.norm(dim=-1, keepdim=True)
+
+     text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
+
+ print("Label probs:", text_probs)
+ ```
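+
+ The test image (COCO val2017 `000000039769.jpg`) shows two cats, so most of the probability mass should land on "a cat"; exact values depend on the checkpoint.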
+
+
+ **BibTeX**
+
+ If you use VisMin-CLIP in your work, please cite it as follows:
+
+ ```bibtex
+ @article{vismin2024,
+   title={VisMin: Visual Minimal-Change Understanding},
+   author={Awal, Rabiul and Ahmadi, Saba and Zhang, Le and Agrawal, Aishwarya},
+   year={2024}
+ }
+ ```