---
tags:
- pytorch_model_hub_mixin
- model_hub_mixin
license: mit
datasets:
- mlfoundations/datacomp_1b
---

## Official implementation of pre-trained ViT-B/16 ProLIP on DataComp 1B

- This checkpoint is a pre-trained ViT-B/16 ProLIP model.
- Pre-training dataset: DataComp 1B (12.8B seen samples)

### Overview

- Paper: https://arxiv.org/abs/2410.18857
- GitHub: https://github.com/naver-ai/prolip
- More models are available at https://huggingface.co/collections/SanghyukChun/prolip-6712595dfc87fd8597350291

### Performance overview

- Zero-shot ImageNet-1k top-1 accuracy: 74.6%
- Zero-shot ImageNet distribution shifts: 63.0%
- Zero-shot VTAB performance: 63.7%
- Zero-shot retrieval performance: 59.6%
- Average zero-shot performance on 38 tasks: 63.3%
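
The numbers above come from zero-shot evaluations. As a rough illustration of how prompt-based zero-shot classification can be set up with this checkpoint, here is a minimal sketch; the class names and prompt template are illustrative assumptions (not the official evaluation protocol), and the loading calls mirror the usage example that follows.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPProcessor

from prolip.model import ProLIPHF
from prolip.tokenizer import HFTokenizer

# Load the checkpoint the same way as in the usage example below.
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
model = ProLIPHF.from_pretrained("SanghyukChun/ProLIP-ViT-B-16-DC-1B-12_8M")
tokenizer = HFTokenizer("timm/ViT-B-16-SigLIP", context_length=64, clean="canonicalize")

# Illustrative class names and prompt template (assumptions, not the official protocol).
class_names = ["cat", "dog", "remote control"]
prompts = tokenizer([f"a photo of a {name}" for name in class_names])

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]

with torch.no_grad():
    outputs = model(image=pixel_values, text=prompts)

# Rank the class prompts by mean-embedding similarity, as in the usage example below.
logits = outputs["image_features"]["mean"] @ outputs["text_features"]["mean"].T
pred = logits.argmax(dim=-1)
print("Predicted class:", class_names[pred.item()])
```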

### Usage example

```python
import requests
from PIL import Image

import torch
from prolip.model import ProLIPHF
from transformers import CLIPProcessor
from prolip.tokenizer import HFTokenizer

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Image preprocessing uses the OpenAI CLIP ViT-B/16 processor; text uses the
# SigLIP tokenizer with a 64-token context length.
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
model = ProLIPHF.from_pretrained("SanghyukChun/ProLIP-ViT-B-16-DC-1B-12_8M")
tokenizer = HFTokenizer("timm/ViT-B-16-SigLIP", context_length=64, clean="canonicalize")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt", padding=True)
texts = ["A couple of cats laying on top of a pink blanket.", "A man walks through a flooded road during a rainstorm", "photo"]
texts = tokenizer(texts)

outputs = model(image=inputs["pixel_values"], text=texts)

# Similarity between mean image and text embeddings.
l2_logit = outputs["image_features"]["mean"] @ outputs["text_features"]["mean"].T
# Scalar uncertainty per image / per caption (sum of exponentiated "std" outputs).
i_unc = torch.exp(outputs["image_features"]["std"]).sum(dim=-1)
t_unc = torch.exp(outputs["text_features"]["std"]).sum(dim=-1)
# Uncertainty-aware logits (CSD): penalize each match by half the uncertainty
# of the candidates being retrieved.
csd_logit = l2_logit - 0.5 * t_unc
csd_logit2 = l2_logit.T - 0.5 * i_unc
print("Mean-only image-to-text logits (by L2 distance):", l2_logit)
print("Uncertainty-aware image-to-text logits (by CSD):", csd_logit)
print("Uncertainty-aware text-to-image logits (by CSD):", csd_logit2.T)
print("Image uncertainty: ", i_unc)
print("Text uncertainty: ", t_unc)
```
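
The CSD logits above can be turned into match probabilities with a plain softmax. The sketch below simply continues from the snippet above; the temperature value is an arbitrary illustrative choice, not the learned logit scale of the checkpoint.

```python
import torch.nn.functional as F

# Continues from the snippet above: convert the image-to-text CSD logits into
# a probability distribution over the three candidate captions.
temperature = 100.0  # illustrative scaling only, not the checkpoint's learned logit scale
probs = F.softmax(temperature * csd_logit, dim=-1)
print("Caption probabilities for the image:", probs)
```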

### Citation

```bibtex
@article{chun2024prolip,
    title={Probabilistic Language-Image Pre-Training},
    author={Chun, Sanghyuk and Kim, Wonjae and Park, Song and Yun, Sangdoo},
    journal={arXiv preprint arXiv:2410.18857},
    year={2024}
}
```