---
license: other
license_name: apple-sample-code-license
license_link: LICENSE
---
A CLIP (Contrastive Language-Image Pre-training) model trained on DFN-5B.
Data Filtering Networks (DFNs) are small networks used to automatically filter large pools of uncurated data.
This model was trained on 5B images that were filtered from a pool of 43B uncurated image-text pairs
(12.8B image-text pairs from CommonPool-12.8B + 30B additional public image-text pairs).

This model has been converted to PyTorch from the original JAX checkpoints from Axlearn (https://github.com/apple/axlearn).
These weights are directly usable in OpenCLIP (image + text).

## Model Details

- **Model Type:** Contrastive Image-Text, Zero-Shot Image Classification.
- **Dataset:** DFN-5B
- **Papers:**
  - Data Filtering Networks: https://arxiv.org/abs/2309.17425
- **Samples Seen:** 39B (224 x 224) + 5B (384 x 384)

## Model Metrics

| Eval Dataset           |       Metric |
|:-----------------------|-------------:|
| ImageNet 1k            |      0.84206 |
| Caltech-101            |     0.951389 |
| CIFAR-10               |       0.9879 |
| CIFAR-100              |       0.9041 |
| CLEVR Counts           |       0.3424 |
| CLEVR Distance         |     0.214933 |
| Country211             |       0.3591 |
| Describable Textures   |     0.707979 |
| EuroSAT                |     0.608333 |
| FGVC Aircraft          |     0.657362 |
| Food-101               |     0.962099 |
| GTSRB                  |     0.681077 |
| ImageNet Sketch        |      0.72214 |
| ImageNet v2            |       0.7787 |
| ImageNet-A             |        0.802 |
| ImageNet-O             |       0.3945 |
| ImageNet-R             |        0.929 |
| KITTI Vehicle Distance |      0.40225 |
| MNIST                  |       0.8372 |
| ObjectNet              |     0.796867 |
| Oxford Flowers-102     |     0.896257 |
| Oxford-IIIT Pet        |     0.968432 |
| Pascal VOC 2007        |       0.7914 |
| PatchCamelyon          |     0.695953 |
| Rendered SST2          |     0.566722 |
| RESISC45               |     0.755079 |
| Stanford Cars          |      0.95809 |
| STL-10                 |     0.991125 |
| SUN397                 |     0.768257 |
| SVHN                   |     0.671251 |
| Flickr                 |       0.8663 |
| MSCOCO                 |     0.636489 |
| WinoGAViL              |     0.570759 |
| iWildCam               |     0.215716 |
| Camelyon17             |     0.711536 |
| FMoW                   |     0.209024 |
| Dollar Street          |     0.711449 |
| GeoDE                  |     0.921503 |
| **Average**            | **0.704914** |

## Model Usage
### With OpenCLIP
```python
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Load the model and its preprocessing transform from the Hugging Face Hub
model, preprocess = create_model_from_pretrained('hf-hub:apple/DFN5B-CLIP-ViT-H-14-384')
tokenizer = get_tokenizer('ViT-H-14')

# Fetch an example image and apply the model's preprocessing
image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

# Tokenize the candidate labels
labels_list = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(labels_list, context_length=model.context_length)

with torch.no_grad(), torch.cuda.amp.autocast():
    # Encode and L2-normalize image and text features
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Sigmoid readout over scaled similarities plus the model's logit bias
    text_probs = torch.sigmoid(image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias)

zipped_list = list(zip(labels_list, [round(p.item(), 3) for p in text_probs[0]]))
print("Label probabilities: ", zipped_list)
```
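
The readout above is sigmoid-based and assumes the loaded checkpoint exposes a `logit_bias`. If you instead want probabilities normalized over the label set, a softmax over the scaled cosine similarities is the usual CLIP-style zero-shot readout; the following is a minimal sketch reusing `model`, `labels_list`, `image_features`, and `text_features` from the example above:

```python
# Alternative readout (standard CLIP-style): softmax over scaled cosine similarities.
# Reuses model, labels_list, image_features, and text_features from the example above.
with torch.no_grad():
    text_probs = (model.logit_scale.exp() * image_features @ text_features.T).softmax(dim=-1)

print("Label probabilities: ", list(zip(labels_list, [round(p.item(), 3) for p in text_probs[0]])))
```

Both readouts rank the labels identically; they differ only in how the similarity scores are calibrated into probabilities.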

## Citation
```bibtex
@article{fang2023data,
  title={Data Filtering Networks},
  author={Fang, Alex and Jose, Albin Madappally and Jain, Amit and Schmidt, Ludwig and Toshev, Alexander and Shankar, Vaishaal},
  journal={arXiv preprint arXiv:2309.17425},
  year={2023}
}
```