Files changed (1)
  1. README.md +96 -20
README.md CHANGED
@@ -1,6 +1,11 @@
---
tags:
- vision
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
  candidate_labels: playing music, playing sports
@@ -11,20 +16,65 @@ widget:

The [ALIGN](https://arxiv.org/abs/2102.05918) model was proposed in "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision" by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig.
ALIGN features a dual-encoder architecture with [EfficientNet](https://huggingface.co/docs/transformers/main/en/model_doc/efficientnet#efficientnet) as its vision encoder and [BERT](https://huggingface.co/docs/transformers/main/en/model_doc/bert) as its text encoder, and learns to align visual and text representations with contrastive learning. Unlike previous work, ALIGN leverages a massive noisy dataset and shows that the scale of the corpus can be used to achieve SOTA representations with a simple recipe.
- The code for ALIGN was not publicly released, the base model is converted from the original implementation of the Kakao Brain team.
-
- ## Model Description
- The abstract from the paper is the following:
-
- Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or ALIGN all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enables zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.

## Use with Transformers

```python3
- from PIL import Image
import requests

from transformers import AlignProcessor, AlignModel

processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
@@ -32,14 +82,49 @@ model = AlignModel.from_pretrained("kakaobrain/align-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

- inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt")

- outputs = model(**inputs)
- logits_per_image = outputs.logits_per_image # this is the image-text similarity score
- probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
```

## Model Use

@@ -53,12 +138,3 @@ The primary intended users of these models are AI researchers.

We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models.

- ### Out-of-Scope Use Cases
-
- **Any** deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task specific testing especially given the variability of ALIGN’s performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful.
-
- Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of performance of the model. This is because the use of artificial intelligence for tasks such as these can be premature currently given the lack of testing norms and checks to ensure its fair use.
-
- Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases.
-
-

---
+ language: en
tags:
+ - align
- vision
+ - multi-modal
+ datasets:
+ - coyo-700m
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
  candidate_labels: playing music, playing sports

The [ALIGN](https://arxiv.org/abs/2102.05918) model was proposed in "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision" by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig.
ALIGN features a dual-encoder architecture with [EfficientNet](https://huggingface.co/docs/transformers/main/en/model_doc/efficientnet#efficientnet) as its vision encoder and [BERT](https://huggingface.co/docs/transformers/main/en/model_doc/bert) as its text encoder, and learns to align visual and text representations with contrastive learning. Unlike previous work, ALIGN leverages a massive noisy dataset and shows that the scale of the corpus can be used to achieve SOTA representations with a simple recipe.

+ The code for ALIGN was not publicly released; this base model is converted from the Kakao Brain team's original implementation. It follows the same architecture and hyperparameters as the original Google model but is trained on the open-source [COYO](https://github.com/kakaobrain/coyo-dataset) dataset. Google's [ALIGN](https://ai.googleblog.com/2021/05/align-scaling-up-visual-and-vision.html) model, while trained on a huge dataset of 1.8 billion image-text pairs, cannot be replicated because the dataset is not public. Kakao Brain's ALIGN matches or outperforms Google ALIGN's reported metrics despite being trained on the much smaller, albeit carefully curated, COYO-700M dataset.
+ <p>
+ <center>
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/132_vit_align/align-performance.png" alt="ALIGN performance"/>
+ </center>
+ </p>
+
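+ For intuition about the dual-encoder training described above, the following is a minimal, illustrative sketch of a symmetric contrastive (InfoNCE-style) objective over a batch of paired embeddings. It is not the model's actual training code; the function and the default temperature value are assumptions for illustration only.
+ ```python3
+ import torch
+ import torch.nn.functional as F
+
+ def contrastive_loss(image_embeds, text_embeds, temperature=0.05):
+     # image_embeds, text_embeds: L2-normalized (batch, dim) tensors of paired examples
+     logits = image_embeds @ text_embeds.T / temperature            # pairwise similarities, scaled
+     targets = torch.arange(logits.size(0), device=logits.device)  # matching pairs lie on the diagonal
+     loss_i2t = F.cross_entropy(logits, targets)                    # image-to-text direction
+     loss_t2i = F.cross_entropy(logits.T, targets)                  # text-to-image direction
+     return (loss_i2t + loss_t2i) / 2
+ ```
+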
+ ## COYO-700M Dataset
+ [COYO](https://github.com/kakaobrain/coyo-dataset#dataset-preview) is an open-source image-text dataset of 700 million pairs, similar to Google's `ALIGN 1.8B` collection of "noisy" alt-text and image pairs crawled from webpages. Both `COYO-700M` and `ALIGN 1.8B` are "noisy" in the sense that only minimal filtering was applied. `COYO` is also similar to the other open-source image-text dataset, `LAION`, with the following differences: while `LAION 2B` is a much larger dataset of 2 billion English pairs compared to `COYO`'s 700 million, `COYO` pairs come with more metadata, giving users more flexibility and finer-grained control over usage; for example, `COYO` provides aesthetic scores for all pairs, more robust watermark scores, and face count data. The table below summarizes the differences.
+
+ | COYO | LAION 2B | ALIGN 1.8B |
+ | :----: | :----: | :----: |
+ | Image-text similarity scores calculated with CLIP ViT-B/32 and ViT-L/14 models, provided as metadata; nothing is filtered out, to avoid possible elimination bias | Image-text similarity score provided with CLIP (ViT-B/32); only examples above threshold 0.28 kept | Minimal, frequency-based filtering |
+ | NSFW filtering on images and text | NSFW filtering on images | [Google Cloud API](https://cloud.google.com/vision) |
+ | Face recognition (face count) data provided as metadata | No face recognition data | NA |
+ | 700 million pairs, all English | 2 billion English pairs | 1.8 billion pairs |
+ | From Common Crawl (Oct 2020 - Aug 2021) | From Common Crawl (2014-2020) | NA |
+ | Aesthetic score | Aesthetic score (partial) | NA |
+ | More robust watermark score | Watermark score | NA |
+ | Available on the Hugging Face Hub | Available on the Hugging Face Hub | Not made public |
+ | English | English | English? |
+
+ COYO is available on the hub as a [dataset](https://huggingface.co/datasets/kakaobrain/coyo-700m).
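+
+ As a minimal sketch (assuming the standard `datasets` API and a `train` split; the exact column names are whatever the dataset card lists, so inspect them first), the pairs can be streamed rather than downloaded in full:
+ ```python3
+ from datasets import load_dataset
+
+ # stream COYO-700M from the Hub instead of materializing 700M rows locally
+ coyo = load_dataset("kakaobrain/coyo-700m", split="train", streaming=True)
+ first = next(iter(coyo))
+ print(first.keys())  # inspect available columns (URLs, alt-text, metadata)
+ ```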
 
## Use with Transformers

+ ### Zero-Shot Image Classification
+
```python3
import requests
+ import torch
+ from PIL import Image
+ from transformers import AlignProcessor, AlignModel
+
+ processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
+ model = AlignModel.from_pretrained("kakaobrain/align-base")
+
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+ candidate_labels = ["an image of a cat", "an image of a dog"]
+
+ inputs = processor(text=candidate_labels, images=image, return_tensors="pt")
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # this is the image-text similarity score
+ logits_per_image = outputs.logits_per_image
+ # we can take the softmax to get the label probabilities
+ probs = logits_per_image.softmax(dim=1)
+ print(probs)
+ ```
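+
+ For quick experiments, roughly the same computation can be run through the `zero-shot-image-classification` pipeline (a minimal sketch; this is the task the widget above uses, and the pipeline wraps the processor/model calls shown here):
+ ```python3
+ from transformers import pipeline
+
+ classifier = pipeline("zero-shot-image-classification", model="kakaobrain/align-base")
+ predictions = classifier(
+     "http://images.cocodataset.org/val2017/000000039769.jpg",
+     candidate_labels=["an image of a cat", "an image of a dog"],
+ )
+ print(predictions)  # list of {"label", "score"} entries, sorted by score
+ ```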

+ ### Multi-Modal Embedding Retrieval
+ ```python3
+ import requests
+ import torch
+ from PIL import Image
from transformers import AlignProcessor, AlignModel

processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
model = AlignModel.from_pretrained("kakaobrain/align-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
+ text = "an image of a cat"
+
+ inputs = processor(text=text, images=image, return_tensors="pt")

+ with torch.no_grad():
+     outputs = model(**inputs)

+ # multi-modal text embedding
+ text_embeds = outputs.text_embeds
+
+ # multi-modal image embedding
+ image_embeds = outputs.image_embeds
```
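+
+ Continuing from the block above, a minimal sketch of how these embeddings can be scored for retrieval: rank candidates by cosine similarity in the shared embedding space (the explicit normalization is a no-op if the returned embeddings are already unit-length).
+ ```python3
+ # cosine similarity between text and image embeddings ranks retrieval candidates
+ image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
+ text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
+ similarity = text_embeds @ image_embeds.T
+ print(similarity)
+ ```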
 
+ Alternatively, retrieve image or text embeddings separately.
+ ```python3
+ import requests
+ import torch
+ from PIL import Image
+ from transformers import AlignProcessor, AlignModel
+
+ processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
+ model = AlignModel.from_pretrained("kakaobrain/align-base")
+
+ # image embeddings
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+ inputs = processor(images=image, return_tensors="pt")
+
+ image_embeds = model.get_image_features(
+     pixel_values=inputs['pixel_values'],
+ )
+
+ # text embeddings
+ text = "an image of a cat"
+ inputs = processor(text=text, return_tensors="pt")
+
+ text_embeds = model.get_text_features(
+     input_ids=inputs['input_ids'],
+     attention_mask=inputs['attention_mask'],
+     token_type_ids=inputs['token_type_ids'],
+ )
+ ```
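+
+ The same cosine-similarity scoring sketched above applies to these features as well; they may not come back unit-normalized, so normalize them before comparing.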

## Model Use

We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models.