---
license: bsd-3-clause
---
## Model Card: FLAVA

## Model Details

FLAVA was developed by researchers at FAIR to understand whether a single model can work across different modalities with a unified architecture. The model was trained solely on publicly available multimodal datasets containing 70M image-text pairs in total, and is thus fully reproducible. The model can (i) be used, similar to CLIP, for arbitrary image classification tasks in a zero-shot manner, (ii) be used for image or text retrieval in a zero-shot manner, and (iii) be fine-tuned for natural language understanding (NLU) tasks such as GLUE and vision-and-language reasoning tasks such as VQA v2. In the original paper, the authors evaluate FLAVA on 35 tasks from the computer vision, NLU, and vision-and-language domains and show impressive performance across the board, scoring a higher micro-average than CLIP while being open.

## Model Date

The model was originally released in November 2021.

## Model Type

The FLAVA model uses a ViT-B/16 transformer for both its image and text encoders. On top of these, FLAVA employs a 6-layer multimodal encoder for multimodal tasks such as vision-and-language reasoning (e.g., VQA). Each component of the FLAVA model can be loaded individually from the `facebook/flava-full` checkpoint. If you need the complete heads used for pretraining, use the `FlavaForPreTraining` model class; otherwise `FlavaModel` should suffice for most use cases. This [repository](https://github.com/facebookresearch/multimodal/tree/main/examples/flava) also contains code to pretrain the FLAVA model from scratch.
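
As a quick illustration of loading components individually, a minimal sketch is below. The image and text sub-models are loaded exactly as in the sections further down; whether the multimodal sub-model can be loaded directly from the full checkpoint this way may depend on your `transformers` version, so treat that line as an assumption.

```py
# A minimal sketch of loading FLAVA's components individually from the full checkpoint.
# Loading a sub-model from "facebook/flava-full" may warn about unused weights, which is expected.
from transformers import FlavaImageModel, FlavaTextModel, FlavaMultimodalModel

image_encoder = FlavaImageModel.from_pretrained("facebook/flava-full")
text_encoder = FlavaTextModel.from_pretrained("facebook/flava-full")
# Loading the multimodal encoder this way is an assumption; it may require a recent transformers version.
multimodal_encoder = FlavaMultimodalModel.from_pretrained("facebook/flava-full")

print(type(image_encoder).__name__, type(text_encoder).__name__, type(multimodal_encoder).__name__)
```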

## Documents

- [FLAVA Paper](https://arxiv.org/abs/2112.04482)

## Using with Transformers

### FlavaModel

The FLAVA model supports vision, language, and multimodal inputs. You can pass inputs for the corresponding modalities to get outputs related to those domains.

```py
from PIL import Image
import requests

from transformers import FlavaProcessor, FlavaModel

model = FlavaModel.from_pretrained("facebook/flava-full")
processor = FlavaProcessor.from_pretrained("facebook/flava-full")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=[image, image],
    return_tensors="pt",
    padding="max_length",
    max_length=77,
)

outputs = model(**inputs)
image_embeddings = outputs.image_embeddings  # Batch size x (Number of image patches + 1) x Hidden size => 2 x 197 x 768
text_embeddings = outputs.text_embeddings  # Batch size x (Text sequence length + 1) x Hidden size => 2 x 77 x 768
multimodal_embeddings = outputs.multimodal_embeddings  # Batch size x (Number of image patches + Text sequence length + 3) x Hidden size => 2 x 275 x 768
# Multimodal embeddings can be used for multimodal tasks such as VQA


## Pass only image
from transformers import FlavaFeatureExtractor

feature_extractor = FlavaFeatureExtractor.from_pretrained("facebook/flava-full")
inputs = feature_extractor(images=[image, image], return_tensors="pt")
outputs = model(**inputs)
image_embeddings = outputs.image_embeddings

## Pass only text
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("facebook/flava-full")
inputs = tokenizer(["a photo of a cat", "a photo of a dog"], return_tensors="pt", padding="max_length", max_length=77)
outputs = model(**inputs)
text_embeddings = outputs.text_embeddings
```

#### Encode Image

```py
from PIL import Image
import requests

from transformers import FlavaFeatureExtractor, FlavaModel

model = FlavaModel.from_pretrained("facebook/flava-full")
feature_extractor = FlavaFeatureExtractor.from_pretrained("facebook/flava-full")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = feature_extractor(images=[image], return_tensors="pt")

image_embedding = model.get_image_features(**inputs)
```

#### Encode Text

```py
from transformers import BertTokenizer, FlavaModel

model = FlavaModel.from_pretrained("facebook/flava-full")
tokenizer = BertTokenizer.from_pretrained("facebook/flava-full")

inputs = tokenizer(text=["a photo of a dog"], return_tensors="pt", padding="max_length", max_length=77)

text_embedding = model.get_text_features(**inputs)
```
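
Combining the two, here is a minimal zero-shot classification sketch. It is an illustration, not the authors' evaluation code: it assumes the CLIP-style cosine-similarity convention and that `get_image_features`/`get_text_features` return per-token projections (hence the `[:, 0]` indexing to take the projected [CLS] token); double-check both against your `transformers` version.

```py
import requests
import torch
from PIL import Image

from transformers import BertTokenizer, FlavaFeatureExtractor, FlavaModel

model = FlavaModel.from_pretrained("facebook/flava-full")
feature_extractor = FlavaFeatureExtractor.from_pretrained("facebook/flava-full")
tokenizer = BertTokenizer.from_pretrained("facebook/flava-full")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["a photo of a cat", "a photo of a dog"]

image_inputs = feature_extractor(images=[image], return_tensors="pt")
text_inputs = tokenizer(text=candidate_labels, return_tensors="pt", padding="max_length", max_length=77)

with torch.no_grad():
    # Take the projected [CLS] token from each modality.
    image_features = model.get_image_features(**image_inputs)[:, 0]
    text_features = model.get_text_features(**text_inputs)[:, 0]

image_features = torch.nn.functional.normalize(image_features, dim=-1)
text_features = torch.nn.functional.normalize(text_features, dim=-1)

# Cosine similarity between the image and each candidate label; higher means a better match.
similarities = image_features @ text_features.T
print(dict(zip(candidate_labels, similarities[0].tolist())))
```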

### FlavaForPreTraining

The FLAVA model supports vision, language, and multimodal inputs. You can pass inputs for the corresponding modalities to get the losses and outputs related to those domains.

```py
from PIL import Image
import requests

from transformers import FlavaProcessor, FlavaForPreTraining

model = FlavaForPreTraining.from_pretrained("facebook/flava-full")
processor = FlavaProcessor.from_pretrained("facebook/flava-full")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=[image, image],
    return_tensors="pt",
    padding="max_length",
    max_length=77,
    return_codebook_pixels=True,
    return_image_mask=True,
    # Other things such as mlm_labels, itm_labels can be passed here. See docs
)
inputs.bool_masked_pos.zero_()

outputs = model(**inputs)
image_embeddings = outputs.image_embeddings  # Batch size x (Number of image patches + 1) x Hidden size => 2 x 197 x 768
text_embeddings = outputs.text_embeddings  # Batch size x (Text sequence length + 1) x Hidden size => 2 x 77 x 768
# Multimodal embeddings can be used for multimodal tasks such as VQA
multimodal_embeddings = outputs.multimodal_embeddings  # Batch size x (Number of image patches + Text sequence length + 3) x Hidden size => 2 x 275 x 768

# Loss
loss = outputs.loss  # probably NaN due to missing labels

# Global contrastive loss logits
image_contrastive_logits = outputs.contrastive_logits_per_image
text_contrastive_logits = outputs.contrastive_logits_per_text

# ITM logits
itm_logits = outputs.itm_logits
```
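
Continuing directly from the snippet above, here is a short sketch of how these logits are commonly interpreted. This is an illustration rather than the authors' evaluation code, and the (no match, match) index order assumed for the ITM head should be verified against the documentation for your `transformers` version.

```py
import torch

# Each row of the image-to-text contrastive logits scores one image against every
# text in the batch; a softmax over the row gives a relative ranking of the texts.
probs_per_image = torch.softmax(image_contrastive_logits, dim=-1)
print(probs_per_image)

# The ITM head is a binary classifier over image-text pairs. The (no match, match)
# index order is an assumption here - verify it before relying on it.
itm_probs = torch.softmax(itm_logits, dim=-1)
print(itm_probs)
```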

### FlavaImageModel

```py
from PIL import Image
import requests

from transformers import FlavaFeatureExtractor, FlavaImageModel

model = FlavaImageModel.from_pretrained("facebook/flava-full")
feature_extractor = FlavaFeatureExtractor.from_pretrained("facebook/flava-full")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = feature_extractor(images=[image], return_tensors="pt")

outputs = model(**inputs)
image_embeddings = outputs.last_hidden_state
```

### FlavaTextModel

```py
from transformers import BertTokenizer, FlavaTextModel

model = FlavaTextModel.from_pretrained("facebook/flava-full")
tokenizer = BertTokenizer.from_pretrained("facebook/flava-full")

inputs = tokenizer(text=["a photo of a dog"], return_tensors="pt", padding="max_length", max_length=77)

outputs = model(**inputs)
text_embeddings = outputs.last_hidden_state
```

## Model Use

## Intended Use

The model is intended to serve as a reproducible research artifact for the research community, in contrast to models such as [CLIP](https://github.com/openai/CLIP) and [SimVLM](https://arxiv.org/abs/2108.10904) whose exact reproduction details were never released. FLAVA performs equivalently to these models on most tasks while being trained on less, but fully public, data (70M pairs compared to CLIP's 400M and SimVLM's 1.8B pairs). We hope that this model enables communities to better understand and explore zero-shot and arbitrary image classification, multi-domain pretraining, and generic architectures, while also providing a base to build on.

## Primary Intended Uses

The primary intended users of these models are AI researchers.

We primarily imagine the model will be used by researchers to better understand the robustness, generalization, and other capabilities, biases, and constraints of foundation models that work across domains, which in this case are vision, language, and the combined multimodal vision-and-language domain.

## Out-of-Scope Use Cases

Similar to CLIP, **any** deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. Though FLAVA is trained on open and public data that does not contain much harmful content, users should still employ proper safety measures.

Certain use cases that fall under the domain of surveillance and facial recognition are always out of scope, regardless of the model's performance. The use of artificial intelligence for such tasks is currently premature given the lack of testing norms and checks to ensure its fair use.

Since the model has not been purposefully trained on or evaluated in any language other than English, its use should be limited to English-language use cases.

## Data

FLAVA was pretrained on 70M publicly available image-text pairs. This includes datasets such as COCO, Visual Genome, Localized Narratives, RedCaps, a custom filtered subset of YFCC100M, SBU Captions, Conceptual Captions, and Wikipedia Image-Text. A large portion of this data comes from the internet, and can thus be biased towards people most connected to the internet, such as those from developed countries and younger, male users.

## Data Mission Statement

Our goal with building this dataset, called PMD (Public Multimodal Datasets), was two-fold: (i) to allow reproducibility of vision-language foundation models with publicly available data, and (ii) to test the robustness and generalizability of FLAVA across domains. The data was collected from existing public dataset sources whose original curators have already filtered out adult and excessively violent content. We will make the URLs of the images public for further research reproducibility, but will not be hosting them.

## Performance and Limitations

## Performance

FLAVA has been evaluated on 35 different tasks from computer vision, natural language understanding, and vision-and-language reasoning.
On COCO and Flickr30k retrieval, we report zero-shot accuracy; on image tasks, we report linear-eval accuracy; and on the rest of the tasks, we report fine-tuned accuracies. Generally, FLAVA works much better than CLIP on tasks that require good text understanding. The paper describes the results in more detail, but the 35 datasets are listed below (a minimal linear-eval sketch follows the lists):

### Natural Language Understanding

- MNLI
- CoLA
- MRPC
- QQP
- SST-2
- QNLI
- RTE
- STS-B

### Image Understanding

- ImageNet
- Food101
- CIFAR10
- CIFAR100
- Cars
- Aircraft
- DTD
- Pets
- Caltech101
- Flowers102
- MNIST
- STL10
- EuroSAT
- GTSRB
- KITTI
- PCAM
- UCF101
- CLEVR
- FER 2013
- SUN397
- Image SST
- Country 211

### Vision and Language Reasoning

- VQA v2
- SNLI-VE
- Hateful Memes
- Flickr30K Retrieval
- COCO Retrieval
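
As referenced above, here is a minimal sketch of the linear-eval protocol used for the image tasks: freeze the image encoder, extract [CLS] embeddings, and fit a linear classifier on top. The solid-color toy images, the two-class labels, and the use of scikit-learn are placeholders and assumptions for illustration, not the authors' evaluation setup.

```py
import numpy as np
import torch
from PIL import Image

from sklearn.linear_model import LogisticRegression
from transformers import FlavaFeatureExtractor, FlavaImageModel

model = FlavaImageModel.from_pretrained("facebook/flava-full").eval()
feature_extractor = FlavaFeatureExtractor.from_pretrained("facebook/flava-full")

# Placeholder "dataset": solid-color images with two classes. In practice this
# would be a benchmark such as CIFAR-10 with its train/test splits.
images = [Image.new("RGB", (224, 224), color=c) for c in ["red", "red", "blue", "blue"]]
labels = np.array([0, 0, 1, 1])

with torch.no_grad():
    inputs = feature_extractor(images=images, return_tensors="pt")
    features = model(**inputs).last_hidden_state[:, 0].numpy()  # frozen [CLS] embeddings

# Linear probe: a logistic-regression classifier on top of the frozen features.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("train accuracy:", probe.score(features, labels))
```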

## Limitations

Currently, FLAVA has many limitations. Its image classification accuracy is not on par with CLIP on some tasks, and its text accuracy is not on par with BERT on some tasks, suggesting room for improvement. FLAVA also does not work well on tasks involving scene text, given the lack of scene text in most public datasets. Additionally, similar to CLIP, our approach to testing FLAVA has an important limitation for image tasks: we use linear probes to evaluate FLAVA, and there is evidence suggesting that linear probes can underestimate model performance.