Commit fae8652 (parent: b6fcc82)
Author: Irena Gao

update README

Files changed (1): README.md (+145, -14)
README.md CHANGED
@@ -12,10 +12,149 @@ OpenFlamingo is an open source implementation of DeepMind's [Flamingo](https://w
  This 4B-parameter model uses a [CLIP ViT-L/14](https://huggingface.co/openai/clip-vit-large-patch14) vision encoder and [RedPajama-3B](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1) language model.

  ## Model Details
- We follow the Flamingo modeling paradigm, outfitting the layers of a pretrained, frozen language model such that they cross-attend to visual features when decoding. Following Flamingo, we freeze the vision encoder and language model but train the connecting modules on web-scraped image-text sequences. Specifically, we use a mixture of [LAION-2B](https://arxiv.org/abs/2210.08402) and [Multimodal C4](https://arxiv.org/abs/2304.06939).
+ We follow the Flamingo modeling paradigm, outfitting the layers of a pretrained, frozen language model such that they cross-attend to visual features when decoding. Following Flamingo, we freeze the vision encoder and language model but train the connecting modules on web-scraped image-text sequences. Specifically, we trained this model on a mixture of [LAION-2B](https://arxiv.org/abs/2210.08402), [Multimodal C4](https://arxiv.org/abs/2304.06939), and custom ChatGPT-generated sequences using images from LAION (to be released soon).
+
+ This model has cross-attention modules inserted in *every other* decoder block. It was trained using FullyShardedDataParallel across 64 A100 40GB GPUs at FP32 precision.
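
For illustration, a model with this architecture can be instantiated with the `open_flamingo` package roughly as sketched below. The checkpoint file name and Hub repo id here are assumptions rather than a verbatim recipe; `cross_attn_every_n_layers=2` reflects the every-other-block layout described above.

``` python
# Minimal initialization sketch (assumed arguments and repo id): CLIP ViT-L/14
# vision encoder, RedPajama-3B language model, cross-attention every other block.
import torch
from huggingface_hub import hf_hub_download
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="togethercomputer/RedPajama-INCITE-Base-3B-v1",
    tokenizer_path="togethercomputer/RedPajama-INCITE-Base-3B-v1",
    cross_attn_every_n_layers=2,  # cross-attention in every other decoder block
)

# Load the released weights (assumed Hub repo id and file name for this 4B model).
checkpoint_path = hf_hub_download(
    "openflamingo/OpenFlamingo-4B-vitl-rpj3b", "checkpoint.pt"
)
model.load_state_dict(torch.load(checkpoint_path), strict=False)
```

The `model`, `image_processor`, and `tokenizer` objects produced by this sketch are the ones assumed by the generation example below.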

  ## Uses
  OpenFlamingo models process arbitrarily interleaved sequences of images and text to output text. This allows the models to accept in-context examples and undertake tasks like captioning, visual question answering, and image classification.
+ ### Generation example
+ Below is an example of generating text conditioned on interleaved images/text. In particular, let's try few-shot image captioning.
+
+ ``` python
+ from PIL import Image
+ import requests
+ import torch
+
+ """
+ Step 1: Load images
+ """
+ demo_image_one = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
+     ).raw
+ )
+
+ demo_image_two = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/test-stuff2017/000000028137.jpg",
+         stream=True
+     ).raw
+ )
+
+ query_image = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/test-stuff2017/000000028352.jpg",
+         stream=True
+     ).raw
+ )
+
+ """
+ Step 2: Preprocessing images
+ Details: For OpenFlamingo, we expect the image to be a torch tensor of shape
+ batch_size x num_media x num_frames x channels x height x width.
+ In this case batch_size = 1, num_media = 3, num_frames = 1,
+ channels = 3, height = 224, width = 224.
+ """
+ vision_x = [image_processor(demo_image_one).unsqueeze(0), image_processor(demo_image_two).unsqueeze(0), image_processor(query_image).unsqueeze(0)]
+ vision_x = torch.cat(vision_x, dim=0)
+ vision_x = vision_x.unsqueeze(1).unsqueeze(0)
+
+ """
+ Step 3: Preprocessing text
+ Details: In the text we expect an <image> special token to indicate where an image is.
+ We also expect an <|endofchunk|> special token to indicate the end of the text
+ portion associated with an image.
+ """
+ tokenizer.padding_side = "left"  # For generation, padding tokens should be on the left
+ lang_x = tokenizer(
+     ["<image>An image of two cats.<|endofchunk|><image>An image of a bathroom sink.<|endofchunk|><image>An image of"],
+     return_tensors="pt",
+ )
+
+ """
+ Step 4: Generate text
+ """
+ generated_text = model.generate(
+     vision_x=vision_x,
+     lang_x=lang_x["input_ids"],
+     attention_mask=lang_x["attention_mask"],
+     max_new_tokens=20,
+     num_beams=3,
+ )
+
+ print("Generated text: ", tokenizer.decode(generated_text[0]))
+ ```
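
The same `generate` interface handles the other in-context tasks mentioned above. As a rough sketch only (the question/answer prompt template below is an assumption, not a format specified by this card), few-shot visual question answering can reuse the images and preprocessing from the captioning example:

``` python
# Sketch: few-shot VQA with the same generate() call as above.
# Assumes model, image_processor, tokenizer, and the three images from the
# captioning example are already in scope; the "Question: ... Short answer:"
# template is an assumed prompt format.
vision_x = torch.cat(
    [
        image_processor(demo_image_one).unsqueeze(0),
        image_processor(demo_image_two).unsqueeze(0),
        image_processor(query_image).unsqueeze(0),
    ],
    dim=0,
).unsqueeze(1).unsqueeze(0)  # batch_size x num_media x num_frames x C x H x W

tokenizer.padding_side = "left"
lang_x = tokenizer(
    [
        "<image>Question: How many cats are in the image? Short answer: two<|endofchunk|>"
        "<image>Question: What room is shown? Short answer: a bathroom<|endofchunk|>"
        "<image>Question: What is the main object in the image? Short answer:"
    ],
    return_tensors="pt",
)

answer = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=10,
    num_beams=3,
)
print("Generated answer: ", tokenizer.decode(answer[0]))
```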

  ### Bias, Risks, and Limitations
  OpenFlamingo models inherit the risks of their parent models, especially the language model. As an open-source research effort, we highly value open, accessible, reproducible multimodal model research; however, it is crucial to be aware that these models are trained on web data, have not been finetuned for safety, and thus may produce unintended, inappropriate, unreliable, and/or inaccurate outputs. Please use caution before deploying OpenFlamingo models in real applications. We also hope that OpenFlamingo enables further safety and reliability research to address these issues.

@@ -42,11 +181,11 @@ In an effort to mitigate current potential biases and harms, we have deployed a
  </tr>
  <tr>
  <th>VQAv2 (Accuracy)</th>
- <td>44.0 (0.3)</td>
- <td>47.0 (0.3)</td>
- <td>45.2 (0.1)</td>
- <td>44.1 (0.3)</td>
- <td>41.9 (0.7)</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
  </tr>
  <tr>
  <th>Flickr-30K (CIDEr)</th>

@@ -80,14 +219,6 @@ In an effort to mitigate current potential biases and harms, we have deployed a
  <td>34.2 (1.4)</td>
  <td>39.9 (0.6)</td>
  </tr>
- <tr>
- <th>ImageNet (Top-1 Accuracy)</th>
- <td>-</td>
- <td>-</td>
- <td>-</td>
- <td>-</td>
- <td>-</td>
- </tr>
  <tr>
  <th>Hateful Memes (ROC AUC)</th>
  <td>-</td>