Commit fae8652 (parent: b6fcc82)
Author: Irena Gao

update README

Files changed (1): README.md (+145, -14)
README.md CHANGED
@@ -12,10 +12,149 @@ OpenFlamingo is an open source implementation of DeepMind's [Flamingo](https://w
  This 4B-parameter model uses a [CLIP ViT-L/14](https://huggingface.co/openai/clip-vit-large-patch14) vision encoder and [RedPajama-3B](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1) language model.

  ## Model Details
- We follow the Flamingo modeling paradigm, outfitting the layers of a pretrained, frozen language model such that they cross-attend to visual features when decoding. Following Flamingo, we freeze the vision encoder and language model but train the connecting modules on web-scraped image-text sequences. Specifically, we use a mixture of [LAION-2B](https://arxiv.org/abs/2210.08402) and [Multimodal C4](https://arxiv.org/abs/2304.06939).
+ We follow the Flamingo modeling paradigm, outfitting the layers of a pretrained, frozen language model such that they cross-attend to visual features when decoding. Following Flamingo, we freeze the vision encoder and language model but train the connecting modules on web-scraped image-text sequences. Specifically, we trained this model on a mixture of [LAION-2B](https://arxiv.org/abs/2210.08402), [Multimodal C4](https://arxiv.org/abs/2304.06939), and custom ChatGPT-generated sequences using images from LAION (to be released soon).
+
+ This model has cross-attention modules inserted in *every other* decoder block. It was trained using FullyShardedDataParallel across 64 A100 40GB GPUs at FP32 precision.
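
For illustration, a model with this architecture can be instantiated with the `open_flamingo` package roughly as sketched below. The checkpoint file name and Hub repo id here are assumptions rather than a verbatim recipe; `cross_attn_every_n_layers=2` reflects the every-other-block layout described above.

``` python
# Minimal initialization sketch (assumed arguments and repo id): CLIP ViT-L/14
# vision encoder, RedPajama-3B language model, cross-attention every other block.
import torch
from huggingface_hub import hf_hub_download
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="togethercomputer/RedPajama-INCITE-Base-3B-v1",
    tokenizer_path="togethercomputer/RedPajama-INCITE-Base-3B-v1",
    cross_attn_every_n_layers=2,  # cross-attention in every other decoder block
)

# Load the released weights (assumed Hub repo id and file name for this 4B model).
checkpoint_path = hf_hub_download(
    "openflamingo/OpenFlamingo-4B-vitl-rpj3b", "checkpoint.pt"
)
model.load_state_dict(torch.load(checkpoint_path), strict=False)
```

The `model`, `image_processor`, and `tokenizer` objects produced by this sketch are the ones assumed by the generation example below.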

  ## Uses
  OpenFlamingo models process arbitrarily interleaved sequences of images and text to output text. This allows the models to accept in-context examples and undertake tasks like captioning, visual question answering, and image classification.
+ ### Generation example
+ Below is an example of generating text conditioned on interleaved images/text. In particular, let's try few-shot image captioning.
+
+ ``` python
+ from PIL import Image
+ import requests
+ import torch
+
+ """
+ Step 1: Load images
+ """
+ demo_image_one = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
+     ).raw
+ )
+
+ demo_image_two = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/test-stuff2017/000000028137.jpg",
+         stream=True
+     ).raw
+ )
+
+ query_image = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/test-stuff2017/000000028352.jpg",
+         stream=True
+     ).raw
+ )
+
+ """
+ Step 2: Preprocessing images
+ Details: For OpenFlamingo, we expect the image to be a torch tensor of shape
+ batch_size x num_media x num_frames x channels x height x width.
+ In this case batch_size = 1, num_media = 3, num_frames = 1,
+ channels = 3, height = 224, width = 224.
+ """
+ vision_x = [image_processor(demo_image_one).unsqueeze(0), image_processor(demo_image_two).unsqueeze(0), image_processor(query_image).unsqueeze(0)]
+ vision_x = torch.cat(vision_x, dim=0)
+ vision_x = vision_x.unsqueeze(1).unsqueeze(0)
+
+ """
+ Step 3: Preprocessing text
+ Details: In the text we expect an <image> special token to indicate where an image is.
+ We also expect an <|endofchunk|> special token to indicate the end of the text
+ portion associated with an image.
+ """
+ tokenizer.padding_side = "left"  # For generation, padding tokens should be on the left
+ lang_x = tokenizer(
+     ["<image>An image of two cats.<|endofchunk|><image>An image of a bathroom sink.<|endofchunk|><image>An image of"],
+     return_tensors="pt",
+ )
+
+ """
+ Step 4: Generate text
+ """
+ generated_text = model.generate(
+     vision_x=vision_x,
+     lang_x=lang_x["input_ids"],
+     attention_mask=lang_x["attention_mask"],
+     max_new_tokens=20,
+     num_beams=3,
+ )
+
+ print("Generated text: ", tokenizer.decode(generated_text[0]))
+ ```
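
The same `generate` interface handles the other in-context tasks mentioned above. As a rough sketch only (the question/answer prompt template below is an assumption, not a format specified by this card), few-shot visual question answering can reuse the images and preprocessing from the captioning example:

``` python
# Sketch: few-shot VQA with the same generate() call as above.
# Assumes model, image_processor, tokenizer, and the three images from the
# captioning example are already in scope; the "Question: ... Short answer:"
# template is an assumed prompt format.
vision_x = torch.cat(
    [
        image_processor(demo_image_one).unsqueeze(0),
        image_processor(demo_image_two).unsqueeze(0),
        image_processor(query_image).unsqueeze(0),
    ],
    dim=0,
).unsqueeze(1).unsqueeze(0)  # batch_size x num_media x num_frames x C x H x W

tokenizer.padding_side = "left"
lang_x = tokenizer(
    [
        "<image>Question: How many cats are in the image? Short answer: two<|endofchunk|>"
        "<image>Question: What room is shown? Short answer: a bathroom<|endofchunk|>"
        "<image>Question: What is the main object in the image? Short answer:"
    ],
    return_tensors="pt",
)

answer = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=10,
    num_beams=3,
)
print("Generated answer: ", tokenizer.decode(answer[0]))
```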

  ### Bias, Risks, and Limitations
  OpenFlamingo models inherit the risks of their parent models, especially the language model. As an open-source research effort, we highly value open, accessible, reproducible multimodal model research; however, it is crucial to be aware that these models are trained on web data, have not been finetuned for safety, and thus may produce unintended, inappropriate, unreliable, and/or inaccurate outputs. Please use caution before deploying OpenFlamingo models in real applications. We also hope that OpenFlamingo enables further safety and reliability research to address these issues.

@@ -42,11 +181,11 @@ In an effort to mitigate current potential biases and harms, we have deployed a
  </tr>
  <tr>
  <th>VQAv2 (Accuracy)</th>
- <td>44.0 (0.3)</td>
- <td>47.0 (0.3)</td>
- <td>45.2 (0.1)</td>
- <td>44.1 (0.3)</td>
- <td>41.9 (0.7)</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
+ <td>-</td>
  </tr>
  <tr>
  <th>Flickr-30K (CIDEr)</th>

@@ -80,14 +219,6 @@ In an effort to mitigate current potential biases and harms, we have deployed a
  <td>34.2 (1.4)</td>
  <td>39.9 (0.6)</td>
  </tr>
- <tr>
- <th>ImageNet (Top-1 Accuracy)</th>
- <td>-</td>
- <td>-</td>
- <td>-</td>
- <td>-</td>
- <td>-</td>
- </tr>
  <tr>
  <th>Hateful Memes (ROC AUC)</th>
  <td>-</td>