---
language: en
datasets:
- laion2b
---

# OpenFlamingo-9B (CLIP ViT-L/14, MPT-7B)

[Blog post]() | [Code](https://github.com/mlfoundations/open_flamingo) | [Demo]()

OpenFlamingo is an open-source implementation of DeepMind's [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model) models.
This 9B-parameter model combines a [CLIP ViT-L/14](https://huggingface.co/openai/clip-vit-large-patch14) vision encoder with an [MPT-7B](https://huggingface.co/mosaicml/mpt-7b) language model.

## Model Details
We follow the Flamingo modeling paradigm, outfitting the layers of a pretrained, frozen language model so that they cross-attend to visual features when decoding. As in Flamingo, we freeze both the vision encoder and the language model and train only the connecting modules on web-scraped image-text sequences, specifically a mixture of [LAION-2B](https://arxiv.org/abs/2210.08402) and [Multimodal C4](https://arxiv.org/abs/2304.06939).
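
Below is a minimal sketch of instantiating this architecture with the `open_flamingo` package and loading the trained connecting-module weights from the Hub. The language-model path, Hub repo id, checkpoint filename, and `cross_attn_every_n_layers` value are assumptions and should be checked against the open_flamingo repository.

```python
# Sketch: build the frozen CLIP ViT-L/14 + MPT-7B backbone and attach the
# trainable connecting modules, then load their weights from the Hub.
import torch
from huggingface_hub import hf_hub_download
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",        # frozen CLIP ViT-L/14 vision encoder
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="mosaicml/mpt-7b",        # frozen MPT-7B language model (path assumed)
    tokenizer_path="mosaicml/mpt-7b",
    cross_attn_every_n_layers=4,                # assumed cross-attention spacing
)

# Only the connecting modules were trained; load their weights (repo id and
# filename assumed) on top of the frozen backbones.
checkpoint_path = hf_hub_download("openflamingo/OpenFlamingo-9B-vitl-mpt7b", "checkpoint.pt")
model.load_state_dict(torch.load(checkpoint_path), strict=False)
```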

## Uses
OpenFlamingo models process arbitrarily interleaved sequences of images and text and output free-form text. This interface lets the models consume in-context examples and perform tasks such as captioning, visual question answering, and image classification.
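
The sketch below shows few-shot captioning with an interleaved prompt, reusing the `model`, `image_processor`, and `tokenizer` from the loading example above. The `<image>` and `<|endofchunk|>` special tokens and the `(batch, num_images, num_frames, C, H, W)` vision tensor layout follow the open_flamingo codebase; the image URLs and prompt are illustrative assumptions.

```python
# Sketch: one in-context captioning example followed by a query image.
import requests
import torch
from PIL import Image

demo_image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
query_image = Image.open(
    requests.get("http://images.cocodataset.org/test-stuff2017/000000028137.jpg", stream=True).raw
)

# Stack preprocessed images into shape (batch, num_images, num_frames, C, H, W).
vision_x = torch.cat(
    [image_processor(img).unsqueeze(0) for img in (demo_image, query_image)], dim=0
).unsqueeze(1).unsqueeze(0)

# "<image>" marks where each image is attended to; "<|endofchunk|>" closes an
# image-text pair. The first pair is the in-context example, the second the query.
tokenizer.padding_side = "left"
lang_x = tokenizer(
    ["<image>An image of two cats.<|endofchunk|><image>An image of"],
    return_tensors="pt",
)

generated = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=20,
    num_beams=3,
)
print(tokenizer.decode(generated[0]))
```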

### Bias, Risks, and Limitations
OpenFlamingo models inherit the risks of their parent models, especially the language model. As an open-source research effort, OpenFlamingo prioritizes open, accessible, reproducible multimodal research; however, these models are trained on web data, have not been finetuned for safety, and may therefore produce unintended, inappropriate, unreliable, and/or inaccurate outputs. Please use caution before deploying OpenFlamingo models in real applications. We also hope that OpenFlamingo enables further safety and reliability research to address these issues.

To help mitigate potential biases and harms, we have deployed a text content filter on model outputs in the OpenFlamingo demo. We continue to red-team the model to understand and improve its safety.

## Evaluation
<table>
  <tr>
    <th>Benchmark</th>
    <th>0-shot</th>
    <th>4-shot</th>
    <th>8-shot</th>
    <th>16-shot</th>
    <th>32-shot</th>
  </tr>
  <tr>
    <th>COCO (CIDEr)</th>
    <td>79.5 (0.2)</td>
    <td>89.0 (0.3)</td>
    <td>96.3 (0.1)</td>
    <td>98.8 (0.7)</td>
    <td>99.5 (0.1)</td>
  </tr>
  <tr>
    <th>VQAv2 (Accuracy)</th>
    <td>48.3 (0.1)</td>
    <td>49.4 (0.4)</td>
    <td>51.8 (0.4)</td>
    <td>51.3 (0.5)</td>
    <td>50.2 (0.6)</td>
  </tr>
  <tr>
    <th>Flickr-30K (CIDEr)</th>
    <td>59.5 (1.0)</td>
    <td>65.8 (0.6)</td>
    <td>62.9 (1.0)</td>
    <td>62.8 (1.0)</td>
    <td>61.3 (0.7)</td>
  </tr>
  <tr>
    <th>OK-VQA (Accuracy)</th>
    <td>34.7 (0.1)</td>
    <td>34.3 (0.1)</td>
    <td>38.4 (0.0)</td>
    <td>39.5 (0.1)</td>
    <td>38.1 (0.0)</td>
  </tr>
  <tr>
    <th>TextVQA (Accuracy)</th>
    <td>24.2 (0.5)</td>
    <td>28.2 (0.4)</td>
    <td>29.1 (0.1)</td>
    <td>27.3 (0.1)</td>
    <td>23.8 (0.2)</td>
  </tr>
  <tr>
    <th>Vizwiz (Accuracy)</th>
    <td>17.7 (0.7)</td>
    <td>23.1 (0.9)</td>
    <td>31.6 (1.5)</td>
    <td>38.0 (1.1)</td>
    <td>40.2 (0.7)</td>
  </tr>
  <tr>
    <th>ImageNet (Top-1 Accuracy)</th>
    <td>-</td>
    <td>-</td>
    <td>-</td>
    <td>-</td>
    <td>-</td>
  </tr>
  <tr>
    <th>Hateful Memes (ROC AUC)</th>
    <td>-</td>
    <td>-</td>
    <td>-</td>
    <td>-</td>
    <td>-</td>
  </tr>
</table>