stas VictorSanh committed on
Commit
66a38fb
1 Parent(s): d85c074

first pass over the model card (#1)


- first pass over the model card (d6110ce1f7f896624ad228c39512efed7ec61e0e)


Co-authored-by: Victor Sanh <VictorSanh@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +64 -114
README.md CHANGED
@@ -4,20 +4,25 @@ tags:
4
  - multimodal
5
  - text
6
  - image
7
- license: "apache-2.0"
8
  datasets:
9
- - obelisc
 
 
 
10
  ---
11
 
12
 
13
-
14
 
15
  # Model Card for m4-80b
16
 
17
  <!-- Provide a quick summary of what the model is/does. [Optional] -->
18
- Some cool model...
19
-
20
 
 
 
21
 
22
 
23
  # Table of Contents
@@ -60,118 +65,92 @@ Some cool model...
60
 
61
  # Model Details
62
 
63
- ## Model Description
64
-
65
- <!-- Provide a longer summary of what this model is/does. -->
66
- Some cool model...
67
-
68
- - **Developed by:** HuggingFace
69
  - **Model type:** Multi-modal model (text+image)
70
  - **Language(s) (NLP):** en
71
- - **License:** apache-2.0
72
- - **Parent Model:** [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggingface/llama-65b](https://huggingface.co/huggingface/llama-65b)
73
- - **Resources for more information:** More information needed
74
  - [GitHub Repo](https://github.com/huggingface/m4/)
75
- - Associated Paper: [Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198)
76
-
77
- # Uses
78
-
79
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
80
-
81
- ## Direct Use
82
-
83
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
84
- <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
85
 
 
 
86
 
 
87
 
 
 
88
 
89
- ## Downstream Use [Optional]
90
-
91
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
92
- <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
93
-
94
-
95
-
96
-
97
- ## Out-of-Scope Use
98
-
99
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
100
- <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
101
-
102
-
103
 
 
104
 
105
- # Bias, Risks, and Limitations
106
 
107
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
108
 
109
- Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
110
 
111
 
112
- ## Recommendations
113
 
114
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
115
 
 
 
116
 
 
117
 
 
118
 
119
 
120
  # Training Details
121
 
122
- ## Training Data
123
 
124
- <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
125
 
126
- More information on training data needed
 
 
 
 
 
127
 
 
 
 
 
128
 
129
- ## Training Procedure
130
-
131
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
132
-
133
- ### Preprocessing
134
-
135
- More information needed
136
-
137
- ### Speeds, Sizes, Times
138
-
139
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
140
-
141
- More information needed
142
 
143
  # Evaluation
144
 
145
  <!-- This section describes the evaluation protocols and provides the results. -->
 
 
146
 
147
- ## Testing Data, Factors & Metrics
148
-
149
- ### Testing Data
150
 
151
- <!-- This should link to a Data Card if possible. -->
152
 
153
- More information needed
154
-
155
-
156
- ### Factors
157
-
158
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
159
-
160
- More information needed
161
 
162
- ### Metrics
163
 
164
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
165
 
166
- More information needed
 
 
167
 
168
- ## Results
 
169
 
170
- More information needed
 
171
 
172
- # Model Examination
173
-
174
- More information needed
175
 
176
  # Environmental Impact
177
 
@@ -184,26 +163,17 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
184
  - **Cloud Provider:** AWS Sagemaker
185
  - **Carbon Emitted:** unknown
186
 
187
- # Technical Specifications [optional]
188
 
189
- ## Model Architecture and Objective
190
 
191
- More information needed
192
-
193
- ## Compute Infrastructure
194
-
195
- More information needed
196
-
197
- ### Hardware
198
-
199
- The training was performed on AWS SageMaker cluster with 64 nodes of 8x80GB A100 GPUs (512 GPUs total). The cluster uses the current EFA network which provides about 340GBps throughput.
200
 
201
As the network is quite slow for the needs of DeepSpeed ZeRO-3, we were only able to clock ~90 TFLOPs.
202
 
 
203
 
204
- ### Software
205
-
206
- The training software is built on top of HuggingFace Transformers + Accelerate, and DeepSpeed ZeRO-3. Plus [WebDataset](https://github.com/webdataset/webdataset) for data loading.
207
 
208
 
209
  # Citation
@@ -218,15 +188,6 @@ More information needed
218
 
219
  More information needed
220
 
221
- # Glossary [optional]
222
-
223
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
224
-
225
- More information needed
226
-
227
- # More Information [optional]
228
-
229
- More information needed
230
 
231
  # Model Card Authors [optional]
232
 
@@ -236,15 +197,4 @@ Victor, Stas, XXX
236
 
237
  # Model Card Contact
238
 
239
- More information needed
240
-
241
- # How to Get Started with the Model
242
-
243
- Use the code below to get started with the model.
244
-
245
- <details>
246
- <summary> Click to expand </summary>
247
-
248
- More information needed
249
-
250
- </details>
 
4
  - multimodal
5
  - text
6
  - image
7
+ license: other
8
  datasets:
9
+ - HuggingFaceM4/OBELISC
10
+ - wikipedia
11
+ - facebook/pmd
12
+ - laion/laion2B-en
13
  ---
14
 
15
 
16
+ TODO: logo?
17
 
18
  # Model Card for m4-80b
19
 
20
  <!-- Provide a quick summary of what the model is/does. [Optional] -->
21
+ ATUM (**A**dapted **T**ransformers for **U**nstructured **M**ultimodal data) is an open-access reproduction of Flamingo, a closed-source visual language model developed by DeepMind. The multimodal model accepts arbitrary sequences of image and text inputs and produces text outputs, and is built solely on publicly available data and models.
22
+ ATUM (TODO) is on par with the original model on various image + text benchmarks, including visual question answering (open-ended and multiple choice), image captioning, and image classification when evaluated with in-context few-shot learning.
23
 
24
+ The model comes in two variants: a large [80 billion parameter version](https://huggingface.co/HuggingFaceM4/m4-80b) and a [9 billion parameter version](https://huggingface.co/HuggingFaceM4/m4-9b).
25
+ We also fine-tune these base models on a mixture of supervised fine-tuning (SFT) datasets (TODO: find a more understandable characterization), which boosts downstream performance while making the models more usable in conversational settings: (TODO: 80B-sfted) and (TODO: 9B sfted).
26
 
27
 
28
  # Table of Contents
 
65
 
66
  # Model Details
67
 
68
+ - **Developed by:** Hugging Face
 
 
 
 
 
69
  - **Model type:** Multi-modal model (text+image)
70
  - **Language(s) (NLP):** en
71
+ - **License:** other
72
+ - **Parent Model:** [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)
73
+ - **Resources for more information:**
74
  - [GitHub Repo](https://github.com/huggingface/m4/)
75
+ - Description of [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC): [OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://arxiv.org/abs/2306.16527)
77
+ - Original Paper: [Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198)
 
 
 
 
 
 
 
78
 
79
+ ATUM is a large multimodal model that takes sequences of interleaved images and texts as inputs and generates text outputs.
80
+ The model shows strong in-context few-shot learning capabilities (on par with the closed-source model) and is a robust starting point for fine-tuning multimodal models on custom data.
81
 
82
+ ATUM is built on top of two unimodal open-access pre-trained models to connect the two modalities. Newly initialized parameters in the form of Transformer blocks bridge the gap between the vision encoder and the language model. The model is trained on a mixture of image/text pairs and unstructured multimodal web documents.
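As a minimal sketch of this bridging scheme (the module names are illustrative, and the assumption that the pre-trained components stay frozen while only the new blocks train, as in Flamingo, is ours):

```python
# Illustrative sketch only: module names are hypothetical, and the
# freezing scheme is assumed from Flamingo, not taken from the codebase.
modules = {
    "vision_encoder":         {"pretrained": True,  "trainable": False},
    "language_model":         {"pretrained": True,  "trainable": False},
    "perceiver_pooler":       {"pretrained": False, "trainable": True},
    "cross_attention_blocks": {"pretrained": False, "trainable": True},
}

def trainable(modules):
    """Names of the newly initialized modules that receive gradients."""
    return sorted(name for name, m in modules.items() if m["trainable"])

print(trainable(modules))
# -> ['cross_attention_blocks', 'perceiver_pooler']
```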
83
 
84
+
85
+ # Uses
86
 
87
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
88
 
89
+ The model can be used to perform inference on multimodal (image + text) tasks in which the input is composed of a text query/instruction along with one or multiple images. This model does not support image generation.
90
 
91
+ It is possible to fine-tune the base model on custom data for a specific use case. Note that the instruction-fine-tuned models are significantly better at following instructions and should thus be preferred when using the models out of the box.
92
 
93
+ The following screenshot is an example of interaction with the model:
94
 
95
+ TODO: screenshot
96
 
97
 
98
+ # How to Get Started with the Model
99
 
100
+ Use the code below to get started with the model.
101
 
102
+ <details>
103
+ <summary> Click to expand </summary>
104
 
105
+ More information needed
106
 
107
+ </details>
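Until official snippets are available, here is a hypothetical sketch of assembling an interleaved image/text input; the `<image>` placeholder token and the element structure are assumptions, not the released API:

```python
# Hypothetical sketch: the model card only specifies that inputs are
# arbitrary sequences of images and text. The placeholder token and
# the element structure below are assumptions, not the actual API.

def build_prompt(elements):
    """Flatten a mixed sequence of text and images into a prompt string,
    marking each image position with a placeholder token."""
    parts = []
    for el in elements:
        if el["type"] == "image":
            parts.append("<image>")  # assumed placeholder, not a documented token
        else:
            parts.append(el["text"])
    return " ".join(parts)

prompt = build_prompt([
    {"type": "image", "url": "https://example.com/dog.jpg"},
    {"type": "text", "text": "Question: What animal is this? Answer:"},
])
print(prompt)
# -> <image> Question: What animal is this? Answer:
```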
108
 
109
 
110
  # Training Details
111
 
112
+ We closely follow the training procedure laid out in [Flamingo](https://arxiv.org/abs/2204.14198). We combine two open-source pre-trained models ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)) by initializing new Transformer blocks.
113
 
114
+ The model is trained on the following data mixture of openly accessible data:
115
 
116
+ | Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
117
+ |-------------|---------------------------------------|----------------------------|----------------------------|--------|------------------------------------------|
118
+ | PMD | Image-Text Pairs | TODO | TODO | 3 | 73.85% |
119
+ | LAION | Image-Text Pairs | TODO | TODO | 1 | 6.15% |
120
+ | OBELISC | Unstructured Multimodal Web Documents | TODO | TODO | 3 | 2.82% |
121
+ | Wikipedia | Unstructured Multimodal Web Documents | TODO | TODO | 1 | 17.18% |
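The "Effective Proportion in Number of Tokens" column follows from source sizes and epoch counts. A sketch with hypothetical token counts (the real values are still TODO in the table, so this output will not reproduce it):

```python
# Sketch of how effective proportions follow from token counts and epochs.
# The token counts below are hypothetical placeholders (the real values
# are still TODO in the table), so the output will not match the card.

sources = {
    "PMD":       {"tokens": 70e9, "epochs": 3},
    "LAION":     {"tokens": 17e9, "epochs": 1},
    "OBELISC":   {"tokens": 45e9, "epochs": 3},
    "Wikipedia": {"tokens": 3e9,  "epochs": 1},
}

def effective_proportions(sources):
    """Effective proportion = tokens seen (tokens * epochs) / total tokens seen."""
    seen = {name: s["tokens"] * s["epochs"] for name, s in sources.items()}
    total = sum(seen.values())
    return {name: round(100 * t / total, 2) for name, t in seen.items()}

print(effective_proportions(sources))
```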
122
 
123
+ For multimodal web documents, we feed the model sequences corresponding to the succession of text paragraphs and images.
124
+ For image-text pairs, we form the training sequences by packing images with their captions.
125
+ The images are encoded with the vision encoder, and the vision hidden states are pooled with Transformer Perceiver blocks and then fused into the text sequence through cross-attention blocks.
126
+ The training objective is the standard next token prediction.
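The shifted-label setup behind next-token prediction can be sketched as follows (token ids are illustrative; the visual hidden states fused through cross-attention are omitted):

```python
# Sketch of the standard next-token-prediction setup on a packed sequence.
# Token ids are illustrative; the visual hidden states fused through
# cross-attention are omitted here.

def shift_for_next_token(token_ids):
    """Position t of the inputs is trained to predict position t of the labels."""
    return token_ids[:-1], token_ids[1:]

packed = [5, 17, 17, 42, 9, 2]  # e.g. an image placeholder packed with caption tokens
inputs, labels = shift_for_next_token(packed)
print(inputs)  # [5, 17, 17, 42, 9]
print(labels)  # [17, 17, 42, 9, 2]
```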
127
 
128
 
129
  # Evaluation
130
 
131
  <!-- This section describes the evaluation protocols and provides the results. -->
132
+ We closely follow the evaluation protocol of Flamingo and evaluate ATUM on a suite of downstream image + text benchmarks ranging from visual question answering to image captioning.
133
+ We compare our model to the original Flamingo along with [OpenFlamingo](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b), another open-source reproduction.
134
 
135
+ TODO: beautiful plots of shots scaling laws.
 
 
136
 
137
+ TODO: detail of the numbers in a table.
138
139
 
140
+ # Bias, Risks, and Limitations
141
 
142
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
143
 
144
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
145
+ As a derivative of such a language model, ATUM can produce texts that include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
146
+ Moreover, ATUM can produce factually incorrect texts and should not be relied on to produce factually accurate information.
147
 
148
+ Here are a few examples of outputs that could be categorized as factually incorrect, biased, or offensive:
149
+ TODO: give 4/5 representative examples
150
 
151
+ To measure ATUM's ability to recognize sociological (TODO: find a better adjective) attributes, we evaluate the model on FairFace...
152
+ TODO: include FairFace numbers
153
 
 
 
 
154
 
155
  # Environmental Impact
156
 
 
163
  - **Cloud Provider:** AWS Sagemaker
164
  - **Carbon Emitted:** unknown
165
 
166
+ # Technical Specifications
167
 
168
+ ## Hardware
169
 
170
+ The training was performed on an AWS SageMaker cluster with 64 nodes of 8x80GB A100 GPUs (512 GPUs total). The cluster uses the current EFA network which provides about 340GBps throughput.
 
171
 
172
As the network is quite slow for the needs of DeepSpeed ZeRO-3, we were only able to clock ~90 TFLOPs.
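As a back-of-the-envelope check (assuming the ~90 TFLOPs figure above is per GPU, measured against the A100's 312 TFLOPS dense BF16 peak):

```python
# Back-of-the-envelope utilization check. Assumes the ~90 TFLOPs figure
# quoted above is per GPU; 312 TFLOPS is the A100 dense BF16 peak.

A100_BF16_PEAK_TFLOPS = 312
ACHIEVED_TFLOPS = 90
NUM_GPUS = 64 * 8  # 64 nodes of 8x A100

utilization = ACHIEVED_TFLOPS / A100_BF16_PEAK_TFLOPS
print(f"{NUM_GPUS} GPUs at ~{utilization:.0%} of peak each")
# -> 512 GPUs at ~29% of peak each
```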
173
 
174
+ ## Software
175
 
176
+ The training software is built on top of Hugging Face Transformers and Accelerate, with DeepSpeed ZeRO-3 for training and [WebDataset](https://github.com/webdataset/webdataset) for data loading.
 
 
177
 
178
 
179
  # Citation
 
188
 
189
  More information needed
190
 
 
 
 
 
 
 
 
 
 
191
 
192
  # Model Card Authors [optional]
193
 
 
197
 
198
  # Model Card Contact
199
 
200
+ Please open a discussion on the Community tab!