Leyo (HF staff) committed on
Commit 62de87a
1 Parent(s): e23b75c

Switch to IDEFICS / OBELICS

Files changed (1)
  1. README.md +14 -14
README.md CHANGED
@@ -6,7 +6,7 @@ tags:
  - image
  license: other
  datasets:
- - HuggingFaceM4/OBELISC
+ - HuggingFaceM4/OBELICS
  - wikipedia
  - facebook/pmd
  - laion/laion2B-en
@@ -18,8 +18,8 @@ TODO: logo?
  # Model Card for m4-80b

  <!-- Provide a quick summary of what the model is/does. [Optional] -->
- ATUM (**A**dapted **T**ransformers for **U**nstructured **M**ultimodal data) is an open-access reproduction of Flamingo, a closed-source visual language model developed by Deepmind. The multimodal model accepts arbitrary sequences of image and text inputs and produces text outputs and is built solely on public available data and models.
- ATUM (TODO) is on par with the original model on various image + text benchmarks, including visual question answering (open-ended and multiple choice), image captioning, and image classification when evaluated with in-context few-shot learning.
+ IDEFICS (**I**mage-aware **D**ecoder **E**nhanced à la **F**lamingo with **I**nterleaved **C**ross-attention**S**) is an open-access reproduction of Flamingo, a closed-source visual language model developed by DeepMind. The multimodal model accepts arbitrary sequences of image and text inputs, produces text outputs, and is built solely on publicly available data and models.
+ IDEFICS (TODO) is on par with the original model on various image + text benchmarks, including visual question answering (open-ended and multiple choice), image captioning, and image classification when evaluated with in-context few-shot learning.

  The model comes in two variants: a large [80 billion parameters version](https://huggingface.co/HuggingFaceM4/m4-80b) and a [9 billion parameters version](https://huggingface.co/HuggingFaceM4/m4-9b).
  We also fine-tune these base models on a mixture of SFT datasets (TODO: find a more understandable characterization), which boosts the downstream performance while making the models more usable in conversational settings: (TODO: 80B-sfted) and (TODO: 9B sfted).
@@ -72,14 +72,14 @@ We also fine-tune these base models on a mixture of SFT datasets (TODO: find a m
  - **Parent Model:** [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)
  - **Resources for more information:**
  - [GitHub Repo](https://github.com/huggingface/m4/)
- - Description of [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC): [OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
+ - Description of [OBELICS](https://huggingface.co/datasets/HuggingFaceM4/OBELICS): [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
  ](https://huggingface.co/papers/2306.16527)
  - Original Paper: [Flamingo: a Visual Language Model for Few-Shot Learning](https://huggingface.co/papers/2204.14198)

- ATUM is a large multimodal English model that takes sequences of interleaved images and texts as inputs and generates text outputs.
+ IDEFICS is a large multimodal English model that takes sequences of interleaved images and texts as inputs and generates text outputs.
  The model shows strong in-context few-shot learning capabilities (on par with the closed-source model), and is a robust starting point to fine-tune multimodal models on custom data.

- ATUM is built on top of two unimodal open-access pre-trained models to connect the two modalities. Newly initialized parameters in the form of Transformer blocks bridge the gap between the vision encoder and the language model. The model is trained on a mixture of image/text pairs and unstrucutred multimodal web documents.
+ IDEFICS is built on top of two unimodal open-access pre-trained models to connect the two modalities. Newly initialized parameters in the form of Transformer blocks bridge the gap between the vision encoder and the language model. The model is trained on a mixture of image/text pairs and unstructured multimodal web documents.


  # Uses
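
For illustration, a minimal usage sketch for the 9B variant linked in this card, assuming the transformers IDEFICS integration (`IdeficsForVisionText2Text` and `AutoProcessor`); the repo id below is simply the one linked above, and the exact processor call may differ between transformers versions.

```python
# Hedged sketch: assumes the checkpoint loads through the transformers
# IDEFICS classes; the repo id is the one linked in this card and the
# processor call follows the original IDEFICS examples (newer transformers
# versions may expect text=/images= keyword arguments instead).
import torch
from PIL import Image
from transformers import AutoProcessor, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/m4-9b"  # 9B variant linked in this card
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16
).to(device)

# A prompt is an interleaved sequence of images (PIL images or URLs) and text.
image = Image.new("RGB", (224, 224), color="white")  # stand-in for a real image
prompts = [[image, "Question: What is shown in this image? Answer:"]]

inputs = processor(prompts, return_tensors="pt").to(device)
generated = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```
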
@@ -117,12 +117,12 @@ The model is trained on the following data mixture of openly accessible English

  | Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
  |-------------|-----------------------------------------|---------------------------|---------------------------|--------|-----------------------------------------|
- | [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC) | Unstructured Multimodal Web Documents | 114.9B | 353M | 1 | 73.85% |
+ | [OBELICS](https://huggingface.co/datasets/HuggingFaceM4/OBELICS) | Unstructured Multimodal Web Documents | 114.9B | 353M | 1 | 73.85% |
  | [Wikipedia](https://huggingface.co/datasets/wikipedia) | Unstructured Multimodal Web Documents | 3.192B | 39M | 3 | 6.15% |
  | [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | 29.9B | 1.120B | 1 | 17.18% |
  | [PMD](https://huggingface.co/datasets/facebook/pmd) | Image-Text Pairs | 1.6B | 70M | 3 | 2.82% |

- **OBELISC** is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available [here](TODO).
+ **OBELICS** is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available [here](TODO).

  **Wikipedia** is the multimodal equivalent of the encyclopedia. We used the English dump of Wikipedia created on February 20th, 2023.
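
To make the Epochs and Effective Proportion columns above concrete: weighting each source by its token count times its number of epochs reproduces the rough ordering and magnitude of the reported shares (the exact percentages come from the actual sampling configuration used in training). A small back-of-the-envelope check:

```python
# Back-of-the-envelope check of the mixture table above: each source's share
# is approximated as (tokens in source) * (epochs) / total. The card's exact
# percentages reflect the actual sampling configuration, so small differences
# are expected.
sources = {
    # name: (tokens in source, in billions; number of epochs)
    "OBELICS":   (114.9, 1),
    "Wikipedia": (3.192, 3),
    "LAION":     (29.9, 1),
    "PMD":       (1.6, 3),
}

total = sum(tokens * epochs for tokens, epochs in sources.values())
for name, (tokens, epochs) in sources.items():
    share = 100 * tokens * epochs / total
    print(f"{name:10s} ~{share:5.2f}% of training tokens")
```
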
 
@@ -137,7 +137,7 @@ Following [Dehghani et al., 2023](https://huggingface.co/papers/2302.05442), we
  The training objective is the standard next token prediction.

  We use the following hyperparameters and training settings:
- | Parameters | | ATUM | ATUM-9b |
+ | Parameters | | IDEFICS | IDEFICS-9b |
  | -- | -- | -- | -- |
  | Perceiver Resampler | Number of Layers | 6 | 6 |
  | | Number of Latents | 64 | 64 |
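
For readers unfamiliar with the Perceiver Resampler rows above: a minimal, illustrative PyTorch sketch of the idea (a fixed set of learned latent queries cross-attends to the vision encoder's output and compresses it into a small number of visual tokens, as in Flamingo). This is an assumption-based sketch using the layer and latent counts from the table, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Learned latent queries cross-attend to image features and compress
    them into a fixed number of visual tokens (6 layers, 64 latents, as in
    the hyperparameter table above). Illustrative sketch only."""

    def __init__(self, dim: int = 1024, depth: int = 6, num_latents: int = 64, num_heads: int = 16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.layers = nn.ModuleList(
            nn.ModuleList([
                nn.LayerNorm(dim),
                nn.MultiheadAttention(dim, num_heads, batch_first=True),
                nn.LayerNorm(dim),
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            ])
            for _ in range(depth)
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, dim) from the vision encoder.
        batch = image_features.shape[0]
        latents = self.latents.unsqueeze(0).expand(batch, -1, -1)
        for attn_norm, attn, ff_norm, ff in self.layers:
            # Latents attend to the image features (and to themselves).
            keys_values = torch.cat([image_features, latents], dim=1)
            queries = attn_norm(latents)
            attended, _ = attn(queries, keys_values, keys_values, need_weights=False)
            latents = latents + attended
            latents = latents + ff(ff_norm(latents))
        return latents  # (batch, num_latents, dim)

# Example: compress 257 patch features into 64 visual tokens.
resampler = PerceiverResampler(dim=1024)
print(resampler(torch.randn(2, 257, 1024)).shape)  # torch.Size([2, 64, 1024])
```
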
@@ -165,11 +165,11 @@ We use the following hyperparameters and training settings:
  # Evaluation

  <!-- This section describes the evaluation protocols and provides the results. -->
- We closely follow the evaluation protocol of Flamingo and evaluate ATUM on a suite of downstream image + text benchmarks ranging from visual question answering to image captioning.
+ We closely follow the evaluation protocol of Flamingo and evaluate IDEFICS on a suite of downstream image + text benchmarks ranging from visual question answering to image captioning.

  We compare our model to the original Flamingo along with [OpenFlamingo](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b), another open-source reproduction.

- We perform checkpoint selection based on validation sets of TODO, and select the checkpoint at step 65'000 for ATUM-9B and at step 37'500 for ATUM. The models are evaluated with in-context few-shot learning where the priming instances are selected from a support set to be similar (i.e. close in a vector space) to the queried instance. We do not use any form of ensembling.
+ We perform checkpoint selection based on validation sets of TODO, and select the checkpoint at step 65'000 for IDEFICS-9B and at step 37'500 for IDEFICS. The models are evaluated with in-context few-shot learning, where the priming instances are selected from a support set to be similar (i.e. close in a vector space) to the queried instance. We do not use any form of ensembling.

  TODO: beautiful plots of shots scaling laws.
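
The similarity-based selection of priming instances mentioned above can be illustrated with a short sketch (an assumption about the general recipe, not the evaluation code used for this card): given precomputed embeddings, pick the k support examples closest to the query and use them as in-context shots.

```python
# Illustrative sketch of similarity-based in-context example selection:
# rank support examples by cosine similarity to the query embedding and
# keep the k closest ones as priming ("few-shot") instances.
import numpy as np

def select_priming_examples(query_emb: np.ndarray, support_embs: np.ndarray, k: int = 8) -> np.ndarray:
    """query_emb: (d,); support_embs: (n, d). Returns indices of the k nearest support examples."""
    q = query_emb / np.linalg.norm(query_emb)
    s = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)
    similarities = s @ q                   # cosine similarity of each support example to the query
    return np.argsort(-similarities)[:k]   # most similar first

# Random vectors stand in for real image/text embeddings.
rng = np.random.default_rng(0)
support = rng.normal(size=(1000, 512))
query = rng.normal(size=(512,))
print(select_priming_examples(query, support, k=4))
```
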
 
@@ -205,13 +205,13 @@ The training software is built on top of HuggingFace Transformers + Accelerate,
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->

  Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
- As a derivative of such a language model, ATUM can produce texts that include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
- Moreover, ATUM can produce factually incorrect texts, and should not be relied on to produce factually accurate information.
+ As a derivative of such a language model, IDEFICS can produce texts that include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
+ Moreover, IDEFICS can produce factually incorrect texts and should not be relied on to produce factually accurate information.

  Here are a few examples of outputs that could be categorized as factually incorrect, biased, or offensive:
  TODO: give 4/5 representative examples

- To measure ATUM's ability to recognize socilogical (TODO: find a better adjective) attributes, we evaluate the model on FairFace...
+ To measure IDEFICS's ability to recognize sociological (TODO: find a better adjective) attributes, we evaluate the model on FairFace...
  TODO: include FairFace numbers
