HugoLaurencon (HF staff) committed
Commit ade3656
Parent: de1ddb0

Update README.md

Files changed (1)
  1. README.md +7 -8
README.md CHANGED
@@ -115,19 +115,18 @@ The model is trained on the following data mixture of openly accessible English
 
 | Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
 |-------------|-----------------------------------------|---------------------------|---------------------------|--------|-----------------------------------------|
- | [PMD](https://huggingface.co/datasets/facebook/pmd) | Image-Text Pairs | TODO | TODO | 3 | 73.85% |
- | [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | TODO | TODO | 1 | 6.15% |
- | [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC) | Unstructured Multimodal Web Documents | TODO | TODO | 3 | 2.82% |
- | [Wikipedia](https://huggingface.co/datasets/wikipedia) | Unstructured Multimodal Web Documents | TODO | TODO | 1 | 17.18% |
+ | [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC) | Unstructured Multimodal Web Documents | TODO | TODO | 1 | 73.85% |
+ | [Wikipedia](https://huggingface.co/datasets/wikipedia) | Unstructured Multimodal Web Documents | TODO | TODO | 3 | 17.18% |
+ | [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | TODO | TODO | 1 | 6.15% |
+ | [PMD](https://huggingface.co/datasets/facebook/pmd) | Image-Text Pairs | TODO | TODO | 3 | 2.82% |
 
- **PMD** is a collection of publicly available image-text pair datasets. It contains pairs from Conceptual Captions, Conceptual Captions 12M, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome and a subset of the YFCC100M dataset. Due to a server failure at the time of pre-processing, we did not include SBU Captions.
-
- **LAION** is a collection of image-text pairs collected from Common Crawl web pages, with the texts obtained from each image's alternative (alt) text.
+ **OBELISC** is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available [here](TODO).
 
 **Wikipedia** is the multimodal equivalent of the encyclopedia. We used the English dump of Wikipedia created on February 20th, 2023.
 
- **OBELISC** is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available [here](TODO).
+ **LAION** is a collection of image-text pairs collected from Common Crawl web pages, with the texts obtained from each image's alternative (alt) text.
 
+ **PMD** is a collection of publicly available image-text pair datasets. It contains pairs from Conceptual Captions, Conceptual Captions 12M, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome and a subset of the YFCC100M dataset. Due to a server failure at the time of pre-processing, we did not include SBU Captions.
 
 For multimodal web documents, we feed the model sequences corresponding to the succession of text paragraphs and images. For image-text pairs, we form the training sequences by packing images with their captions. The images are encoded with the vision encoder, and the vision hidden states are pooled with Transformer Perceiver blocks and then fused into the text sequence through cross-attention blocks.
 
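Reading the updated table, the "Effective Proportion in Number of Tokens" column appears to be each source's token count weighted by its number of epochs and normalized over the whole mixture. Since the per-source token counts are still marked TODO, the sketch below only illustrates that presumed relationship with hypothetical placeholder values; none of the numbers are taken from the model card except the 115B text tokens quoted for OBELISC itself.

```python
# Minimal sketch of the presumed bookkeeping behind the "Effective Proportion" column:
#   share_i = tokens_i * epochs_i / sum_j(tokens_j * epochs_j)
# The token counts below are hypothetical placeholders (the card lists them as TODO).

def effective_proportions(sources):
    """sources: {name: (tokens_in_source, epochs)} -> {name: percentage}."""
    total = sum(tokens * epochs for tokens, epochs in sources.values())
    return {
        name: 100.0 * tokens * epochs / total
        for name, (tokens, epochs) in sources.items()
    }

# Hypothetical example -- NOT the real token counts from the table above.
print(effective_proportions({
    "OBELISC":   (115e9, 1),   # 115B text tokens is quoted for the OBELISC dataset
    "Wikipedia": (9e9,   3),   # placeholder
    "LAION":     (10e9,  1),   # placeholder
    "PMD":       (1.5e9, 3),   # placeholder
}))
```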
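The closing paragraph describes how visual information reaches the language model: images pass through the vision encoder, the resulting hidden states are pooled by Perceiver blocks into a small, fixed set of visual tokens, and those tokens are fused into the text sequence via cross-attention. The PyTorch sketch below is a minimal illustration of that flow under simplified assumptions; the module names, dimensions, and the tanh gating are illustrative choices, not the model's actual implementation.

```python
# Minimal sketch (not the real architecture): vision features -> Perceiver pooling
# -> gated cross-attention into the text hidden states.
import torch
import torch.nn as nn


class PerceiverPooler(nn.Module):
    """Pools a variable number of image patch features into `n_latents` visual tokens."""

    def __init__(self, dim: int, n_latents: int = 64, n_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))  # learned queries
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, n_patches, dim) from the vision encoder
        latents = self.latents.expand(image_feats.size(0), -1, -1)
        pooled, _ = self.attn(latents, image_feats, image_feats)  # latents attend to patches
        return pooled + self.ff(pooled)                           # (batch, n_latents, dim)


class GatedCrossAttentionBlock(nn.Module):
    """Fuses pooled visual tokens into the text hidden states."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0, so fusion starts as a no-op

    def forward(self, text_hidden: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, seq_len, dim); visual_tokens: (batch, n_latents, dim)
        attended, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        return text_hidden + torch.tanh(self.gate) * attended


# Toy usage with random tensors standing in for real encoder outputs.
batch, n_patches, seq_len, dim = 2, 257, 128, 512
image_feats = torch.randn(batch, n_patches, dim)   # vision encoder output
text_hidden = torch.randn(batch, seq_len, dim)     # language model hidden states

visual_tokens = PerceiverPooler(dim)(image_feats)
fused = GatedCrossAttentionBlock(dim)(text_hidden, visual_tokens)
print(fused.shape)  # torch.Size([2, 128, 512])
```

Pooling to a fixed number of latents keeps the per-image cost constant regardless of how many patches the vision encoder emits, which is what makes it practical to pack interleaved web documents and image-caption pairs into regular text training sequences.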