Leyo committed on
Commit
436c345
1 Parent(s): 2f0f4fc

fix proportion numbers

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -116,8 +116,8 @@ The model is trained on the following data mixture of openly accessible English
  | Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
  |-------------|-----------------------------------------|---------------------------|---------------------------|--------|-----------------------------------------|
  | [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC) | Unstructured Multimodal Web Documents | TODO | TODO | 1 | 73.85% |
- | [Wikipedia](https://huggingface.co/datasets/wikipedia) | Unstructured Multimodal Web Documents | TODO | TODO | 3 | 17.18% |
- | [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | TODO | TODO | 1 | 6.15%
+ | [Wikipedia](https://huggingface.co/datasets/wikipedia) | Unstructured Multimodal Web Documents | TODO | TODO | 3 | 6.15% |
+ | [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | TODO | TODO | 1 | 17.18%
  | [PMD](https://huggingface.co/datasets/facebook/pmd) | Image-Text Pairs | TODO | TODO | 3 | 2.82% | |
 
  **OBELISC** is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available [here](TODO).
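
The change in this hunk can be sanity-checked with a small sketch. It assumes only the percentages shown in the table; the dictionary names `before` and `after` are illustrative and do not come from the repository. It checks that the four effective proportions sum to 100% and that the commit swaps the Wikipedia and LAION values without touching the others.

```python
from decimal import Decimal

# Effective proportions (percent) as listed in the table after this commit.
after = {
    "OBELISC": Decimal("73.85"),
    "Wikipedia": Decimal("6.15"),
    "LAION": Decimal("17.18"),
    "PMD": Decimal("2.82"),
}

# Before the commit, Wikipedia and LAION carried each other's values.
before = dict(after, Wikipedia=after["LAION"], LAION=after["Wikipedia"])

# Decimal avoids floating-point rounding noise when summing percentages.
total_before = sum(before.values())
total_after = sum(after.values())

# Rows whose proportion differs between the two versions.
changed = sorted(name for name in after if after[name] != before[name])

print(f"total before: {total_before}%, total after: {total_after}%")
print("rows changed:", changed)
```

Because the two values are only exchanged, the total is the same in both versions, so a sum check alone would not have caught the original mislabeling.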