lusxvr committed on
Commit d0fd157 · 1 Parent(s): bf50da8

Added notes and changed to wide

Files changed (1):
  1. app/src/content/article.mdx +13 -7
app/src/content/article.mdx CHANGED
@@ -27,9 +27,15 @@ import visualPoster from "./assets/images/visual-vocabulary-poster.png";
 import Accordion from '../components/Accordion.astro'
 
 <Sidenote>
-TLDR; Today, we release FineVision, a new multimodal dataset with 17M images, 24 million samples, 90M question-answer turns and 10B answer tokens. We have extensively cleaned, analysed, and rated every single turn across 4 qualitative metrics with a score from 1-5 to enable the construction and study of individual training mixtures.
-
-Additionally, we ran extensive ablations and compared the performance of models trained on our dataset with common open source alternatives, and achieved better model performance and higher quantity and diversity of data.
+TLDR; Today, we release FineVision, a new multimodal dataset with 17M images, 24 million samples, 90M question-answer turns and 10B answer tokens, comprising 5TB. We have extensively cleaned, analysed, and rated every single turn across 4 qualitative metrics with a score from 1-5 to enable the construction and study of individual training mixtures.
+
+Additionally, we ran extensive ablations and compared the performance of models trained on our dataset with common open-source alternatives. Our dataset is more diverse, and models trained on it achieve an average improvement of 35% across 10 common benchmarks over all baselines.
+
+To use the dataset, simply load it with:
+```python
+from datasets import load_dataset
+ds = load_dataset('HuggingFaceM4/FineVision', name='ai2d_merged', split='train', streaming=True)
+```
 </Sidenote>
 
 ## Introduction
@@ -39,7 +45,7 @@ Even though open-weights Vision-Language Models (VLMs) are becoming ever more po
 ### Data Collection
 We manually collect over 180 image-text datasets from the recent literature and create new subsets in lacking domains.
 
-<FullWidth>
+<Wide>
 <Accordion title="FineVision Subsets">
 |Subset Name |Total Images|Total Samples|Total Turns|Total Question Tokens|Total Answer Tokens|Category |
 |--------------------------------------|------------|-------------|-----------|---------------------|-------------------|----------------------|
@@ -229,7 +235,7 @@ We manually collect over 180 image-text datasets from the recent literature and
 |text_wizardlm_evol |0 |69,999 |69,999 |7,753,963 |21,955,856 |Text-only |
 |text_OpenMathInstruct-2 |0 |1,000,000 |1,000,000 |74,905,850 |413,132,418 |Text-only |
 </Accordion>
-</FullWidth>
+</Wide>
 
 ### Cleaning
 After gathering all the sub-datasets, every turn is cleaned. We remove all individual turns whose combined question and answer length exceeds 8192 tokens. We resize big images to have a longest side of 2048 pixels while keeping the aspect ratio, and discard images with corrupted metadata. This results in a clean final dataset with a maximum turn length of 8192 tokens and a maximum image dimension of 2048 pixels on the longest side.
@@ -241,9 +247,9 @@ There are multiple ways to count the data in a multimodal dataset. The most comm
 
 In total, FineVision has 17.3M images, 24.3M samples, 88.9M turns, and 9.5B answer tokens. Based on these 4 distributions, multiple different mixtures are possible. In conjunction with the provided ratings, we encourage the community to experiment with downsampling large categories, for example according to quality and diversity criteria, and with upsampling high quality samples in small categories.
 
-<FullWidth>
+<Wide>
 <HtmlEmbed src="d3-pie.html" desc="Distribution of Categories in FineVision" align="center" />
-</FullWidth>
+</Wide>
 
 ## Experimental Setup
 To evaluate how our dataset compares to other open-source datasets, we conduct various experiments.
@@ -261,7 +267,7 @@ To evaluate our ablations in a reproducible manner, we utilize lmms-eval during
 Each of our ablations trains a 450M model with a maximal image size of 1536x1536 pixels (without resizing smaller images) and a maximal input token length of 4096. In all single-stage configurations we train for 20k steps on 32 H100s for approximately 20h while evaluating all 11 benchmarks every 1k steps. If not specified otherwise, the “Baseline” in our intra-dataset ablations refers to a training run on the full unfiltered and unchanged dataset.
 
 ### How does FineVision compare against the Baselines?
-Compared against existing VLM training datasets, FineVision produces significantly higher benchmark ranks than the other options.
+Compared against existing VLM training datasets, FineVision produces significantly higher benchmark ranks than the other options. Over the 10 different metrics, FineVision achieves a 45.68% improvement over the Cauldron, a 13.04% improvement over Cambrian, and a 46.83% improvement over LLaVA.
 
 <HtmlEmbed src="against-baselines.html" desc="Average Rank of Models trained on different open source datasets." />
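The cleaning pass described in the diff's context lines (drop turns whose combined question and answer exceed 8192 tokens; resize images so the longest side is at most 2048 pixels while keeping the aspect ratio) can be sketched as below. The helper names are illustrative assumptions, not FineVision's actual pipeline code, and token counts are taken as precomputed inputs:

```python
# Sketch of the cleaning rules stated in the article; only the thresholds
# (8192 tokens per turn, 2048 px longest side) come from the source.

MAX_TURN_TOKENS = 8192
MAX_LONGEST_SIDE = 2048

def keep_turn(question_tokens: int, answer_tokens: int) -> bool:
    """Keep a turn only if question + answer stay within the token cap."""
    return question_tokens + answer_tokens <= MAX_TURN_TOKENS

def resized_dims(width: int, height: int) -> tuple[int, int]:
    """Scale so the longest side is at most 2048 px, preserving aspect ratio."""
    longest = max(width, height)
    if longest <= MAX_LONGEST_SIDE:
        return width, height  # small images are left untouched
    scale = MAX_LONGEST_SIDE / longest
    return round(width * scale), round(height * scale)
```

For example, `resized_dims(4096, 1024)` yields `(2048, 512)`, and a turn with 5000 question tokens and 4000 answer tokens is dropped.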
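As a sanity check on the figures this commit adds, the TLDR's "average improvement of 35%" is consistent with the three per-baseline improvements quoted in the final hunk, assuming a plain unweighted arithmetic mean:

```python
# Per-baseline improvements quoted in the diff's final hunk (percent).
improvements = {"Cauldron": 45.68, "Cambrian": 13.04, "LLaVA": 46.83}

# Unweighted mean; the TLDR rounds this to "an average improvement of 35%".
mean_improvement = sum(improvements.values()) / len(improvements)
print(round(mean_improvement, 2))  # 35.18
```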