vidore/colqwen2-v0.1

vidore-experimental

manu committed on Sep 27

Commit 6b9ef3c • 1 Parent(s): ab6b403

Update README.md

Files changed (1)

README.md +1 -1

@@ -35,7 +35,7 @@ Our training dataset of 127,460 query-page pairs is comprised of train sets of o
 Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify no multi-page PDF document is used both [*ViDoRe*](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d) and in the train set to prevent evaluation contamination.
 A validation set is created with 2% of the samples to tune hyperparameters.
-*Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and potentially occurs during PaliGemma-3B's multimodal training.*
+*Note: Multilingual data is present in the pretraining corpus of the language model and most probably in the multimodal training.*
 ### Parameters