Spaces: flax-community/Multilingual-VQA

gchhablani committed on Jul 24, 2021

Commit 1d86395 (1 parent: 53ddc87)

Update data

Files changed (1)

sections/pretraining/data.md +1 -1

@@ -1 +1 @@

- The dataset we use for pre-training is a cleaned version of [Conceptual 12M](https://github.com/google-research-datasets/conceptual-12m). The dataset is downloaded and then broken images are removed which gives us about 10M images. Then we use the MBart50 `mbart-large-50-one-to-many-mmt` checkpoint to translate the dataset into four different languages - English, French, German, and Spanish, keeping 2.5 million examples of each language.

+ The dataset we use for pre-training is a cleaned version of [Conceptual 12M](https://github.com/google-research-datasets/conceptual-12m). The dataset is downloaded and then broken images are removed which gives us about 10M images. Then we use the MBart50 `mbart-large-50-one-to-many-mmt` checkpoint to translate the dataset into four different languages - English, French, German, and Spanish, keeping 2.5 million examples of each language. This dataset is used for MLM pre-training.
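
The translation step described in this change can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the round-robin assignment of captions to languages, the helper names `build_multilingual_captions` and `make_mbart_translator`, and the `facebook/` prefix on the checkpoint name are all assumptions. English captions are assumed to be kept untranslated, since the checkpoint translates from English.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

# mBART-50 language codes for the four languages named in the text.
LANG_CODES: Dict[str, str] = {
    "English": "en_XX",
    "French": "fr_XX",
    "German": "de_DE",
    "Spanish": "es_XX",
}

# A translator takes (english_text, target_language_code) and returns text.
Translator = Callable[[str, str], str]


def build_multilingual_captions(
    captions: Iterable[str], translate: Translator
) -> List[Tuple[str, str]]:
    """Assign captions to the four languages in round-robin order.

    Round-robin is an assumption made here so that each language ends up
    with an equal share (about 2.5M out of 10M in the text's numbers).
    """
    codes = list(LANG_CODES.values())
    rows: List[Tuple[str, str]] = []
    for i, caption in enumerate(captions):
        code = codes[i % len(codes)]
        # English captions are kept as-is; the others are translated.
        text = caption if code == "en_XX" else translate(caption, code)
        rows.append((code, text))
    return rows


def make_mbart_translator(
    checkpoint: str = "facebook/mbart-large-50-one-to-many-mmt",  # assumed hub id
) -> Translator:
    """Build a translator backed by the MBart50 one-to-many checkpoint.

    Imports are deferred so the rest of the sketch runs without
    `transformers` installed or the model downloaded.
    """
    from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

    tokenizer = MBart50TokenizerFast.from_pretrained(checkpoint, src_lang="en_XX")
    model = MBartForConditionalGeneration.from_pretrained(checkpoint)

    def translate(text: str, target_code: str) -> str:
        inputs = tokenizer(text, return_tensors="pt")
        # The target language is selected by forcing its code as the first token.
        generated = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_code),
        )
        return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

    return translate


# Demo with a stand-in translator so nothing is downloaded.
def fake_translate(text: str, target_code: str) -> str:
    return f"[{target_code}] {text}"


demo_rows = build_multilingual_captions(
    [f"caption {i}" for i in range(8)], fake_translate
)
demo_counts = Counter(code for code, _ in demo_rows)
```

To run against the real model, `make_mbart_translator()` would be passed in place of `fake_translate`; in practice captions would also be batched rather than translated one at a time.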