# Datasets

We provide links to download our preprocessed datasets. If you would like to process the data on your own, we will soon provide scripts for you to do so.

## Pretraining

* A small subset of the pretraining data

The pretraining datasets used in OFA are all publicly available. Their public links are listed below. We recommend that you first download the data from these links and then process it into a format similar to the examples we provide.

- _CC12M_: https://github.com/google-research-datasets/conceptual-12m
- _CC3M_: https://github.com/google-research-datasets/conceptual-captions
- _SBU_: https://www.cs.virginia.edu/~vicente/sbucaptions
- _COCO_: https://cocodataset.org/#home
- _VG_: https://visualgenome.org/
- _VQAv2_: https://visualqa.org/
- _GQA_: https://cs.stanford.edu/people/dorarad/gqa/about.html
- _RefCOCO_/_RefCOCO+_/_RefCOCOg_: https://github.com/lichengunc/refer
- _OpenImages_: https://storage.googleapis.com/openimages/web/index.html
- _Objects365_: https://www.objects365.org/overview.html
- _YFCC100M (subset)_: https://github.com/openai/CLIP/blob/main/data/yfcc100m.md
- _ImageNet-21K_: https://image-net.org/index.php
- _Pile_: https://pile.eleuther.ai

## Vision & Language Tasks

* Dataset for Caption
* Dataset for RefCOCO
* Dataset for RefCOCO+
* Dataset for RefCOCOg
* Dataset for VQAv2 (we also provide chunked parts of the dataset files for more convenient downloading; please refer to issue #68)
* Dataset for SNLI-VE
* Dataset for Text-to-Image Generation
* Dataset for Text-to-Image Generation (with original id)

## Vision Tasks

* Dataset for ImageNet-1K

## Language Tasks

* Dataset for COLA
* Dataset for MNLI
* Dataset for MRPC
* Dataset for QNLI
* Dataset for QQP
* Dataset for RTE
* Dataset for SST2
* Dataset for Gigaword