arxiv:2209.06794

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Published on Sep 14, 2022

Abstract

Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.

Community

Introduces PaLI (Pathways Language and Image model): large encoder-decoder language models (mT5-XXL, 13B) combined with ViTs; proposes ViT-e ("enormous", 4B parameters) to bring the vision model closer in size to the language model; WebLI: a multilingual image-text training set with 10B images and text in 100+ languages; image-and-text input, text output.

The ViT converts the image into visual tokens; the mT5 transformer encoder takes the text tokens together with these visual tokens, and its output is passed to the decoder, which generates the text output (sketched below). Trained with a mixture of 8 pre-training tasks (span corruption on text-only data, split-captioning on WebLI alt-text, captioning on CC3M-35L, OCR on WebLI OCR-text data, English and cross-lingual VQA, visual question generation, object-aware VQA, object detection), using teacher forcing with a softmax cross-entropy loss.

Best CIDEr score for English image captioning (compared to GIT2, OFA, Flamingo), and strong multilingual captioning; better English VQA accuracy (compared to GIT2, Flamingo, BEiT-3) and cross-lingual VQA on xGQA and MaXM (compared to MPT); on par with mT5-XXL on XTREME (multilingual language understanding), so it does not forget its language capabilities; also evaluated on zero-shot ImageNet classification. Scaling both the language and vision components gives benefits.

Ablations cover the pre-training task mixture, training from loaded checkpoints (of the individual models) vs. from scratch, freezing the ViT backbone during pre-training, and the effect of multilingual examples in training. The appendix contains model details (ViT-e architecture), training details (used T5X and Flaxformer with the Adafactor optimizer; used fewer TPU chips on GCP; continued pre-training at higher resolution), the custom prompts and mixing ratios for the pre-training task mixture, WebLI dataset details, additional results (LiT on ViT-e), an analysis of bias in the dataset, and model cards for PaLI and WebLI.

From Google Research (Lucas Beyer).
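To make the interface concrete, here is a minimal, self-contained JAX sketch of the flow described above: patchify an image into visual tokens, concatenate them with embedded text tokens, run them through an encoder-decoder, and compute the teacher-forced softmax cross-entropy loss. This is not the T5X/Flaxformer implementation; the ViT, encoder, and decoder are single-projection stand-ins, and all names, shapes, and the tiny vocabulary are illustrative assumptions, not PaLI's real ViT-e / mT5-XXL components.

```python
# Minimal JAX sketch: image -> visual tokens, concatenate with embedded text
# tokens, encoder-decoder, teacher-forced softmax cross-entropy loss.
# The ViT, encoder, and decoder are stand-ins (single projections); sizes are toy.
import jax
import jax.numpy as jnp

PATCH, D_MODEL, VOCAB = 16, 256, 1000  # toy sizes for the sketch


def patchify(image):
    """Split an HxWx3 image into flattened, non-overlapping PATCH x PATCH patches."""
    h, w, c = image.shape
    p = image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
    p = p.transpose(0, 2, 1, 3, 4)
    return p.reshape(-1, PATCH * PATCH * c)  # [num_patches, patch_dim]


def init_params(key):
    k1, k2, k3, k4, k5 = jax.random.split(key, 5)
    s = 0.02
    return {
        "vit_proj": s * jax.random.normal(k1, (PATCH * PATCH * 3, D_MODEL)),  # ViT stand-in
        "text_embed": s * jax.random.normal(k2, (VOCAB, D_MODEL)),
        "encoder": s * jax.random.normal(k3, (D_MODEL, D_MODEL)),             # encoder stand-in
        "decoder": s * jax.random.normal(k4, (D_MODEL, D_MODEL)),             # decoder stand-in
        "lm_head": s * jax.random.normal(k5, (D_MODEL, VOCAB)),
    }


def loss_fn(params, image, input_ids, target_ids):
    # 1) Visual tokens produced by the ViT stand-in.
    visual_tokens = patchify(image) @ params["vit_proj"]
    # 2) Embedded text tokens for the prompt.
    text_tokens = params["text_embed"][input_ids]
    # 3) Encoder consumes the concatenated visual + text token sequence.
    encoded = jax.nn.gelu(
        jnp.concatenate([visual_tokens, text_tokens], axis=0) @ params["encoder"])
    # 4) Teacher forcing: the decoder sees the shifted-right targets and
    #    predicts the next token at every position.
    shifted = jnp.concatenate([jnp.zeros((1,), jnp.int32), target_ids[:-1]])
    decoded = jax.nn.gelu(
        params["text_embed"][shifted] @ params["decoder"] + encoded.mean(axis=0))
    logits = decoded @ params["lm_head"]
    # 5) Softmax cross-entropy against the target tokens.
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    nll = -jnp.take_along_axis(log_probs, target_ids[:, None], axis=-1)
    return nll.mean()


key = jax.random.PRNGKey(0)
params = init_params(key)
image = jax.random.uniform(key, (224, 224, 3))
input_ids = jnp.array([5, 17, 42])     # tokenized text prompt (e.g. a captioning instruction)
target_ids = jnp.array([7, 99, 3, 1])  # tokenized target text (caption / answer)
loss, grads = jax.value_and_grad(loss_fn)(params, image, input_ids, target_ids)
print(loss)
```

In the actual model the stand-ins are a full ViT and mT5 encoder-decoder, but the interface is the same: every pre-training task in the mixture above is phrased as an image+text-to-text example, so this one text-generation loss is reused across all of them.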

Links: Website, blog, PapersWithCode, GitHub


Models citing this paper 127


Datasets citing this paper 0


Spaces citing this paper 15

Collections including this paper 2