multilingual vision models
Papers I read to understand vision-language models and how to add multilingual capabilities to them
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 85
Note: Great overview of vision-language modelling approaches
Visual Instruction Tuning
Paper • 2304.08485 • Published • 13
Note:
- Among the first works to apply instruction fine-tuning to vision-language models to improve multimodal chat capabilities
- Generated 158K visual instruction-following samples synthetically using language-only GPT-4
- The original LLaVA model connects a pretrained CLIP vision encoder to a pretrained Vicuna LM and is fine-tuned end-to-end on the generated vision-language instruction-following data (see the sketch after this entry)
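A minimal PyTorch sketch of this LLaVA-style wiring, assuming a CLIP ViT-L/14 vision encoder and a Vicuna checkpoint from the Hugging Face Hub; the single linear projector follows the original paper, but the checkpoint names and details are illustrative assumptions, not the authors' training code.

```python
# Illustrative LLaVA-style wiring: frozen CLIP vision encoder -> linear
# projector -> causal LM, with projected image tokens prepended to the
# text embeddings. Checkpoint names are assumptions for this sketch.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel


class LlavaLikeModel(nn.Module):
    def __init__(self,
                 vision_name: str = "openai/clip-vit-large-patch14",
                 lm_name: str = "lmsys/vicuna-7b-v1.5"):
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained(vision_name)
        self.lm = AutoModelForCausalLM.from_pretrained(lm_name)
        # The original LLaVA uses a single linear layer as the projector.
        self.projector = nn.Linear(self.vision.config.hidden_size,
                                   self.lm.config.hidden_size)
        self.vision.requires_grad_(False)  # vision tower stays frozen

    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor):
        patch_feats = self.vision(pixel_values).last_hidden_state  # (B, N, Dv)
        image_tokens = self.projector(patch_feats)                 # (B, N, Dl)
        text_embeds = self.lm.get_input_embeddings()(input_ids)    # (B, T, Dl)
        # Prepend projected image tokens to the text token embeddings.
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.lm(inputs_embeds=inputs_embeds)
```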
Improved Baselines with Visual Instruction Tuning
Paper • 2310.03744 • Published • 37
PALO: A Polyglot Large Multimodal Model for 5B People
Paper • 2402.14818 • Published • 23
Note:
- Develops a multilingual large multimodal model covering 10 languages, using an architecture similar to LLaVA
- Uses a pretrained CLIP vision encoder and Vicuna LM, connected by a two-layer MLP with GELU as the projector between modalities (see the sketch after this entry)
- Multilingual dataset curated using a semi-automated translation pipeline
- Training data is the LLaVA instruction dataset translated into the target languages
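A minimal sketch of the two-layer MLP projector described above; the feature dimensions (1024-d CLIP ViT-L/14 patch features, 4096-d LM embeddings) are illustrative assumptions, not values taken from the paper.

```python
# Two-layer MLP with GELU that maps vision-encoder features into the
# LM embedding space (the LLaVA-1.5 / PALO-style projector).
# Dimensions below are assumptions chosen for illustration.
import torch
import torch.nn as nn


def build_mlp_projector(vision_dim: int, lm_dim: int) -> nn.Module:
    return nn.Sequential(
        nn.Linear(vision_dim, lm_dim),
        nn.GELU(),
        nn.Linear(lm_dim, lm_dim),
    )


# e.g. CLIP ViT-L/14 patch features (1024-d) projected into a 4096-d LM space
projector = build_mlp_projector(vision_dim=1024, lm_dim=4096)
image_tokens = projector(torch.randn(1, 576, 1024))  # (batch, patches, lm_dim)
```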
Aya 23: Open Weight Releases to Further Multilingual Progress
Paper • 2405.15032 • Published • 27
Note:
- Introduces Aya 23, a family of multilingual (text-only) language models supporting 23 languages, based on Cohere’s “Command” model, pre-trained on a data mixture that includes text in 23 languages and fine-tuned on the Aya multilingual instruction data
- Available in 8B and 35B sizes (a minimal usage sketch follows this entry)
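A minimal sketch of prompting Aya 23 with the transformers library; the checkpoint name is an assumed Hugging Face model ID for Cohere's release, and the prompt is only an example.

```python
# Load an Aya 23 checkpoint and generate a reply via the chat template.
# The model ID below is an assumption; swap in the actual release name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-23-8B"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Translate to Hindi: How are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```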
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
Paper • 2402.07827 • Published • 45
Parrot: Multilingual Visual Instruction Tuning
Paper • 2406.02539 • Published • 35
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper • 2401.15947 • Published • 49
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Paper • 2209.06794 • Published • 2