BRAVE: Broadening the visual encoding of vision-language models Paper • 2404.07204 • Published Apr 10, 2024 • 14
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data Paper • 2404.15653 • Published Apr 24, 2024 • 24
Chameleon: Mixed-Modal Early-Fusion Foundation Models Paper • 2405.09818 • Published May 2024 • 101
Many-Shot In-Context Learning in Multimodal Foundation Models Paper • 2405.09798 • Published May 2024 • 25
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models Paper • 2405.15738 • Published May 2024 • 42