ColPali: Efficient Document Retrieval with Vision Language Models • Paper • 2407.01449 • Published Jun 27 • 42
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images • Paper • 2412.08802 • Published 9 days ago • 4
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption • Paper • 2412.09283 • Published 8 days ago • 19
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding • Paper • 2412.09616 • Published 8 days ago • 1
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations • Paper • 2412.08580 • Published 9 days ago • 43
FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training • Paper • 2411.11927 • Published Nov 18 • 1
CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions • Paper • 2411.16828 • Published 25 days ago • 1
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training • Paper • 2412.01814 • Published 18 days ago • 1
Active Data Curation Effectively Distills Large-Scale Multimodal Models • Paper • 2411.18674 • Published 23 days ago • 1
FLAIR: VLM with Fine-grained Language-informed Image Representations • Paper • 2412.03561 • Published 16 days ago • 1