- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
  Paper • 2401.10208 • Published • 1
- ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
  Paper • 2305.11172 • Published
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
  Paper • 2302.00402 • Published
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
  Paper • 2308.12966 • Published • 6
Collections including paper arxiv:2403.05525
- Beyond Language Models: Byte Models are Digital World Simulators
  Paper • 2402.19155 • Published • 44
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
  Paper • 2402.19427 • Published • 49
- VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
  Paper • 2403.00522 • Published • 40
- Resonance RoPE: Improving Context Length Generalization of Large Language Models
  Paper • 2403.00071 • Published • 19

- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
  Paper • 2402.12226 • Published • 37
- M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition
  Paper • 2401.11649 • Published • 3
- Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition
  Paper • 2402.15504 • Published • 19
- EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
  Paper • 2402.17485 • Published • 182

- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 135
- Orion-14B: Open-source Multilingual Large Language Models
  Paper • 2401.12246 • Published • 10
- MambaByte: Token-free Selective State Space Model
  Paper • 2401.13660 • Published • 47
- MM-LLMs: Recent Advances in MultiModal Large Language Models
  Paper • 2401.13601 • Published • 41

- TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
  Paper • 2312.16862 • Published • 28
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
  Paper • 2312.17172 • Published • 24
- Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
  Paper • 2401.01974 • Published • 4
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
  Paper • 2401.01885 • Published • 26

- DocLLM: A layout-aware generative language model for multimodal document understanding
  Paper • 2401.00908 • Published • 174
- Learning Vision from Models Rivals Learning Vision from Data
  Paper • 2312.17742 • Published • 12
- PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation
  Paper • 2312.17276 • Published • 14
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
  Paper • 2401.02669 • Published • 10

- Kosmos-2.5: A Multimodal Literate Model
  Paper • 2309.11419 • Published • 48
- Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
  Paper • 2311.05698 • Published • 6
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
  Paper • 2311.06242 • Published • 24
- PolyMaX: General Dense Prediction with Mask Transformer
  Paper • 2311.05770 • Published • 6