- DocLLM: A layout-aware generative language model for multimodal document understanding
  Paper • 2401.00908 • Published • 181
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
  Paper • 2401.00849 • Published • 14
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 48
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
  Paper • 2311.00571 • Published • 40
Collections
Collections including paper arxiv:2409.18869
- HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
  Paper • 2411.02959 • Published • 64
- GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details
  Paper • 2411.03047 • Published • 8
- MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D
  Paper • 2411.02336 • Published • 23
- GenXD: Generating Any 3D and 4D Scenes
  Paper • 2411.02319 • Published • 20

- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 603
- CLEAR: Character Unlearning in Textual and Visual Modalities
  Paper • 2410.18057 • Published • 200
- Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
  Paper • 2410.22366 • Published • 75
- Emu3: Next-Token Prediction is All You Need
  Paper • 2409.18869 • Published • 91

- Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
  Paper • 2410.16153 • Published • 42
- AutoTrain: No-code training for state-of-the-art models
  Paper • 2410.15735 • Published • 57
- The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
  Paper • 2410.12787 • Published • 30
- LEOPARD : A Vision Language Model For Text-Rich Multi-Image Tasks
  Paper • 2410.01744 • Published • 25

- Addition is All You Need for Energy-efficient Language Models
  Paper • 2410.00907 • Published • 144
- Emu3: Next-Token Prediction is All You Need
  Paper • 2409.18869 • Published • 91
- An accurate detection is not all you need to combat label noise in web-noisy datasets
  Paper • 2407.05528 • Published • 3
- Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP
  Paper • 2407.00402 • Published • 22

- Qwen2.5-Coder Technical Report
  Paper • 2409.12186 • Published • 138
- Attention Heads of Large Language Models: A Survey
  Paper • 2409.03752 • Published • 88
- Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
  Paper • 2409.02634 • Published • 90
- OmniGen: Unified Image Generation
  Paper • 2409.11340 • Published • 108