VisionLM
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper
• 2402.04252
• Published • 31
Vision Superalignment: Weak-to-Strong Generalization for Vision
Foundation Models
Paper
• 2402.03749
• Published • 15
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper
• 2402.04615
• Published • 45
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance
Loss
Paper
• 2402.05008
• Published • 24
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
Paper
• 2402.05930
• Published • 39
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large
Language Models
Paper
• 2402.05935
• Published • 18
ViGoR: Improving Visual Grounding of Large Vision Language Models with
Fine-Grained Reward Modeling
Paper
• 2402.06118
• Published • 16
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
Paper
• 2402.07456
• Published • 46
PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
Paper
• 2402.07872
• Published • 16
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned
Language Models
Paper
• 2402.07865
• Published • 16
World Model on Million-Length Video And Language With RingAttention
Paper
• 2402.08268
• Published • 40
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong
Vision-language Adapter
Paper
• 2402.10896
• Published • 17
FinTral: A Family of GPT-4 Level Multimodal Financial Large Language
Models
Paper
• 2402.10986
• Published • 83
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Paper
• 2402.12226
• Published • 45
CoLLaVO: Crayon Large Language and Vision mOdel
Paper
• 2402.11248
• Published • 22
Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning
Paper
• 2402.11690
• Published • 10
VideoPrism: A Foundational Visual Encoder for Video Understanding
Paper
• 2402.13217
• Published • 41
Video ReCap: Recursive Captioning of Hour-Long Videos
Paper
• 2402.13250
• Published • 27
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper
• 2402.13232
• Published • 17
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on
Deceptive Prompts
Paper
• 2402.13220
• Published • 14
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large
Vision-Language Models
Paper
• 2402.13577
• Published • 9
PALO: A Polyglot Large Multimodal Model for 5B People
Paper
• 2402.14818
• Published • 24
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper
• 2402.14289
• Published • 21
Sora: A Review on Background, Technology, Limitations, and Opportunities
of Large Vision Models
Paper
• 2402.17177
• Published • 88
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Paper
• 2402.19479
• Published • 35
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
Paper
• 2403.01422
• Published • 30
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
Paper
• 2403.01487
• Published • 17
Finetuned Multimodal Language Models Are High-Quality Image-Text Data
Filters
Paper
• 2403.02677
• Published • 19
Modeling Collaborator: Enabling Subjective Vision Classification With
Minimal Human Effort via LLM Tool-Use
Paper
• 2403.02626
• Published • 11
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal
Datasets
Paper
• 2403.03194
• Published • 15
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large
Language Models
Paper
• 2403.03003
• Published • 11
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper
• 2403.09611
• Published • 130
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper
• 2403.07508
• Published • 78
Synth^2: Boosting Visual-Language Models with Synthetic Captions and
Image Embeddings
Paper
• 2403.07750
• Published • 24
DragAnything: Motion Control for Anything using Entity Representation
Paper
• 2403.07420
• Published • 14
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference
Acceleration for Large Vision-Language Models
Paper
• 2403.06764
• Published • 27
VideoMamba: State Space Model for Efficient Video Understanding
Paper
• 2403.06977
• Published • 29
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Paper
• 2403.05135
• Published • 45
Gemini 1.5: Unlocking multimodal understanding across millions of tokens
of context
Paper
• 2403.05530
• Published • 65
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper
• 2403.05525
• Published • 50
VideoElevator: Elevating Video Generation Quality with Versatile
Text-to-Image Diffusion Models
Paper
• 2403.05438
• Published • 20
Uni-SMART: Universal Science Multimodal Analysis and Research
Transformer
Paper
• 2403.10301
• Published • 54
VideoAgent: Long-form Video Understanding with Large Language Model as
Agent
Paper
• 2403.10517
• Published • 37
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper
• 2403.11703
• Published • 18
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Paper
• 2403.11481
• Published • 13
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document
Understanding
Paper
• 2403.12895
• Published • 32
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
Paper
• 2403.12596
• Published • 12
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual
Math Problems?
Paper
• 2403.14624
• Published • 53
Can large language models explore in-context?
Paper
• 2403.15371
• Published • 33
InternVideo2: Scaling Video Foundation Models for Multimodal Video
Understanding
Paper
• 2403.15377
• Published • 29
SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate
Time series
Paper
• 2403.15360
• Published • 13
VidLA: Video-Language Alignment at Scale
Paper
• 2403.14870
• Published • 14
ViTAR: Vision Transformer with Any Resolution
Paper
• 2403.18361
• Published • 56
Mini-Gemini: Mining the Potential of Multi-modality Vision Language
Models
Paper
• 2403.18814
• Published • 49
sDPO: Don't Use Your Data All at Once
Paper
• 2403.19270
• Published • 41
TextCraftor: Your Text Encoder Can be Image Quality Controller
Paper
• 2403.18978
• Published • 15
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision
Language Models
Paper
• 2403.20331
• Published • 16
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Paper
• 2404.01197
• Published • 31
Direct Preference Optimization of Video Large Multimodal Models from
Language Model Reward
Paper
• 2404.01258
• Published • 12
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with
Interleaved Visual-Textual Tokens
Paper
• 2404.03413
• Published • 27
LVLM-Intrepret: An Interpretability Tool for Large Vision-Language
Models
Paper
• 2404.03118
• Published • 25
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept
Matching
Paper
• 2404.03653
• Published • 35
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Paper
• 2404.05719
• Published • 83
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video
Understanding
Paper
• 2404.05726
• Published • 23
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
Paper
• 2404.05674
• Published • 15
Koala: Key frame-conditioned long video-LLM
Paper
• 2404.04346
• Published • 7
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model
Handling Resolutions from 336 Pixels to 4K HD
Paper
• 2404.06512
• Published • 31
Adapting LLaMA Decoder to Vision Transformer
Paper
• 2404.06773
• Published • 19
BRAVE: Broadening the visual encoding of vision-language models
Paper
• 2404.07204
• Published • 20
Transferable and Principled Efficiency for Open-Vocabulary Segmentation
Paper
• 2404.07448
• Published • 12
Ferret-v2: An Improved Baseline for Referring and Grounding with Large
Language Models
Paper
• 2404.07973
• Published • 33
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
Paper
• 2404.09990
• Published • 14
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal
Large Language Models
Paper
• 2404.09204
• Published • 11
On Speculative Decoding for Multimodal Large Language Models
Paper
• 2404.08856
• Published • 13
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language
Models
Paper
• 2404.12387
• Published • 41
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper
• 2404.12390
• Published • 27
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Paper
• 2404.14239
• Published • 9
A Multimodal Automated Interpretability Agent
Paper
• 2404.14394
• Published • 22
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper
• 2404.12803
• Published • 30
Groma: Localized Visual Tokenization for Grounding Multimodal Large
Language Models
Paper
• 2404.13013
• Published • 31
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster
Pre-training on Web-scale Image-Text Data
Paper
• 2404.15653
• Published • 29
Editable Image Elements for Controllable Synthesis
Paper
• 2404.16029
• Published • 12
MoDE: CLIP Data Experts via Clustering
Paper
• 2404.16030
• Published • 14
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with
Text-Rich Visual Comprehension
Paper
• 2404.16790
• Published • 10
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal
Models with Open-Source Suites
Paper
• 2404.16821
• Published • 59
List Items One by One: A New Data Source and Learning Paradigm for
Multimodal LLMs
Paper
• 2404.16375
• Published • 18
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video
Dense Captioning
Paper
• 2404.16994
• Published • 38
HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring
Unconstrained Photo Collections
Paper
• 2404.16845
• Published • 7
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models
Paper
• 2404.17672
• Published • 19
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual
and Action Representations
Paper
• 2404.17521
• Published • 13
Automatic Creative Selection with Cross-Modal Matching
Paper
• 2405.00029
• Published • 9
What matters when building vision-language models?
Paper
• 2405.02246
• Published • 104
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large
Language Models in Code Generation from Scientific Plots
Paper
• 2405.07990
• Published • 20
No Time to Waste: Squeeze Time into Channel for Mobile Video
Understanding
Paper
• 2405.08344
• Published • 15
Understanding the performance gap between online and offline alignment
algorithms
Paper
• 2405.08448
• Published • 18
SpeechVerse: A Large-scale Generalizable Audio Language Model
Paper
• 2405.08295
• Published • 19
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large
Language Models
Paper
• 2405.08317
• Published • 13
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
Paper
• 2405.09215
• Published • 22
LoRA Learns Less and Forgets Less
Paper
• 2405.09673
• Published • 91
Many-Shot In-Context Learning in Multimodal Foundation Models
Paper
• 2405.09798
• Published • 32
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper
• 2405.09818
• Published • 135
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Paper
• 2405.10300
• Published • 31
Toon3D: Seeing Cartoons from a New Perspective
Paper
• 2405.10320
• Published • 22
Octo: An Open-Source Generalist Robot Policy
Paper
• 2405.12213
• Published • 29
Imp: Highly Capable Large Multimodal Models for Mobile Devices
Paper
• 2405.12107
• Published • 29
Your Transformer is Secretly Linear
Paper
• 2405.12250
• Published • 157
Diffusion for World Modeling: Visual Details Matter in Atari
Paper
• 2405.12399
• Published • 30
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment
Capability
Paper
• 2405.14129
• Published • 14
CamViG: Camera Aware Image-to-Video Generation with Multimodal
Transformers
Paper
• 2405.13195
• Published • 12
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision
Models
Paper
• 2405.15574
• Published • 55
Denoising LM: Pushing the Limits of Error Correction Models for Speech
Recognition
Paper
• 2405.15216
• Published • 15
An Introduction to Vision-Language Modeling
Paper
• 2405.17247
• Published • 91
Matryoshka Multimodal Models
Paper
• 2405.17430
• Published • 35
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding
Models
Paper
• 2405.17428
• Published • 21
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal
Models
Paper
• 2405.15738
• Published • 46
Dense Connector for MLLMs
Paper
• 2405.13800
• Published • 24
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
Paper
• 2405.14598
• Published • 13
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Paper
• 2405.20204
• Published • 37
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Paper
• 2405.18669
• Published • 12
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Paper
• 2405.20340
• Published • 20
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of
Multi-modal LLMs in Video Analysis
Paper
• 2405.21075
• Published • 26
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
Paper
• 2406.00888
• Published • 33
Parrot: Multilingual Visual Instruction Tuning
Paper
• 2406.02539
• Published • 37
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with
LLM
Paper
• 2406.02884
• Published • 18
ShareGPT4Video: Improving Video Understanding and Generation with Better
Captions
Paper
• 2406.04325
• Published • 75
AgentGym: Evolving Large Language Model-based Agents across Diverse
Environments
Paper
• 2406.04151
• Published • 24
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective
Navigation via Multi-Agent Collaboration
Paper
• 2406.01014
• Published • 33
Vript: A Video Is Worth Thousands of Words
Paper
• 2406.06040
• Published • 28
An Image is Worth 32 Tokens for Reconstruction and Generation
Paper
• 2406.07550
• Published • 60
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Paper
• 2406.06911
• Published • 12
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio
Understanding in Video-LLMs
Paper
• 2406.07476
• Published • 37
What If We Recaption Billions of Web Images with LLaMA-3?
Paper
• 2406.08478
• Published • 43
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation
in Videos
Paper
• 2406.08407
• Published • 28
Needle In A Multimodal Haystack
Paper
• 2406.07230
• Published • 55
mDPO: Conditional Preference Optimization for Multimodal Large Language
Models
Paper
• 2406.11839
• Published • 40
VideoLLM-online: Online Video Large Language Model for Streaming Video
Paper
• 2406.11816
• Published • 26
TroL: Traversal of Layers for Large Language and Vision Models
Paper
• 2406.12246
• Published • 36
VoCo-LLaMA: Towards Vision Compression with Large Language Models
Paper
• 2406.12275
• Published • 31
Benchmarking Multi-Image Understanding in Vision and Language Models:
Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
Paper
• 2406.12742
• Published • 15
Adversarial Attacks on Multimodal Agents
Paper
• 2406.12814
• Published • 4
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of
Multimodal Large Language Models
Paper
• 2406.11230
• Published • 33
Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations
for Vision Foundation Models
Paper
• 2406.12649
• Published • 16
Understanding Hallucinations in Diffusion Models through Mode
Interpolation
Paper
• 2406.09358
• Published • 5
CMC-Bench: Towards a New Paradigm of Visual Signal Compression
Paper
• 2406.09356
• Published • 6
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Paper
• 2406.09406
• Published • 15
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal
Language Models
Paper
• 2406.09403
• Published • 23
MuirBench: A Comprehensive Benchmark for Robust Multi-image
Understanding
Paper
• 2406.09411
• Published • 19
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Paper
• 2406.08707
• Published • 17
EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal
Prompts
Paper
• 2406.09162
• Published • 14
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images
Interleaved with Text
Paper
• 2406.08418
• Published • 33
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on
Mobile Devices
Paper
• 2406.08451
• Published • 26
Paper
• 2406.04127
• Published • 39
NaRCan: Natural Refined Canonical Image with Integration of Diffusion
Prior for Video Editing
Paper
• 2406.06523
• Published • 53
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Paper
• 2406.08487
• Published • 14
VCR: Visual Caption Restoration
Paper
• 2406.06462
• Published • 13
An Image is Worth More Than 16x16 Patches: Exploring Transformers on
Individual Pixels
Paper
• 2406.09415
• Published • 52
OpenVLA: An Open-Source Vision-Language-Action Model
Paper
• 2406.09246
• Published • 47
DiTFastAttn: Attention Compression for Diffusion Transformer Models
Paper
• 2406.08552
• Published • 25
Physics3D: Learning Physical Properties of 3D Gaussians via Video
Diffusion
Paper
• 2406.04338
• Published • 39
Hibou: A Family of Foundational Vision Transformers for Pathology
Paper
• 2406.05074
• Published • 10
Make It Count: Text-to-Image Generation with an Accurate Number of
Objects
Paper
• 2406.10210
• Published • 78
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context
Reinforcement Learning
Paper
• 2406.08973
• Published • 89
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and
Instruction-Tuning Dataset for LVLMs
Paper
• 2406.11833
• Published • 62
Exploring the Role of Large Language Models in Prompt Encoding for
Diffusion Models
Paper
• 2406.11831
• Published • 22
From Pixels to Prose: A Large Dataset of Dense Image Captions
Paper
• 2406.10328
• Published • 18
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Paper
• 2406.14544
• Published • 35
WildVision: Evaluating Vision-Language Models in the Wild with Human
Preferences
Paper
• 2406.11069
• Published • 14
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal
Dataset with One Trillion Tokens
Paper
• 2406.11271
• Published • 22
Paper
• 2406.11775
• Published • 9
Unifying Multimodal Retrieval via Document Screenshot Embedding
Paper
• 2406.11251
• Published • 12
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN
Inversion and High Quality Image Editing
Paper
• 2406.10601
• Published • 70
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video
Understanding
Paper
• 2406.14515
• Published • 33
Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation
Modelling in Large Multimodal Models
Paper
• 2406.14035
• Published • 13
ICAL: Continual Learning of Multimodal Agents by Transforming
Trajectories into Actionable Insights
Paper
• 2406.14596
• Published • 5
Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical
Report
Paper
• 2406.11403
• Published • 4
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in
Large Video-Language Models
Paper
• 2406.16338
• Published • 26
Long Context Transfer from Language to Vision
Paper
• 2406.16852
• Published • 33
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper
• 2406.16860
• Published • 63
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
Paper
• 2406.17770
• Published • 19
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Paper
• 2406.15704
• Published • 6
Octo-planner: On-device Language Model for Planner-Action Agents
Paper
• 2406.18082
• Published • 48
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal
LLMs
Paper
• 2406.18521
• Published • 31
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
Paper
• 2406.15334
• Published • 9
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large
Language Models
Paper
• 2406.17294
• Published • 11
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and
Understanding
Paper
• 2406.19389
• Published • 54
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of
LLMs
Paper
• 2406.18629
• Published • 42
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
Paper
• 2406.18790
• Published • 34
Simulating Classroom Education with LLM-Empowered Agents
Paper
• 2406.19226
• Published • 32
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for
Vision-Language Models
Paper
• 2406.10900
• Published • 11
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Paper
• 2406.20095
• Published • 18
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything
Model
Paper
• 2406.20076
• Published • 10
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
Paper
• 2406.17720
• Published • 8
We-Math: Does Your Large Multimodal Model Achieve Human-like
Mathematical Reasoning?
Paper
• 2407.01284
• Published • 81
ROS-LLM: A ROS framework for embodied AI with task feedback and
structured reasoning
Paper
• 2406.19741
• Published • 60
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and
Efficient Evaluation
Paper
• 2407.00468
• Published • 36
ColPali: Efficient Document Retrieval with Vision Language Models
Paper
• 2407.01449
• Published • 51
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables
Open-World Instruction Following Agents
Paper
• 2407.00114
• Published • 13
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper
• 2407.02477
• Published • 24
InternLM-XComposer-2.5: A Versatile Large Vision Language Model
Supporting Long-Contextual Input and Output
Paper
• 2407.03320
• Published • 94
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper
• 2407.02392
• Published • 23
Unveiling Encoder-Free Vision-Language Models
Paper
• 2406.11832
• Published • 55
Flash-VStream: Memory-Based Real-Time Understanding for Long Video
Streams
Paper
• 2406.08085
• Published • 17
Granular Privacy Control for Geolocation with Vision Language Models
Paper
• 2407.04952
• Published • 7
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for
Interleaved Image-Text Generation
Paper
• 2407.06135
• Published • 22
Multi-Object Hallucination in Vision-Language Models
Paper
• 2407.06192
• Published • 12
Vision language models are blind
Paper
• 2407.06581
• Published • 84
VIMI: Grounding Video Generation through Multi-modal Instruction
Paper
• 2407.06304
• Published • 10
Video-to-Audio Generation with Hidden Alignment
Paper
• 2407.07464
• Published • 17
Stark: Social Long-Term Multi-Modal Conversation with Persona
Commonsense Knowledge
Paper
• 2407.03958
• Published • 21
Understanding Visual Feature Reliance through the Lens of Complexity
Paper
• 2407.06076
• Published • 6
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting
Region Captions
Paper
• 2407.06723
• Published • 11
PaliGemma: A versatile 3B VLM for transfer
Paper
• 2407.07726
• Published • 73
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large
Multimodal Models
Paper
• 2407.07895
• Published • 42
Do Vision and Language Models Share Concepts? A Vector Space Alignment
Study
Paper
• 2302.06555
• Published • 9
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal
Perception
Paper
• 2407.08303
• Published • 19
The Synergy between Data and Multi-Modal Large Language Models: A Survey
from Co-Development Perspective
Paper
• 2407.08583
• Published • 13
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning
Instruction Using Language Model
Paper
• 2407.07053
• Published • 47
E5-V: Universal Embeddings with Multimodal Large Language Models
Paper
• 2407.12580
• Published • 43
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Paper
• 2407.12679
• Published • 8
AUITestAgent: Automatic Requirements Oriented GUI Function Testing
Paper
• 2407.09018
• Published • 5
ThinkGrasp: A Vision-Language System for Strategic Part Grasping in
Clutter
Paper
• 2407.11298
• Published • 6
NavGPT-2: Unleashing Navigational Reasoning Capability for Large
Vision-Language Models
Paper
• 2407.12366
• Published • 4
Benchmarking Trustworthiness of Multimodal Large Language Models: A
Comprehensive Study
Paper
• 2406.07057
• Published • 17
EVLM: An Efficient Vision-Language Model for Visual Understanding
Paper
• 2407.14177
• Published • 45
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document
Understanding
Paper
• 2407.12594
• Published • 19
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language
Models
Paper
• 2407.15841
• Published • 39
VideoGameBunny: Towards vision assistants for video games
Paper
• 2407.15295
• Published • 23
CGB-DM: Content and Graphic Balance Layout Generation with
Transformer-based Diffusion Model
Paper
• 2407.15233
• Published • 7
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any
Person
Paper
• 2407.16224
• Published • 29
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence
Paper
• 2407.16655
• Published • 30
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal
Large Language Model
Paper
• 2407.16198
• Published • 13
VILA^2: VILA Augmented VILA
Paper
• 2407.17453
• Published • 41
Learning to Manipulate Anywhere: A Visual Generalizable Framework For
Reinforcement Learning
Paper
• 2407.15815
• Published • 14
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
Paper
• 2407.17490
• Published • 31
Efficient Inference of Vision Instruction-Following Models with Elastic
Cache
Paper
• 2407.18121
• Published • 17
VSSD: Vision Mamba with Non-Causal State Space Duality
Paper
• 2407.18559
• Published • 20
Wolf: Captioning Everything with a World Summarization Framework
Paper
• 2407.18908
• Published • 33
Diffusion Feedback Helps CLIP See Better
Paper
• 2407.20171
• Published • 36
VolDoGer: LLM-assisted Datasets for Domain Generalization in
Vision-Language Tasks
Paper
• 2407.19795
• Published • 11
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Paper
• 2407.19985
• Published • 37
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware
Experts
Paper
• 2407.21770
• Published • 22
Towards Achieving Human Parity on End-to-end Simultaneous Speech
Translation via LLM Agent
Paper
• 2407.21646
• Published • 18
ShieldGemma: Generative AI Content Moderation Based on Gemma
Paper
• 2407.21772
• Published • 15
Open-Vocabulary Audio-Visual Semantic Segmentation
Paper
• 2407.21721
• Published • 9
SAM 2: Segment Anything in Images and Videos
Paper
• 2408.00714
• Published • 123
OmniParser for Pure Vision Based GUI Agent
Paper
• 2408.00203
• Published • 24
Generalized Out-of-Distribution Detection and Beyond in Vision Language
Model Era: A Survey
Paper
• 2407.21794
• Published • 6
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper
• 2408.01800
• Published • 96
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation
with Multimodal Generative Pretraining
Paper
• 2408.02657
• Published • 35
Language Model Can Listen While Speaking
Paper
• 2408.02622
• Published • 40
ExoViP: Step-by-step Verification and Exploration with Exoskeleton
Modules for Compositional Visual Reasoning
Paper
• 2408.02210
• Published • 9
Operationalizing Contextual Integrity in Privacy-Conscious Assistants
Paper
• 2408.02373
• Published • 5
LLaVA-OneVision: Easy Visual Task Transfer
Paper
• 2408.03326
• Published • 61
Diffusion Models as Data Mining Tools
Paper
• 2408.02752
• Published • 15
AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual
Segmentation
Paper
• 2408.01708
• Published • 4
Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in
Long-Horizon Tasks
Paper
• 2408.03615
• Published • 31
Openstory++: A Large-scale Dataset and Benchmark for Instance-aware
Open-domain Visual Storytelling
Paper
• 2408.03695
• Published • 13
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond
Paper
• 2408.03900
• Published • 10
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from
User's Casual Sketches
Paper
• 2408.04567
• Published • 26
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language
Models
Paper
• 2408.04594
• Published • 14
Puppet-Master: Scaling Interactive Video Generation as a Motion Prior
for Part-Level Dynamics
Paper
• 2408.04631
• Published • 9
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper
• 2408.05211
• Published • 50
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal
Large Language Models
Paper
• 2408.04840
• Published • 33
UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond
Scaling
Paper
• 2408.04810
• Published • 24
ControlNeXt: Powerful and Efficient Control for Image and Video
Generation
Paper
• 2408.06070
• Published • 55
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation
Agents
Paper
• 2408.06327
• Published • 17
UniPortrait: A Unified Framework for Identity-Preserving Single- and
Multi-Human Image Personalization
Paper
• 2408.05939
• Published • 14
Paper
• 2408.07009
• Published • 62
Amuro & Char: Analyzing the Relationship between Pre-Training and
Fine-Tuning of Large Language Models
Paper
• 2408.06663
• Published • 16
Paper
• 2408.05366
• Published • 14
Towards flexible perception with visual memory
Paper
• 2408.08172
• Published • 23
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper
• 2408.08872
• Published • 101
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations
Paper
• 2408.08459
• Published • 45
D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning
Paper
• 2408.08441
• Published • 8
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper
• 2408.10188
• Published • 52
MegaFusion: Extend Diffusion Models towards Higher-resolution Image
Generation without Further Tuning
Paper
• 2408.11001
• Published • 13
Factorized-Dreamer: Training A High-Quality Video Generator with Limited
and Low-Quality Data
Paper
• 2408.10119
• Published • 17
Transfusion: Predict the Next Token and Diffuse Images with One
Multi-Modal Model
Paper
• 2408.11039
• Published • 63
NeCo: Improving DINOv2's spatial representations in 19 GPU hours with
Patch Neighbor Consistency
Paper
• 2408.11054
• Published • 14
Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion
for Efficient Inference Intervention in Large Language Model
Paper
• 2408.10764
• Published • 9
Audio Match Cutting: Finding and Creating Matching Audio Transitions in
Movies and Videos
Paper
• 2408.10998
• Published • 9
MambaEVT: Event Stream based Visual Object Tracking using State Space
Model
Paper
• 2408.10487
• Published • 7
FocusLLM: Scaling LLM's Context by Parallel Decoding
Paper
• 2408.11745
• Published • 25
TWLV-I: Analysis and Insights from Holistic Evaluation on Video
Foundation Models
Paper
• 2408.11318
• Published • 57
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
Paper
• 2408.11817
• Published • 9
FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive
Prompt Weighting
Paper
• 2408.11706
• Published • 7
TrackGo: A Flexible and Efficient Method for Controllable Video
Generation
Paper
• 2408.11475
• Published • 18
Out-of-Distribution Detection with Attention Head Masking for Multimodal
Document Classification
Paper
• 2408.11237
• Published • 6
Iterative Object Count Optimization for Text-to-image Diffusion Models
Paper
• 2408.11721
• Published • 6
Sapiens: Foundation for Human Vision Models
Paper
• 2408.12569
• Published • 93
Show-o: One Single Transformer to Unify Multimodal Understanding and
Generation
Paper
• 2408.12528
• Published • 51
Open-FinLLMs: Open Multimodal Large Language Models for Financial
Applications
Paper
• 2408.11878
• Published • 64
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed
Representations
Paper
• 2408.12590
• Published • 35
Scalable Autoregressive Image Generation with Mamba
Paper
• 2408.12245
• Published • 26
Real-Time Video Generation with Pyramid Attention Broadcast
Paper
• 2408.12588
• Published • 17
SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for
Large-scale Vision-Language Models
Paper
• 2408.12114
• Published • 15
Anim-Director: A Large Multimodal Model Powered Agent for Controllable
Animation Video Generation
Paper
• 2408.09787
• Published • 10
Building and better understanding vision-language models: insights and
future directions
Paper
• 2408.12637
• Published • 134
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution
Real-World Scenarios that are Difficult for Humans?
Paper
• 2408.13257
• Published • 26
CustomCrafter: Customized Video Generation with Preserving Motion and
Concept Composition Abilities
Paper
• 2408.13239
• Published • 11
Foundation Models for Music: A Survey
Paper
• 2408.14340
• Published • 44
LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!
Paper
• 2408.13402
• Published • 18
TVG: A Training-free Transition Video Generation Method with Diffusion
Models
Paper
• 2408.13413
• Published • 14
BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and
Deduplication by Introducing a Competitive Large Language Model Baseline
Paper
• 2408.15079
• Published • 56
Law of Vision Representation in MLLMs
Paper
• 2408.16357
• Published • 95
CogVLM2: Visual Language Models for Image and Video Understanding
Paper
• 2408.16500
• Published • 58
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio
Language Modeling
Paper
• 2408.16532
• Published • 50
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Paper
• 2408.16725
• Published • 53
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time
Series Forecasters
Paper
• 2408.17253
• Published • 39
TableBench: A Comprehensive and Complex Benchmark for Table Question
Answering
Paper
• 2408.09174
• Published • 53
VideoLLaMB: Long-context Video Understanding with Recurrent Memory
Bridges
Paper
• 2409.01071
• Published • 27
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world
Videos
Paper
• 2409.02095
• Published • 38
LinFusion: 1 GPU, 1 Minute, 16K Image
Paper
• 2409.02097
• Published • 34
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via
Hybrid Architecture
Paper
• 2409.02889
• Published • 54
Attention Heads of Large Language Models: A Survey
Paper
• 2409.03752
• Published • 92
Open-MAGVIT2: An Open-Source Project Toward Democratizing
Auto-regressive Visual Generation
Paper
• 2409.04410
• Published • 25
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Paper
• 2409.05840
• Published • 49
Towards a Unified View of Preference Learning for Large Language Models:
A Survey
Paper
• 2409.02795
• Published • 72
POINTS: Improving Your Vision-language Model with Affordable Strategies
Paper
• 2409.04828
• Published • 24
Benchmarking Chinese Knowledge Rectification in Large Language Models
Paper
• 2409.05806
• Published • 15
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Paper
• 2409.06666
• Published • 60
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
Paper
• 2409.06135
• Published • 16
PingPong: A Benchmark for Role-Playing Language Models with User
Emulation and Multi-Model Evaluation
Paper
• 2409.06820
• Published • 68
MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View
Synthesis
Paper
• 2409.07129
• Published • 8
PiTe: Pixel-Temporal Alignment for Large Video-Language Model
Paper
• 2409.07239
• Published • 15
Ferret: Federated Full-Parameter Tuning at Scale for Large Language
Models
Paper
• 2409.06277
• Published • 15
Guiding Vision-Language Model Selection for Visual Question-Answering
Across Tasks, Domains, and Knowledge Types
Paper
• 2409.09269
• Published • 8
One missing piece in Vision and Language: A Survey on Comics
Understanding
Paper
• 2409.09502
• Published • 24
NVLM: Open Frontier-Class Multimodal LLMs
Paper
• 2409.11402
• Published • 75
OmniGen: Unified Image Generation
Paper
• 2409.11340
• Published • 115
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
Paper
• 2409.11355
• Published • 30
OSV: One Step is Enough for High-Quality Image to Video Generation
Paper
• 2409.11367
• Published • 14
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page
Document Understanding
Paper
• 2409.03420
• Published • 26
InstantDrag: Improving Interactivity in Drag-based Image Editing
Paper
• 2409.08857
• Published • 34
AudioBERT: Audio Knowledge Augmented Language Model
Paper
• 2409.08199
• Published • 5
LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study
Paper
• 2409.08554
• Published • 3
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at
Any Resolution
Paper
• 2409.12191
• Published • 80
Qwen2.5-Coder Technical Report
Paper
• 2409.12186
• Published • 157
Preference Tuning with Human Feedback on Language, Speech, and Vision
Tasks: A Survey
Paper
• 2409.11564
• Published • 20
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
Paper
• 2409.12139
• Published • 12
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary
Resolution
Paper
• 2409.12961
• Published • 25
StoryMaker: Towards Holistic Consistent Characters in Text-to-image
Generation
Paper
• 2409.12576
• Published • 16
Imagine yourself: Tuning-Free Personalized Image Generation
Paper
• 2409.13346
• Published • 69
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating
Satire Comprehension capability of Vision-Language Models
Paper
• 2409.13592
• Published • 50
Portrait Video Editing Empowered by Multimodal Generative Priors
Paper
• 2409.13591
• Published • 16
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language
Instructions
Paper
• 2409.15278
• Published • 24
Phantom of Latent for Large Language and Vision Models
Paper
• 2409.14713
• Published • 29
Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror
Reflections
Paper
• 2409.14677
• Published • 15
MIMO: Controllable Character Video Synthesis with Spatial Decomposed
Modeling
Paper
• 2409.16160
• Published • 34
MonoFormer: One Transformer for Both Diffusion and Autoregression
Paper
• 2409.16280
• Published • 18
Seeing Faces in Things: A Model and Dataset for Pareidolia
Paper
• 2409.16143
• Published • 18
Attention Prompting on Image for Large Vision-Language Models
Paper
• 2409.17143
• Published • 7
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art
Multimodal Models
Paper
• 2409.17146
• Published • 123
MIO: A Foundation Model on Multimodal Tokens
Paper
• 2409.17692
• Published • 53
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Paper
• 2409.20566
• Published • 54
Visual Question Decomposition on Multimodal Large Language Models
Paper
• 2409.19339
• Published • 8
Loong: Generating Minute-level Long Videos with Autoregressive Language
Models
Paper
• 2410.02757
• Published • 36
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal
Foundation Models
Paper
• 2410.02740
• Published • 54
LLaVA-Critic: Learning to Evaluate Multimodal Models
Paper
• 2410.02712
• Published • 37
Interpreting and Editing Vision-Language Representations to Mitigate
Hallucinations
Paper
• 2410.02762
• Published • 8
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short
Videos
Paper
• 2410.02763
• Published • 7
Addition is All You Need for Energy-efficient Language Models
Paper
• 2410.00907
• Published • 151
VideoGuide: Improving Video Diffusion Models without Training Through a
Teacher's Guide
Paper
• 2410.04364
• Published • 29
Navigating the Digital World as Humans Do: Universal Visual Grounding
for GUI Agents
Paper
• 2410.05243
• Published • 20
UniMuMo: Unified Text, Music and Motion Generation
Paper
• 2410.04534
• Published • 19
TLDR: Token-Level Detective Reward Model for Large Vision Language
Models
Paper
• 2410.04734
• Published • 19
OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal
Instruction
Paper
• 2410.04932
• Published • 9
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive
Transformer for Efficient Finegrained Image Generation
Paper
• 2410.01912
• Published • 14
ControlAR: Controllable Image Generation with Autoregressive Models
Paper
• 2410.02705
• Published • 11
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video
Large Language Models
Paper
• 2410.03290
• Published • 7
Aria: An Open Multimodal Native Mixture-of-Experts Model
Paper
• 2410.05993
• Published • 111
Personalized Visual Instruction Tuning
Paper
• 2410.07113
• Published • 70
Paper
• 2410.07073
• Published • 69
IterComp: Iterative Composition-Aware Feedback Learning from Model
Gallery for Text-to-Image Generation
Paper
• 2410.07171
• Published • 43
Deciphering Cross-Modal Alignment in Large Vision-Language Models with
Modality Integration Rate
Paper
• 2410.07167
• Published • 39
Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation
Learning
Paper
• 2410.06373
• Published • 36
Pyramidal Flow Matching for Efficient Video Generative Modeling
Paper
• 2410.05954
• Published • 40
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark
for Video Generation
Paper
• 2410.05363
• Published • 45
Story-Adapter: A Training-free Iterative Framework for Long Story
Visualization
Paper
• 2410.06244
• Published • 20
MM-Ego: Towards Building Egocentric Multimodal LLMs
Paper
• 2410.07177
• Published • 22
TweedieMix: Improving Multi-Concept Fusion for Diffusion-based
Image/Video Generation
Paper
• 2410.05591
• Published • 13
Temporal Reasoning Transfer from Text to Video
Paper
• 2410.06166
• Published • 13
MLLM as Retriever: Interactively Learning Multimodal Retrieval for
Embodied Agents
Paper
• 2410.03450
• Published • 36
Intriguing Properties of Large Language and Vision Models
Paper
• 2410.04751
• Published • 16
Progressive Autoregressive Video Diffusion Models
Paper
• 2410.08151
• Published • 16
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving
Vision-Linguistic Compositionality
Paper
• 2410.05210
• Published • 11
Self-Boosting Large Language Models with Synthetic Preference Data
Paper
• 2410.06961
• Published • 16
WALL-E: World Alignment by Rule Learning Improves World Model-based LLM
Agents
Paper
• 2410.07484
• Published • 51
Agent S: An Open Agentic Framework that Uses Computers Like a Human
Paper
• 2410.08164
• Published • 26
GLOV: Guided Large Language Models as Implicit Optimizers for Vision
Language Models
Paper
• 2410.06154
• Published • 16
Baichuan-Omni Technical Report
Paper
• 2410.08565
• Published • 88
From Generalist to Specialist: Adapting Vision Language Models via
Task-Specific Visual Instruction Tuning
Paper
• 2410.06456
• Published • 37
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large
Vision-Language Models
Paper
• 2410.07133
• Published • 19
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large
Vision-Language Models
Paper
• 2410.10139
• Published • 51
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality
Documents
Paper
• 2410.10594
• Published • 29
MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
Paper
• 2410.11779
• Published • 26
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions
Paper
• 2410.10816
• Published • 21
Improving Long-Text Alignment for Text-to-Image Diffusion Models
Paper
• 2410.11817
• Published • 15
OMCAT: Omni Context Aware Transformer
Paper
• 2410.12109
• Published • 4
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for
Embodied AI
Paper
• 2410.11623
• Published • 49
HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex
Diagrams in Coding Tasks
Paper
• 2410.12381
• Published • 43
The Curse of Multi-Modalities: Evaluating Hallucinations of Large
Multimodal Models across Language, Visual, and Audio
Paper
• 2410.12787
• Published • 30
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding
and Generation
Paper
• 2410.13848
• Published • 37
Harnessing Webpage UIs for Text-Rich Visual Understanding
Paper
• 2410.13824
• Published • 30
WorldCuisines: A Massive-Scale Benchmark for Multilingual and
Multicultural Visual Question Answering on Global Cuisines
Paper
• 2410.12705
• Published • 32
Fluid: Scaling Autoregressive Text-to-image Generative Models with
Continuous Tokens
Paper
• 2410.13863
• Published • 37
MobA: A Two-Level Agent System for Efficient Mobile Task Automation
Paper
• 2410.13757
• Published • 32
Roadmap towards Superhuman Speech Understanding using Large Language
Models
Paper
• 2410.13268
• Published • 33
Movie Gen: A Cast of Media Foundation Models
Paper
• 2410.13720
• Published • 100
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise
Motion Control
Paper
• 2410.13830
• Published • 26
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language
Models
Paper
• 2410.13085
• Published • 25
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model
Paper
• 2410.13639
• Published • 19
VidPanos: Generative Panoramic Videos from Casual Panning Videos
Paper
• 2410.13832
• Published • 13
Remember, Retrieve and Generate: Understanding Infinite Visual Concepts
as Your Personalized Assistant
Paper
• 2410.13360
• Published • 9
γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large
Language Models
Paper
• 2410.13859
• Published • 8
Can MLLMs Understand the Deep Implication Behind Chinese Images?
Paper
• 2410.13854
• Published • 12
FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion
Model
Paper
• 2410.13925
• Published • 24
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex
Capabilities
Paper
• 2410.11190
• Published • 22
SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation
Paper
• 2410.14745
• Published • 47
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a
Training-Free Memory Tree
Paper
• 2410.16268
• Published • 70
Baichuan Alignment Technical Report
Paper
• 2410.14940
• Published • 51
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper
• 2410.13861
• Published • 56
Toward Guidance-Free AR Visual Generation via Condition Contrastive
Alignment
Paper
• 2410.09347
• Published • 5
AutoTrain: No-code training for state-of-the-art models
Paper
• 2410.15735
• Published • 59
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety
and Style
Paper
• 2410.16184
• Published • 26
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Paper
• 2410.15316
• Published • 12
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid
Visual Redundancy Reduction
Paper
• 2410.17247
• Published • 47
Aligning Large Language Models via Self-Steering Optimization
Paper
• 2410.17131
• Published • 24
Improve Vision Language Model Chain-of-thought Reasoning
Paper
• 2410.16198
• Published • 26
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video
Even in VLMs
Paper
• 2410.16267
• Published • 18
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large
Vision-Language Models
Paper
• 2410.17637
• Published • 35
Can Knowledge Editing Really Correct Hallucinations?
Paper
• 2410.16251
• Published • 55
LOGO -- Long cOntext aliGnment via efficient preference Optimization
Paper
• 2410.18533
• Published • 43
Distill Visual Chart Reasoning Ability from LLMs to MLLMs
Paper
• 2410.18798
• Published • 21
Infinity-MM: Scaling Multimodal Performance with Large-Scale and
High-Quality Instruction Data
Paper
• 2410.18558
• Published • 19
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language
Tuning
Paper
• 2410.17779
• Published • 8
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context
Prompting
Paper
• 2410.17856
• Published • 52
Continuous Speech Synthesis using per-token Latent Diffusion
Paper
• 2410.16048
• Published • 30
Paper
• 2410.21276
• Published • 88
Vision Search Assistant: Empower Vision-Language Models as Multimodal
Search Engines
Paper
• 2410.21220
• Published • 11
CLEAR: Character Unlearning in Textual and Visual Modalities
Paper
• 2410.18057
• Published • 209
Toxicity of the Commons: Curating Open-Source Pre-Training Data
Paper
• 2410.22587
• Published • 11
ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
Paper
• 2410.23287
• Published • 19
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper
• 2410.23218
• Published • 49
Personalization of Large Language Models: A Survey
Paper
• 2411.00027
• Published • 33
Randomized Autoregressive Visual Generation
Paper
• 2411.00776
• Published • 19
Face Anonymization Made Simple
Paper
• 2411.00762
• Published • 9
AndroidLab: Training and Systematic Benchmarking of Android Autonomous
Agents
Paper
• 2410.24024
• Published • 49
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum
Reinforcement Learning
Paper
• 2411.02337
• Published • 36
How Far is Video Generation from World Model: A Physical Law Perspective
Paper
• 2411.02385
• Published • 34
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated
Parameters by Tencent
Paper
• 2411.02265
• Published • 26
Adaptive Caching for Faster Video Generation with Diffusion Transformers
Paper
• 2411.02397
• Published • 23
AutoVFX: Physically Realistic Video Editing from Natural Language
Instructions
Paper
• 2411.02394
• Published • 16
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for
Efficient Robot Execution
Paper
• 2411.02359
• Published • 14
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM
Data Contamination
Paper
• 2411.03823
• Published • 50
Adaptive Length Image Tokenization via Recurrent Allocation
Paper
• 2411.02393
• Published • 13
ReCapture: Generative Video Camera Controls for User-Provided Videos
using Masked Video Fine-Tuning
Paper
• 2411.05003
• Published • 71
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for
Image-to-Video Generation
Paper
• 2411.04709
• Published • 27
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page
Multi-document Understanding
Paper
• 2411.04952
• Published • 29
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale
Haystacks?
Paper
• 2411.05000
• Published • 22
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in
Videos
Paper
• 2411.04923
• Published • 23
Analyzing The Language of Visual Tokens
Paper
• 2411.05001
• Published • 24
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Paper
• 2411.04997
• Published • 39
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned
Vision-Language Models
Paper
• 2411.04097
• Published • 5
OmniEdit: Building Image Editing Generalist Models Through Specialist
Supervision
Paper
• 2411.07199
• Published • 50
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language
Models
Paper
• 2411.07140
• Published • 35
Edify Image: High-Quality Image Generation with Pixel Space Laplacian
Diffusion Models
Paper
• 2411.07126
• Published • 30
Add-it: Training-Free Object Insertion in Images With Pretrained
Diffusion Models
Paper
• 2411.07232
• Published • 68
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified
Multimodal Understanding and Generation
Paper
• 2411.07975
• Published • 32
Autoregressive Models in Vision: A Survey
Paper
• 2411.05902
• Published • 19
MagicQuill: An Intelligent Interactive Image Editing System
Paper
• 2411.09703
• Published • 80
Sharingan: Extract User Action Sequence from Desktop Recordings
Paper
• 2411.08768
• Published • 9
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
Paper
• 2411.10440
• Published • 132
Region-Aware Text-to-Image Generation via Hard Binding and Soft
Refinement
Paper
• 2411.06558
• Published • 36
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer
Use
Paper
• 2411.10323
• Published • 34
Number it: Temporal Grounding Videos like Flipping Manga
Paper
• 2411.10332
• Published • 14
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large
Language Models on Mobile Devices
Paper
• 2411.10640
• Published • 47
Generative World Explorer
Paper
• 2411.11844
• Published • 77
AnimateAnything: Consistent and Controllable Animation for Video
Generation
Paper
• 2411.10836
• Published • 24
SlimLM: An Efficient Small Language Model for On-Device Document
Assistance
Paper
• 2411.09944
• Published • 12
Adaptive Decoding via Latent Preference Optimization
Paper
• 2411.09661
• Published • 10
StableV2V: Stablizing Shape Consistency in Video-to-Video Editing
Paper
• 2411.11045
• Published • 11
RedPajama: an Open Dataset for Training Large Language Models
Paper
• 2411.12372
• Published • 59
SymDPO: Boosting In-Context Learning of Large Multimodal Models with
Symbol Demonstration Direct Preference Optimization
Paper
• 2411.11909
• Published • 22
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
Paper
• 2411.10818
• Published • 26
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text,
and Architectural Enhancements
Paper
• 2411.12044
• Published • 14
Continuous Speculative Decoding for Autoregressive Image Generation
Paper
• 2411.11925
• Published • 16
Enhancing the Reasoning Ability of Multimodal Large Language Models via
Mixed Preference Optimization
Paper
• 2411.10442
• Published • 87
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper
• 2411.14402
• Published • 48
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large
Language Models
Paper
• 2411.14432
• Published • 26
Large Multi-modal Models Can Interpret Features in Large Multi-modal
Models
Paper
• 2411.14982
• Published • 19
O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple
Distillation, Big Progress or Bitter Lesson?
Paper
• 2411.16489
• Published • 46
One Diffusion to Generate Them All
Paper
• 2411.16318
• Published • 28
DreamRunner: Fine-Grained Storytelling Video Generation with
Retrieval-Augmented Motion Adaptation
Paper
• 2411.16657
• Published • 19
Factorized Visual Tokenization and Generation
Paper
• 2411.16681
• Published • 19
TEXGen: a Generative Diffusion Model for Mesh Textures
Paper
• 2411.14740
• Published • 17
ROICtrl: Boosting Instance Control for Visual Generation
Paper
• 2411.17949
• Published • 87
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper
• 2411.17465
• Published • 90
SketchAgent: Language-Driven Sequential Sketch Generation
Paper
• 2411.17673
• Published • 18
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for
Training-Free Acceleration
Paper
• 2411.17686
• Published • 19
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Paper
• 2411.15296
• Published • 21
Large Language Model-Brained GUI Agents: A Survey
Paper
• 2411.18279
• Published • 30
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video
Comprehension with Video-Text Duet Interaction Format
Paper
• 2411.17991
• Published • 5
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Paper
• 2411.18203
• Published • 41
On Domain-Specific Post-Training for Multimodal Large Language Models
Paper
• 2411.19930
• Published • 30
Yi-Lightning Technical Report
Paper
• 2412.01253
• Published • 29
X-Prompt: Towards Universal In-Context Image Generation in
Auto-Regressive Vision Language Foundation Models
Paper
• 2412.01824
• Published • 64
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding
by Video Spatiotemporal Augmentation
Paper
• 2412.00927
• Published • 29
Open-Sora Plan: Open-Source Large Video Generation Model
Paper
• 2412.00131
• Published • 33
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction
with 3D Autonomous Characters
Paper
• 2412.00174
• Published • 23
VisOnlyQA: Large Vision Language Models Still Struggle with Visual
Perception of Geometric Information
Paper
• 2412.00947
• Published • 8
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand
Audio-Visual Information?
Paper
• 2412.02611
• Published • 25
PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper
• 2412.03555
• Published • 136
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and
Generation
Paper
• 2412.03069
• Published • 34
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene
Understanding
Paper
• 2412.00493
• Published • 17
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual
Prompt Instruction Tuning
Paper
• 2412.03565
• Published • 10
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper
• 2412.04467
• Published • 119
Florence-VL: Enhancing Vision-Language Models with Generative Vision
Encoder and Depth-Breadth Fusion
Paper
• 2412.04424
• Published • 63
NVILA: Efficient Frontier Visual Language Models
Paper
• 2412.04468
• Published • 62
Negative Token Merging: Image-based Adversarial Feature Guidance
Paper
• 2412.01339
• Published • 22
Personalized Multimodal Large Language Models: A Survey
Paper
• 2412.02142
• Published • 13
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Paper
• 2412.01169
• Published • 13
p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
Paper
• 2412.04449
• Published • 7
Scaling Inference-Time Search with Vision Value Model for Improved
Visual Comprehension
Paper
• 2412.03704
• Published • 6
Expanding Performance Boundaries of Open-Source Multimodal Models with
Model, Data, and Test-Time Scaling
Paper
• 2412.05271
• Published • 162
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at
Scale
Paper
• 2412.05237
• Published • 46
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
Paper
• 2412.04814
• Published • 46
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step
Diffusion
Paper
• 2412.04301
• Published • 40
CompCap: Improving Multimodal Large Language Models with Composite
Captions
Paper
• 2412.05243
• Published • 21
Mind the Time: Temporally-Controlled Multi-Event Video Generation
Paper
• 2412.05263
• Published • 10
BigDocs: An Open and Permissively-Licensed Dataset for Training
Multimodal Models on Document and Code Tasks
Paper
• 2412.04626
• Published • 15
Training Large Language Models to Reason in a Continuous Latent Space
Paper
• 2412.06769
• Published • 94
Around the World in 80 Timesteps: A Generative Approach to Global Visual
Geolocation
Paper
• 2412.06781
• Published • 24
Maya: An Instruction Finetuned Multilingual Multimodal Model
Paper
• 2412.07112
• Published • 29
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Paper
• 2412.04432
• Published • 16
Exploring Multi-Grained Concept Annotations for Multimodal Large
Language Models
Paper
• 2412.05939
• Published • 15
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for
Customized Manga Generation
Paper
• 2412.07589
• Published • 48
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
Paper
• 2412.03548
• Published • 17
POINTS1.5: Building a Vision-Language Model towards Real World
Applications
Paper
• 2412.08443
• Published • 39
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex
Image-Text Models with Structural Annotations
Paper
• 2412.08580
• Published • 46
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
Paper
• 2412.07147
• Published • 5
StreamChat: Chatting with Streaming Video
Paper
• 2412.08646
• Published • 18
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for
Long-term Streaming Video and Audio Interactions
Paper
• 2412.09596
• Published • 97
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity
Visual Descriptions
Paper
• 2412.08737
• Published • 54
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Paper
• 2412.09501
• Published • 48
Multimodal Latent Language Modeling with Next-Token Diffusion
Paper
• 2412.08635
• Published • 49
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via
Multimodal LLM
Paper
• 2412.09618
• Published • 21
VisionArena: 230K Real World User-VLM Conversations with Preference
Labels
Paper
• 2412.08687
• Published • 13
Arbitrary-steps Image Super-resolution via Diffusion Inversion
Paper
• 2412.09013
• Published • 13
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper
• 2412.10360
• Published • 148
GenEx: Generating an Explorable World
Paper
• 2412.09624
• Published • 98
InstanceCap: Improving Text-to-Video Generation via Instance-aware
Structured Caption
Paper
• 2412.09283
• Published • 19
Multimodal Music Generation with Explicit Bridges and Retrieval
Augmentation
Paper
• 2412.09428
• Published • 7
SynerGen-VL: Towards Synergistic Image Understanding and Generation with
Vision Experts and Token Folding
Paper
• 2412.09604
• Published • 39
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper
• 2412.09871
• Published • 109
BrushEdit: All-In-One Image Inpainting and Editing
Paper
• 2412.10316
• Published • 37
VidTok: A Versatile and Open-Source Video Tokenizer
Paper
• 2412.13061
• Published • 8
Paper
• 2412.13501
• Published • 30
Progressive Multimodal Reasoning via Active Retrieval
Paper
• 2412.14835
• Published • 73
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
Paper
• 2412.14475
• Published • 59
Descriptive Caption Enhancement with Visual Specialists for Multimodal
Perception
Paper
• 2412.14233
• Published • 7
Large Motion Video Autoencoding with Cross-modal Video VAE
Paper
• 2412.17805
• Published • 24
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation
Understanding
Paper
• 2412.17295
• Published • 9
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
Paper
• 2412.15213
• Published • 28
Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
Paper
• 2412.14462
• Published • 15
AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal
Audio-Video Generation
Paper
• 2412.15191
• Published • 5
Parallelized Autoregressive Visual Generation
Paper
• 2412.15119
• Published • 53
Taming Multimodal Joint Training for High-Quality Video-to-Audio
Synthesis
Paper
• 2412.15322
• Published • 20
Sequence Matters: Harnessing Video Models in 3D Super-Resolution
Paper
• 2412.11525
• Published • 11
Diving into Self-Evolving Training for Multimodal Reasoning
Paper
• 2412.17451
• Published • 42
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models
with Flow Matching
Paper
• 2412.17153
• Published • 39
NILE: Internal Consistency Alignment in Large Language Models
Paper
• 2412.16686
• Published • 8
DepthLab: From Partial to Complete
Paper
• 2412.18153
• Published • 36
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D
Scene Understanding
Paper
• 2412.18450
• Published • 37
Fourier Position Embedding: Enhancing Attention's Periodic Extension for
Length Generalization
Paper
• 2412.17739
• Published • 41
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion
Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Paper
• 2412.18597
• Published • 20
How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation
System?
Paper
• 2412.18495
• Published • 9
Video-Panda: Parameter-efficient Alignment for Encoder-free
Video-Language Models
Paper
• 2412.18609
• Published • 17
Bridging the Data Provenance Gap Across Text, Speech and Video
Paper
• 2412.17847
• Published • 13
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via
Collective Monte Carlo Tree Search
Paper
• 2412.18319
• Published • 39
YuLan-Mini: An Open Data-efficient Language Model
Paper
• 2412.17743
• Published • 67
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
Paper
• 2412.18072
• Published • 18
Molar: Multimodal LLMs with Collaborative Filtering Alignment for
Enhanced Sequential Recommendation
Paper
• 2412.18176
• Published • 16
Paper
• 2412.18653
• Published • 87
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive
Survey
Paper
• 2412.18619
• Published • 60
Task Preference Optimization: Improving Multimodal Large Language Models
with Vision Task Alignment
Paper
• 2412.19326
• Published • 18
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
Paper
• 2412.19512
• Published • 9
Explanatory Instructions: Towards Unified Vision Tasks Understanding and
Zero-shot Generalization
Paper
• 2412.18525
• Published • 74
Edicho: Consistent Image Editing in the Wild
Paper
• 2412.21079
• Published • 22
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow
Matching and Clap-Ranked Preference Optimization
Paper
• 2412.21037
• Published • 24
Are Vision-Language Models Truly Understanding Multi-vision Sensor?
Paper
• 2412.20750
• Published • 20
2.5 Years in Class: A Multimodal Textbook for Vision-Language
Pretraining
Paper
• 2501.00958
• Published • 111
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion
Control
Paper
• 2501.01427
• Published • 53
LTX-Video: Realtime Video Latent Diffusion
Paper
• 2501.00103
• Published • 51
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with
Video LLM
Paper
• 2501.00599
• Published • 46
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper
• 2501.00192
• Published • 31
A3: Android Agent Arena for Mobile GUI Agents
Paper
• 2501.01149
• Published • 22
Unifying Specialized Visual Encoders for Video Language Models
Paper
• 2501.01426
• Published • 20
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Paper
• 2501.01957
• Published • 48
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One
Vision Token
Paper
• 2501.03895
• Published • 52
MotionBench: Benchmarking and Improving Fine-grained Video Motion
Understanding for Vision Language Models
Paper
• 2501.02955
• Published • 44
Cosmos World Foundation Model Platform for Physical AI
Paper
• 2501.03575
• Published • 82
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language
Models
Paper
• 2501.03262
• Published • 104
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of
Images and Videos
Paper
• 2501.04001
• Published • 49
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment
across Language with Real-time Self-Aware Emotional Speech Synthesis
Paper
• 2501.04561
• Published • 16
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning
and Reflection
Paper
• 2501.04575
• Published • 25
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Paper
• 2501.05366
• Published • 105
DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich
Paradigm for Direct Preference Optimization
Paper
• 2501.03271
• Published • 10
The GAN is dead; long live the GAN! A Modern GAN Baseline
Paper
• 2501.05441
• Published • 98
Enhancing Human-Like Responses in Large Language Models
Paper
• 2501.05032
• Published • 62
An Empirical Study of Autoregressive Pre-training from Videos
Paper
• 2501.05453
• Published • 41
Centurio: On Drivers of Multilingual Ability of Large Vision-Language
Model
Paper
• 2501.05122
• Published • 19
On Computational Limits and Provably Efficient Criteria of Visual
Autoregressive Models: A Fine-Grained Complexity Analysis
Paper
• 2501.04377
• Published • 14
VideoRAG: Retrieval-Augmented Generation over Video Corpus
Paper
• 2501.05874
• Published • 75
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper
• 2501.06186
• Published • 67
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in
Multimodal Large Language Models
Paper
• 2501.05767
• Published • 29
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video
Understanding?
Paper
• 2501.05510
• Published • 44
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
Paper
• 2501.06282
• Published • 54
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token
Marks
Paper
• 2501.08326
• Published • 34
MatchAnything: Universal Cross-Modality Image Matching with Large-Scale
Pre-Training
Paper
• 2501.07556
• Published • 7
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Paper
• 2501.08828
• Published • 30
RepVideo: Rethinking Cross-Layer Representation for Video Generation
Paper
• 2501.08994
• Published • 15
ReFocus: Visual Editing as a Chain of Thought for Structured Image
Understanding
Paper
• 2501.05452
• Published • 15
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
Paper
• 2501.05707
• Published • 20
VideoAuteur: Towards Long Narrative Video Generation
Paper
• 2501.06173
• Published • 31
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Paper
• 2501.06842
• Published • 16
Evaluating Sample Utility for Data Selection by Mimicking Model Weights
Paper
• 2501.06708
• Published • 5
MiniMax-01: Scaling Foundation Models with Lightning Attention
Paper
• 2501.08313
• Published • 305
Democratizing Text-to-Image Masked Generative Models with Compact
Text-Aware One-Dimensional Tokens
Paper
• 2501.07730
• Published • 18
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
Paper
• 2501.08292
• Published • 17
Tarsier2: Advancing Large Vision-Language Models from Detailed Video
Description to Comprehensive Video Understanding
Paper
• 2501.07888
• Published • 15
OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for
LLM Training
Paper
• 2501.08197
• Published • 9
Parameter-Inverted Image Pyramid Networks for Visual Perception and
Multimodal Understanding
Paper
• 2501.07783
• Published • 8
MINIMA: Modality Invariant Image Matching
Paper
• 2412.19412
• Published • 4
OmniThink: Expanding Knowledge Boundaries in Machine Writing through
Thinking
Paper
• 2501.09751
• Published • 46
Learnings from Scaling Visual Tokenizers for Reconstruction and
Generation
Paper
• 2501.09755
• Published • 35
Do generative video models learn physical principles from watching
videos?
Paper
• 2501.09038
• Published • 34
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Paper
• 2501.09747
• Published • 29
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Paper
• 2501.09781
• Published • 27
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Paper
• 2501.12380
• Published • 83
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Paper
• 2501.11733
• Published • 28
Can We Generate Images with CoT? Let's Verify and Reinforce Image
Generation Step by Step
Paper
• 2501.13926
• Published • 43
Baichuan-Omni-1.5 Technical Report
Paper
• 2501.15368
• Published • 61
Qwen2.5-1M Technical Report
Paper
• 2501.15383
• Published • 73
Towards General-Purpose Model-Free Reinforcement Learning
Paper
• 2501.16142
• Published • 31
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for
Speech Generation
Paper
• 2501.15907
• Published • 18
Are Vision Language Models Texture or Shape Biased and Can We Steer
Them?
Paper
• 2403.09193
• Published • 9
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model
Post-training
Paper
• 2501.17161
• Published • 125
PixelWorld: Towards Perceiving Everything as Pixels
Paper
• 2501.19339
• Published • 17
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human
Animation Models
Paper
• 2502.01061
• Published • 225
Process Reinforcement through Implicit Rewards
Paper
• 2502.01456
• Published • 62
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal
Understanding
Paper
• 2502.01341
• Published • 40
Paper
• 2501.14249
• Published • 77
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video
Understanding
Paper
• 2501.13106
• Published • 92
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Paper
• 2501.12599
• Published • 131
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative
Textual Feedback
Paper
• 2501.12895
• Published • 61
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
Paper
• 2501.12948
• Published • 452
Token Assorted: Mixing Latent and Text Tokens for Improved Language
Model Reasoning
Paper
• 2502.03275
• Published • 18
Analyze Feature Flow to Enhance Interpretation and Steering in Language
Models
Paper
• 2502.03032
• Published • 60
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive
Modality Alignment
Paper
• 2502.04328
• Published • 29
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Paper
• 2502.05173
• Published • 64
Fast Video Generation with Sliding Tile Attention
Paper
• 2502.04507
• Published • 51
Goku: Flow Based Video Generative Foundation Models
Paper
• 2502.04896
• Published • 107
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth
Approach
Paper
• 2502.05171
• Published • 156
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive
Multimodal Understanding and Generation
Paper
• 2502.05178
• Published • 10
On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for
Mobile Devices
Paper
• 2502.04363
• Published • 12
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time
Scaling
Paper
• 2502.06703
• Published • 153
Scaling Pre-training to One Hundred Billion Data for Vision Language
Models
Paper
• 2502.07617
• Published • 29
Expect the Unexpected: FailSafe Long Context QA for Finance
Paper
• 2502.06329
• Published • 133
Magic 1-For-1: Generating One Minute Video Clips within One Minute
Paper
• 2502.07701
• Published • 37
Light-A-Video: Training-free Video Relighting via Progressive Light
Fusion
Paper
• 2502.08590
• Published • 43
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
Paper
• 2502.07870
• Published • 45
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
Paper
• 2502.08047
• Published • 28
TransMLA: Multi-head Latent Attention Is All You Need
Paper
• 2502.07864
• Published • 69
mmE5: Improving Multimodal Multilingual Embeddings via High-quality
Synthetic Data
Paper
• 2502.08468
• Published • 16
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of
Physical Concept Understanding
Paper
• 2502.08946
• Published • 193
Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient
Text-to-Image Generation
Paper
• 2502.08690
• Published • 43
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language
Models for Vision-Driven Embodied Agents
Paper
• 2502.09560
• Published • 35
ZeroBench: An Impossible Visual Benchmark for Contemporary Large
Multimodal Models
Paper
• 2502.09696
• Published • 43
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of
Video Foundation Model
Paper
• 2502.10248
• Published • 57
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
Paper
• 2502.10391
• Published • 34
Large Language Diffusion Models
Paper
• 2502.09992
• Published • 128
Learning Getting-Up Policies for Real-World Humanoid Robots
Paper
• 2502.12152
• Published • 43
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse
Attention
Paper
• 2502.11089
• Published • 170
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on
Continual Pre-Training
Paper
• 2502.11196
• Published • 23
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning
in Diffusion Models
Paper
• 2502.10458
• Published • 38
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and
Generation
Paper
• 2502.12148
• Published • 17
Intuitive physics understanding emerges from self-supervised pretraining
on natural videos
Paper
• 2502.11831
• Published • 20
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Paper
• 2502.11775
• Published • 9
Ask in Any Modality: A Comprehensive Survey on Multimodal
Retrieval-Augmented Generation
Paper
• 2502.08826
• Published • 17
ILIAS: Instance-Level Image retrieval At Scale
Paper
• 2502.11748
• Published • 4
Soundwave: Less is More for Speech-Text Alignment in LLMs
Paper
• 2502.12900
• Published • 86
Continuous Diffusion Model for Language Modeling
Paper
• 2502.11564
• Published • 53
Phantom: Subject-consistent video generation via cross-modal alignment
Paper
• 2502.11079
• Published • 58
Magma: A Foundation Model for Multimodal AI Agents
Paper
• 2502.13130
• Published • 58
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance
Software Engineering?
Paper
• 2502.12115
• Published • 46
Multimodal Mamba: Decoder-only Multimodal State Space Model via
Quadratic to Linear Distillation
Paper
• 2502.13145
• Published • 38
RealSyn: An Effective and Scalable Multimodal Interleaved Document
Transformation Paradigm
Paper
• 2502.12513
• Published • 16
Harnessing Vision Models for Time Series Analysis: A Survey
Paper
• 2502.08869
• Published • 2
Qwen2.5-VL Technical Report
Paper
• 2502.13923
• Published • 219
On the Trustworthiness of Generative Foundation Models: Guideline,
Assessment, and Perspective
Paper
• 2502.14296
• Published • 45
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic
Understanding, Localization, and Dense Features
Paper
• 2502.14786
• Published • 166
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
Paper
• 2502.14502
• Published • 92
LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in
Vision-Language Models
Paper
• 2502.14834
• Published • 24
Does Time Have Its Place? Temporal Heads: Where Language Models Recall
Time-specific Information
Paper
• 2502.14258
• Published • 26
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex
Task Automation on PC
Paper
• 2502.14282
• Published • 29
How to Get Your LLM to Generate Challenging Problems for Evaluation
Paper
• 2502.14678
• Published • 18
Dynamic Concepts Personalization from Single Videos
Paper
• 2502.14844
• Published • 16
Scaling Text-Rich Image Understanding via Code-Guided Synthetic
Multimodal Data Generation
Paper
• 2502.14846
• Published • 16
NAVIG: Natural Language-guided Analysis with Vision Language Models for
Image Geo-localization
Paper
• 2502.14638
• Published • 11
From RAG to Memory: Non-Parametric Continual Learning for Large Language
Models
Paper
• 2502.14802
• Published • 13
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the
Limits of Embedding Space Capacity
Paper
• 2502.13063
• Published • 74
VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit
Matching Visual Cues
Paper
• 2502.12084
• Published • 35
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context
Memory of Transformers
Paper
• 2502.15007
• Published • 175
SurveyX: Academic Survey Automation via Large Language Models
Paper
• 2502.14776
• Published • 100
PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data
Paper
• 2502.14397
• Published • 41
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
Paper
• 2502.17157
• Published • 52
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for
Multimodal Reasoning Models
Paper
• 2502.16033
• Published • 18
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon
Robotic Manipulation
Paper
• 2502.16707
• Published • 14
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Paper
• 2502.18411
• Published • 74
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
Paper
• 2502.18137
• Published • 60
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open
Software Evolution
Paper
• 2502.18449
• Published • 75
KV-Edit: Training-Free Image Editing for Precise Background Preservation
Paper
• 2502.17363
• Published • 37
ART: Anonymous Region Transformer for Variable Multi-Layer Transparent
Image Generation
Paper
• 2502.18364
• Published • 36
K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs
Paper
• 2502.18461
• Published • 17
Introducing Visual Perception Token into Multimodal Large Language Model
Paper
• 2502.17425
• Published • 16
MLLMs Know Where to Look: Training-free Perception of Small Visual
Details with Multimodal LLMs
Paper
• 2502.17422
• Published • 7
LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven
Language Representation
Paper
• 2502.18302
• Published • 5
Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI
Paper
• 2502.17092
• Published • 3
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem
Understanding
Paper
• 2502.19400
• Published • 47
Towards an AI co-scientist
Paper
• 2502.18864
• Published • 53
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
Paper
• 2502.19634
• Published • 62
UniTok: A Unified Tokenizer for Visual Generation and Understanding
Paper
• 2502.20321
• Published • 30
Multimodal Representation Alignment for Image Generation: Text-Image
Interleaved Control Is Easier Than You Think
Paper
• 2502.20172
• Published • 29
HAIC: Improving Human Action Understanding and Generation with Better
Captions for Multi-modal Large Language Models
Paper
• 2502.20811
• Published • 3
Chain of Draft: Thinking Faster by Writing Less
Paper
• 2502.18600
• Published • 50
Tell me why: Visual foundation models as self-explainable classifiers
Paper
• 2502.19577
• Published • 11
SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers
Paper
• 2502.20545
• Published • 22
MIGE: A Unified Framework for Multimodal Instruction-Based Image
Generation and Editing
Paper
• 2502.21291
• Published • 5
Predictive Data Selection: The Data That Predicts Is the Data That
Teaches
Paper
• 2503.00808
• Published • 57
Visual-RFT: Visual Reinforcement Fine-Tuning
Paper
• 2503.01785
• Published • 86
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language
Models via Mixture-of-LoRAs
Paper
• 2503.01743
• Published • 91
Qilin: A Multimodal Information Retrieval Dataset with APP-level User
Sessions
Paper
• 2503.00501
• Published • 12
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open
Language Models
Paper
• 2402.03300
• Published • 145
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in
Multimodal Cycles
Paper
• 2503.03651
• Published • 16
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended
Language Interface
Paper
• 2503.01342
• Published • 8
From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence
Generation up to 100K Tokens
Paper
• 2502.18890
• Published • 30
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from
Inputs
Paper
• 2503.02003
• Published • 48
Process-based Self-Rewarding Language Models
Paper
• 2503.03746
• Published • 39
CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time
Cognitive Task Solving and Reasoning in UAVs
Paper
• 2503.01378
• Published • 5
Token-Efficient Long Video Understanding for Multimodal LLMs
Paper
• 2503.04130
• Published • 97
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Paper
• 2503.04724
• Published • 72
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding
and Expert Reasoning Abilities
Paper
• 2503.03983
• Published • 29
How to Steer LLM Latents for Hallucination Detection?
Paper
• 2503.01917
• Published • 11
The Best of Both Worlds: Integrating Language Models and Diffusion
Models for Video Generation
Paper
• 2503.04606
• Published • 9
Unified Reward Model for Multimodal Understanding and Generation
Paper
• 2503.05236
• Published • 124
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
Paper
• 2503.05132
• Published • 57
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive
Cognitive-Inspired Sketching
Paper
• 2503.05179
• Published • 46
S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following
with Paralinguistic Information
Paper
• 2503.05085
• Published • 47
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with
Reinforcing Learning
Paper
• 2503.05379
• Published • 38
VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play
Context Control
Paper
• 2503.05639
• Published • 27
TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos
via Diffusion Models
Paper
• 2503.05638
• Published • 21
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale
Reinforcement Learning
Paper
• 2503.07365
• Published • 61
Automated Movie Generation via Multi-Agent CoT Planning
Paper
• 2503.07314
• Published • 44
Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue
Learning
Paper
• 2503.07002
• Published • 39
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large
Language Models
Paper
• 2503.06749
• Published • 31
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural
Vision-Language Dataset for Southeast Asia
Paper
• 2503.07920
• Published • 102
MagicInfinite: Generating Infinite Talking Videos with Your Words and
Voice
Paper
• 2503.05978
• Published • 36
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through
Two-Stage Rule-Based RL
Paper
• 2503.07536
• Published • 89
Video Action Differencing
Paper
• 2503.07860
• Published • 33
UniF^2ace: Fine-grained Face Understanding and Generation
with Unified Multimodal Models
Paper
• 2503.08120
• Published • 31
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by
Imitating Human Annotator Trajectories
Paper
• 2503.08625
• Published • 27
Implicit Reasoning in Transformers is Reasoning through Shortcuts
Paper
• 2503.07604
• Published • 23
LightGen: Efficient Image Generation through Knowledge Distillation and
Direct Preference Optimization
Paper
• 2503.08619
• Published • 20
EasyControl: Adding Efficient and Flexible Control for Diffusion
Transformer
Paper
• 2503.07027
• Published • 30
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted
Contrastive Learning
Paper
• 2503.04812
• Published • 17
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
Paper
• 2503.02199
• Published • 8
Seedream 2.0: A Native Chinese-English Bilingual Image Generation
Foundation Model
Paper
• 2503.07703
• Published • 37
Gemini Embedding: Generalizable Embeddings from Gemini
Paper
• 2503.07891
• Published • 48
OmniMamba: Efficient and Unified Multimodal Understanding and Generation
via State Space Models
Paper
• 2503.08686
• Published • 19
CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic
Audiovisual Narrative Processing
Paper
• 2503.06940
• Published • 11
Transformers without Normalization
Paper
• 2503.10622
• Published • 172
Charting and Navigating Hugging Face's Model Atlas
Paper
• 2503.10633
• Published • 94
World Modeling Makes a Better Planner: Dual Preference Optimization for
Embodied Task Planning
Paper
• 2503.10480
• Published • 57
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model
for Visual Generation and Editing
Paper
• 2503.10639
• Published • 53
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Paper
• 2503.10291
• Published • 36
4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large
Language Models
Paper
• 2503.10437
• Published • 34
CoRe^2: Collect, Reflect and Refine to Generate Better and Faster
Paper
• 2503.09662
• Published • 33
OmniPaint: Mastering Object-Oriented Editing via Disentangled
Insertion-Removal Inpainting
Paper
• 2503.08677
• Published • 29
Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and
Beyond
Paper
• 2503.10460
• Published • 30
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
Paper
• 2503.10596
• Published • 18
R1-Onevision: Advancing Generalized Multimodal Reasoning through
Cross-Modal Formalization
Paper
• 2503.10615
• Published • 17
Open-Sora 2.0: Training a Commercial-Level Video Generation Model in
$200k
Paper
• 2503.09642
• Published • 20
PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference
Time by Leveraging Sparsity
Paper
• 2503.07677
• Published • 86
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
Paper
• 2503.11647
• Published • 148
GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories
Generation in End-to-End Autonomous Driving
Paper
• 2503.05689
• Published • 3
SmolDocling: An ultra-compact vision-language model for end-to-end
multi-modal document conversion
Paper
• 2503.11576
• Published • 161
Large-scale Pre-training for Grounded Video Caption Generation
Paper
• 2503.10781
• Published • 16
ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model
with Interleaved Multimodal Generation via Asymmetric Synergy
Paper
• 2503.06542
• Published • 7
DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal
Consistent Video Generation
Paper
• 2503.06053
• Published • 138
Being-0: A Humanoid Robotic Agent with Vision-Language Models and
Modular Skills
Paper
• 2503.12533
• Published • 68
DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale
Text-to-Image Models
Paper
• 2503.12885
• Published • 43
Edit Transfer: Learning Image Editing via Vision In-Context Relations
Paper
• 2503.13327
• Published • 29
BlobCtrl: A Unified and Flexible Framework for Element-level Image
Generation and Editing
Paper
• 2503.13434
• Published • 28
R1-VL: Learning to Reason with Multimodal Large Language Models via
Step-wise Group Relative Policy Optimization
Paper
• 2503.12937
• Published • 30
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Paper
• 2503.12605
• Published • 35
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs
for Knowledge-Intensive Visual Grounding
Paper
• 2503.12797
• Published • 32
Aligning Multimodal LLM with Human Preference: A Survey
Paper
• 2503.14504
• Published • 26
Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal
Control
Paper
• 2503.14492
• Published • 20
TULIP: Towards Unified Language-Image Pretraining
Paper
• 2503.15485
• Published • 49
φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time
Exploration and Exploitation
Paper
• 2503.13288
• Published • 51
Temporal Regularization Makes Your Video Generator Stronger
Paper
• 2503.15417
• Published • 22
VERIFY: A Benchmark of Visual Explanation and Reasoning for
Investigating Multimodal Reasoning Fidelity
Paper
• 2503.11557
• Published • 22
Stop Overthinking: A Survey on Efficient Reasoning for Large Language
Models
Paper
• 2503.16419
• Published • 77
Unleashing Vecset Diffusion Model for Fast Shape Generation
Paper
• 2503.16302
• Published • 43
DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers
Paper
• 2503.14487
• Published • 28
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play
Visual Games with Keyboards and Mouse
Paper
• 2503.16365
• Published • 41
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Paper
• 2503.16418
• Published • 36
Ultra-Resolution Adaptation with Ease
Paper
• 2503.16322
• Published • 13
M3: 3D-Spatial MultiModal Memory
Paper
• 2503.16413
• Published • 15
See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language
Balance to Mitigate Dominant Modality Bias
Paper
• 2503.13834
• Published • 5
Expert Race: A Flexible Routing Strategy for Scaling Diffusion
Transformer with Mixture of Experts
Paper
• 2503.16057
• Published • 15
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Paper
• 2503.14476
• Published • 146
RWKV-7 "Goose" with Expressive Dynamic State Evolution
Paper
• 2503.14456
• Published • 153
Paper
• 2503.14378
• Published • 61
Reinforcement Learning for Reasoning in Small LLMs: What Works and What
Doesn't
Paper
• 2503.16219
• Published • 52
Inside-Out: Hidden Factual Knowledge in LLMs
Paper
• 2503.15299
• Published • 56
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Paper
• 2503.15558
• Published • 50
Where do Large Vision-Language Models Look at when Answering Questions?
Paper
• 2503.13891
• Published • 8
MAPS: A Multi-Agent Framework Based on Big Seven Personality and
Socratic Guidance for Multimodal Scientific Problem Solving
Paper
• 2503.16905
• Published • 54
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning
via Iterative Self-Improvement
Paper
• 2503.17352
• Published • 24
Bridging Continuous and Discrete Tokens for Autoregressive Visual
Generation
Paper
• 2503.16430
• Published • 34
When Preferences Diverge: Aligning Diffusion Models with Minority-Aware
Adaptive DPO
Paper
• 2503.16921
• Published • 6
From Head to Tail: Towards Balanced Representation in Large
Vision-Language Models through Adaptive Data Calibration
Paper
• 2503.12821
• Published • 10
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical
Problems
Paper
• 2503.16549
• Published • 15
Why Do Multi-Agent LLM Systems Fail?
Paper
• 2503.13657
• Published • 49
When Less is Enough: Adaptive Token Reduction for Efficient Image
Representation
Paper
• 2503.16660
• Published • 73
Can Large Vision Language Models Read Maps Like a Human?
Paper
• 2503.14607
• Published • 10
GAEA: A Geolocation Aware Conversational Model
Paper
• 2503.16423
• Published • 6
I Have Covered All the Bases Here: Interpreting Reasoning Features in
Large Language Models via Sparse Autoencoders
Paper
• 2503.18878
• Published • 121
Video-T1: Test-Time Scaling for Video Generation
Paper
• 2503.18942
• Published • 90
SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for
Open Base Models in the Wild
Paper
• 2503.18892
• Published • 31
Aether: Geometric-Aware Unified World Modeling
Paper
• 2503.18945
• Published • 28
Judge Anything: MLLM as a Judge Across Any Modality
Paper
• 2503.17489
• Published • 23
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models
via Vision-Guided Reinforcement Learning
Paper
• 2503.18013
• Published • 20
Mind with Eyes: from Language Reasoning to Multimodal Reasoning
Paper
• 2503.18071
• Published • 3
Exploring Hallucination of Large Multimodal Models in Video
Understanding: Benchmark, Analysis and Mitigation
Paper
• 2503.19622
• Published • 31
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Paper
• 2503.18931
• Published • 31
Long-Context Autoregressive Video Modeling with Next-Frame Prediction
Paper
• 2503.19325
• Published • 73
Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection
with Artifact Explanation
Paper
• 2503.14905
• Published • 20
When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only
Training For Human-Centered Decision Making
Paper
• 2503.16965
• Published • 4
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
Paper
• 2503.19990
• Published • 35
Dita: Scaling Diffusion Transformer for Generalist
Vision-Language-Action Policy
Paper
• 2503.19757
• Published • 51
GenHancer: Imperfect Generative Models are Secretly Strong
Vision-Centric Enhancers
Paper
• 2503.19480
• Published • 16
Qwen2.5-Omni Technical Report
Paper
• 2503.20215
• Published • 173
Wan: Open and Advanced Large-Scale Video Generative Models
Paper
• 2503.20314
• Published • 63
Open Deep Search: Democratizing Search with Open-source Reasoning Agents
Paper
• 2503.20201
• Published • 48
Beyond Words: Advancing Long-Text Image Generation via Multimodal
Autoregressive Models
Paper
• 2503.20198
• Published • 4
Video-R1: Reinforcing Video Reasoning in MLLMs
Paper
• 2503.21776
• Published • 79
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement
Learning
Paper
• 2503.21620
• Published • 62
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for
Embodied Interactive Tasks
Paper
• 2503.21696
• Published • 23
A Survey of Efficient Reasoning for Large Reasoning Models: Language,
Multimodality, and Beyond
Paper
• 2503.21614
• Published • 43
OThink-MR1: Stimulating multimodal generalized reasoning capabilities
via dynamic reinforcement learning
Paper
• 2503.16081
• Published • 29
Your ViT is Secretly an Image Segmentation Model
Paper
• 2503.19108
• Published • 25
On Large Multimodal Models as Open-World Image Classifiers
Paper
• 2503.21851
• Published • 8
TextCrafter: Accurately Rendering Multiple Texts in Complex Visual
Scenes
Paper
• 2503.23461
• Published • 94
Any2Caption: Interpreting Any Condition to Caption for Controllable Video
Generation
Paper
• 2503.24379
• Published • 76
Exploring the Effect of Reinforcement Learning on Video Understanding:
Insights from SEED-Bench-R1
Paper
• 2503.24376
• Published • 38
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal
LLMs on Academic Resources
Paper
• 2504.00595
• Published • 38
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on
Elementary School-Level Reasoning Problems?
Paper
• 2504.00509
• Published • 24
MoCha: Towards Movie-Grade Talking Character Synthesis
Paper
• 2503.23307
• Published • 141
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement
Learning on the Base Model
Paper
• 2503.24290
• Published • 62
Unicorn: Text-Only Data Synthesis for Vision Language Model Training
Paper
• 2503.22655
• Published • 39
TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through
Task Tokenization
Paper
• 2503.19901
• Published • 41
Expanding RL with Verifiable Rewards Across Diverse Domains
Paper
• 2503.23829
• Published • 24
Paper
• 2504.00927
• Published • 56
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming
Video Contexts
Paper
• 2503.22952
• Published • 17
Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
Paper
• 2504.00557
• Published • 15
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Paper
• 2504.00072
• Published • 6
MergeVQ: A Unified Framework for Visual Generation and Representation
with Disentangled Token Merging and Quantization
Paper
• 2504.00999
• Published • 97
Improved Visual-Spatial Reasoning via R1-Zero-Like Training
Paper
• 2504.00883
• Published • 67
DreamActor-M1: Holistic, Expressive and Robust Human Image Animation
with Hybrid Guidance
Paper
• 2504.01724
• Published • 68
AnimeGamer: Infinite Anime Life Simulation with Next Game State
Prediction
Paper
• 2504.01014
• Published • 70
Towards Physically Plausible Video Generation via VLM Planning
Paper
• 2503.23368
• Published • 40
Understanding R1-Zero-Like Training: A Critical Perspective
Paper
• 2503.20783
• Published • 60
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and
Diffusion Refinement
Paper
• 2504.01934
• Published • 22
Articulated Kinematics Distillation from Video Diffusion Models
Paper
• 2504.01204
• Published • 23
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to
Gaussian Noise in Perturbation-based Attacks
Paper
• 2504.01308
• Published • 14
DASH: Detection and Assessment of Systematic Hallucinations of VLMs
Paper
• 2503.23573
• Published • 12
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal
Representations
Paper
• 2503.18817
• Published • 3
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual
Editing
Paper
• 2504.02826
• Published • 68
WikiVideo: Article Generation from Multiple Videos
Paper
• 2504.00939
• Published • 37
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image
Generation
Paper
• 2504.02782
• Published • 57
Inference-Time Scaling for Generalist Reward Modeling
Paper
• 2504.02495
• Published • 58
Rethinking RL Scaling for Vision Language Models: A Transparent,
From-Scratch Framework and Comprehensive Evaluation Scheme
Paper
• 2504.02587
• Published • 32
ShortV: Efficient Multimodal Large Language Models by Freezing Visual
Tokens in Ineffective Layers
Paper
• 2504.00502
• Published • 26
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via
Iterative Instruction Tuning and Reinforcement Learning
Paper
• 2504.02949
• Published • 21
MME-Unify: A Comprehensive Benchmark for Unified Multimodal
Understanding and Generation Models
Paper
• 2504.03641
• Published • 14
Slow-Fast Architecture for Video Multi-Modal Large Language Models
Paper
• 2504.01328
• Published • 7
URECA: Unique Region Caption Anything
Paper
• 2504.05305
• Published • 35
Concept Lancet: Image Editing with Compositional Representation
Transplant
Paper
• 2504.02828
• Published • 16
LiveVQA: Live Visual Knowledge Seeking
Paper
• 2504.05288
• Published • 15
SmolVLM: Redefining small and efficient multimodal models
Paper
• 2504.05299
• Published • 208
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning
(v1)
Paper
• 2504.03151
• Published • 15
Tuning-Free Image Editing with Fidelity and Editability via Unified
Latent Diffusion Model
Paper
• 2504.05594
• Published • 11
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
Paper
• 2504.05599
• Published • 87
Rethinking Reflection in Pre-Training
Paper
• 2504.04022
• Published • 80
Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language
Models for Domain-Generalized Semantic Segmentation
Paper
• 2504.03193
• Published • 4
OmniSVG: A Unified Scalable Vector Graphics Generation Model
Paper
• 2504.06263
• Published • 186
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric
Capabilities in Multimodal Large Language Models
Paper
• 2504.06148
• Published • 13
OmniCaptioner: One Captioner to Rule Them All
Paper
• 2504.07089
• Published • 21
Caption Anything in Video: Fine-grained Object-centric Captioning via
Spatiotemporal Multimodal Prompting
Paper
• 2504.05541
• Published • 16
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement
Fine-Tuning
Paper
• 2504.06958
• Published • 13
Paper
• 2504.07491
• Published • 142
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
Paper
• 2504.07128
• Published • 87
VCR-Bench: A Comprehensive Evaluation Framework for Video
Chain-of-Thought Reasoning
Paper
• 2504.07956
• Published • 46
VisualCloze: A Universal Image Generation Framework via Visual
In-Context Learning
Paper
• 2504.07960
• Published • 50
MM-IFEngine: Towards Multimodal Instruction Following
Paper
• 2504.07957
• Published • 35
Scaling Laws for Native Multimodal Models
Paper
• 2504.07951
• Published • 32
Towards Visual Text Grounding of Multimodal Large Language Model
Paper
• 2504.04974
• Published • 18
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
Paper
• 2504.08685
• Published • 130
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for
Autoregressive Image Generation
Paper
• 2504.08736
• Published • 46
MineWorld: a Real-Time and Open-Source Interactive World Model on
Minecraft
Paper
• 2504.08388
• Published • 43
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Paper
• 2504.07615
• Published • 36
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models
with Reinforcement Learning
Paper
• 2504.08837
• Published • 45
FUSION: Fully Integration of Vision-Language Representations for Deep
Cross-Modal Understanding
Paper
• 2504.09925
• Published • 39
Have we unified image generation and understanding yet? An empirical
study of GPT-4o's image generation ability
Paper
• 2504.08003
• Published • 49
InternVL3: Exploring Advanced Training and Test-Time Recipes for
Open-Source Multimodal Models
Paper
• 2504.10479
• Published • 311
Mavors: Multi-granularity Video Representation for Multimodal Large
Language Model
Paper
• 2504.10068
• Published • 30
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
Paper
• 2504.09641
• Published • 16
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
Paper
• 2504.09130
• Published • 12
Reasoning Models Can Be Effective Without Thinking
Paper
• 2504.09858
• Published • 12
The Scalability of Simplicity: Empirical Analysis of Vision-Language
Learning with a Single Transformer
Paper
• 2504.10462
• Published • 16
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Paper
• 2504.10465
• Published • 27
Generate, but Verify: Reducing Hallucination in Vision-Language Models
with Retrospective Resampling
Paper
• 2504.13169
• Published • 39
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference
Optimization for Large Video Models
Paper
• 2504.13122
• Published • 20
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large
Vision-Language Models
Paper
• 2504.11468
• Published • 30
VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain
Knowledge
Paper
• 2504.10342
• Published • 11
Multimodal Long Video Modeling Based on Temporal Dynamic Context
Paper
• 2504.10443
• Published • 3
Summarization of Multimodal Presentations with Vision-Language Models:
Study of the Effect of Modalities and Structure
Paper
• 2504.10049
• Published • 2
ColorBench: Can VLMs See and Understand the Colorful World? A
Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Paper
• 2504.10514
• Published • 48
Perception Encoder: The best visual embeddings are not at the output of
the network
Paper
• 2504.13181
• Published • 37
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Paper
• 2504.13055
• Published • 19
DMM: Building a Versatile Image Generation Model via Distillation-Based
Model Merging
Paper
• 2504.12364
• Published • 22
PerceptionLM: Open-Access Data and Models for Detailed Visual
Understanding
Paper
• 2504.13180
• Published • 21
Could Thinking Multilingually Empower LLM Reasoning?
Paper
• 2504.11833
• Published • 29
Does Reinforcement Learning Really Incentivize Reasoning Capacity in
LLMs Beyond the Base Model?
Paper
• 2504.13837
• Published • 141
UFO2: The Desktop AgentOS
Paper
• 2504.14603
• Published • 29
LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration
Benchmark
Paper
• 2504.13805
• Published • 11
Vidi: Large Multimodal Models for Video Understanding and Editing
Paper
• 2504.15681
• Published • 14
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Paper
• 2504.16030
• Published • 38
Seeing from Another Perspective: Evaluating Multi-View Understanding in
MLLMs
Paper
• 2504.15280
• Published • 25
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to
Deliberative Reasoners
Paper
• 2504.14239
• Published • 14
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning
in Multimodal LLMs
Paper
• 2504.15415
• Published • 23
Describe Anything: Detailed Localized Image and Video Captioning
Paper
• 2504.16072
• Published • 66
Eagle 2.5: Boosting Long-Context Post-Training for Frontier
Vision-Language Models
Paper
• 2504.15271
• Published • 69
AerialMegaDepth: Learning Aerial-Ground Reconstruction and View
Synthesis
Paper
• 2504.13157
• Published • 20
Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls
for Video Generation
Paper
• 2504.14899
• Published • 20
An LMM for Efficient Video Understanding via Reinforced Compression of
Video Cubes
Paper
• 2504.15270
• Published • 9
BookWorld: From Novels to Interactive Agent Societies for Creative Story
Generation
Paper
• 2504.14538
• Published • 30
Personalized Text-to-Image Generation with Auto-Regressive Models
Paper
• 2504.13162
• Published • 18
From Reflection to Perfection: Scaling Inference-Time Optimization for
Text-to-Image Diffusion Models via Reflection Tuning
Paper
• 2504.16080
• Published • 15
MR. Video: "MapReduce" is the Principle for Long Video Understanding
Paper
• 2504.16082
• Published • 5
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal
Large Language Models
Paper
• 2504.15279
• Published • 78
DreamID: High-Fidelity and Fast diffusion-based Face Swapping via
Triplet ID Group Learning
Paper
• 2504.14509
• Published • 53
Trillion 7B Technical Report
Paper
• 2504.15431
• Published • 38
I-Con: A Unifying Framework for Representation Learning
Paper
• 2504.16929
• Published • 31
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in
Large Language Models
Paper
• 2504.16074
• Published • 36
DreamO: A Unified Framework for Image Customization
Paper
• 2504.16915
• Published • 24
Progressive Language-guided Visual Learning for Multi-Task Visual
Grounding
Paper
• 2504.16145
• Published • 2
Paper2Code: Automating Code Generation from Scientific Papers in Machine
Learning
Paper
• 2504.17192
• Published • 124
Step1X-Edit: A Practical Framework for General Image Editing
Paper
• 2504.17761
• Published • 92
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image
Generation
Paper
• 2504.17502
• Published • 55
Breaking the Modality Barrier: Universal Embedding Learning with
Multimodal LLMs
Paper
• 2504.17432
• Published • 41
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery
Simulation
Paper
• 2504.17207
• Published • 30
Token-Shuffle: Towards High-Resolution Image Generation with
Autoregressive Models
Paper
• 2504.17789
• Published • 23
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
Paper
• 2504.17040
• Published • 13
Boosting Generative Image Modeling via Joint Image-Feature Synthesis
Paper
• 2504.16064
• Published • 15
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming
Videos
Paper
• 2504.17343
• Published • 13
ViSMaP: Unsupervised Hour-long Video Summarisation by Meta-Prompting
Paper
• 2504.15921
• Published • 7
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Paper
• 2504.16656
• Published • 59
Towards Understanding Camera Motions in Any Video
Paper
• 2504.15376
• Published • 157
Can Large Language Models Help Multimodal Language Analysis? MMLA: A
Comprehensive Benchmark
Paper
• 2504.16427
• Published • 18
DC-SAM: In-Context Segment Anything in Images and Videos via Dual
Consistency
Paper
• 2504.12080
• Published • 8
Contrastive Localized Language-Image Pre-Training
Paper
• 2410.02746
• Published • 36
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified
Multiplet Upcycling
Paper
• 2409.19291
• Published • 21
GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning
LLMs
Paper
• 2410.03645
• Published • 3
LLM-Powered GUI Agents in Phone Automation: Surveying Progress and
Prospects
Paper
• 2504.19838
• Published • 23
RepText: Rendering Visual Text via Replicating
Paper
• 2504.19724
• Published • 31
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual
Dependency
Paper
• 2504.18589
• Published • 13
Clinical knowledge in LLMs does not translate to human interactions
Paper
• 2504.18919
• Published • 26
SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning
Paper
• 2504.19162
• Published • 18
MMInference: Accelerating Pre-filling for Long-Context VLMs via
Modality-Aware Permutation Sparse Attention
Paper
• 2504.16083
• Published • 8
NORA: A Small Open-Sourced Generalist Vision Language Action Model for
Embodied Tasks
Paper
• 2504.19854
• Published • 7
Reinforcement Learning for Reasoning in Large Language Models with One
Training Example
Paper
• 2504.20571
• Published • 99
Paper
• 2504.20879
• Published • 72
UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with
Diverse Modalities and Granularities
Paper
• 2504.20734
• Published • 62
YoChameleon: Personalized Vision and Language Generation
Paper
• 2504.20998
• Published • 12
X-Fusion: Introducing New Modality to Frozen Large Language Models
Paper
• 2504.20996
• Published • 13
A Review of 3D Object Detection with Vision-Language Models
Paper
• 2504.18738
• Published • 2
Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language
Models in Math
Paper
• 2504.21233
• Published • 50
100 Days After DeepSeek-R1: A Survey on Replication Studies and More
Directions for Reasoning Language Models
Paper
• 2505.00551
• Published • 36
COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
Paper
• 2504.21850
• Published • 27
ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D
Physics Modeling for Complex Motion and Interaction
Paper
• 2504.21855
• Published • 13
A Survey of Interactive Generative Video
Paper
• 2504.21853
• Published • 46
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level
and Token-level CoT
Paper
• 2505.00703
• Published • 44
PixelHacker: Image Inpainting with Structural and Semantic Consistency
Paper
• 2504.20438
• Published • 44
Improving Editability in Image Generation with Layer-wise Memory
Paper
• 2505.01079
• Published • 29
Voila: Voice-Language Foundation Models for Real-Time Autonomous
Interaction and Voice Role-Play
Paper
• 2505.02707
• Published • 85
RM-R1: Reward Modeling as Reasoning
Paper
• 2505.02387
• Published • 81
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement
Learning
Paper
• 2505.02835
• Published • 28
Ming-Lite-Uni: Advancements in Unified Architecture for Natural
Multimodal Interaction
Paper
• 2505.02471
• Published • 15
SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based
Image Editing
Paper
• 2505.02370
• Published • 14
Agentic Reasoning and Tool Integration for LLMs via Reinforcement
Learning
Paper
• 2505.01441
• Published • 39
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive
Streaming Speech Synthesis
Paper
• 2505.02625
• Published • 23
HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene
Generation
Paper
• 2504.21650
• Published • 16
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement
Fine-Tuning
Paper
• 2505.03318
• Published • 94
OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents
Paper
• 2505.03570
• Published • 8
OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue
Resolution
Paper
• 2505.04606
• Published • 9
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision
Encoders for Multimodal Learning
Paper
• 2505.04601
• Published • 29
Beyond Recognition: Evaluating Visual Perspective Taking in Vision
Language Models
Paper
• 2505.03821
• Published • 24
HunyuanCustom: A Multimodal-Driven Architecture for Customized Video
Generation
Paper
• 2505.04512
• Published • 36
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
Paper
• 2505.04588
• Published • 66
Unified Multimodal Understanding and Generation Models: Advances,
Challenges, and Opportunities
Paper
• 2505.02567
• Published • 82
Scenethesis: A Language and Vision Agentic Framework for 3D Scene
Generation
Paper
• 2505.02836
• Published • 8
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient
Large Speech-Language Model
Paper
• 2505.03739
• Published • 10
Perception, Reason, Think, and Plan: A Survey on Large Multimodal
Reasoning Models
Paper
• 2505.04921
• Published • 187
On Path to Multimodal Generalist: General-Level and General-Bench
Paper
• 2505.04620
• Published • 83
Flow-GRPO: Training Flow Matching Models via Online RL
Paper
• 2505.05470
• Published • 89
FG-CLIP: Fine-Grained Visual and Textual Alignment
Paper
• 2505.05071
• Published • 18
X-Reasoner: Towards Generalizable Reasoning Across Modalities and
Domains
Paper
• 2505.03981
• Published • 15
Vision-Language-Action Models: Concepts, Progress, Applications and
Challenges
Paper
• 2505.04769
• Published • 10
Bielik v3 Small: Technical Report
Paper
• 2505.02550
• Published • 69
Bielik 11B v2 Technical Report
Paper
• 2505.02410
• Published • 55
Seed1.5-VL Technical Report
Paper
• 2505.07062
• Published • 158
Unified Continuous Generative Models
Paper
• 2505.07447
• Published • 42
DanceGRPO: Unleashing GRPO on Visual Generation
Paper
• 2505.07818
• Published • 34
Skywork-VL Reward: An Effective Reward Model for Multimodal
Understanding and Reasoning
Paper
• 2505.07263
• Published • 30
H³DP: Triply-Hierarchical Diffusion Policy for Visuomotor
Learning
Paper
• 2505.07819
• Published • 5
MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills
Paper
• 2505.06176
• Published • 12
DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for
Dynamic Reranking in Retrieval-Augmented Generation
Paper
• 2505.07233
• Published • 8
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable
Speaker Encoder
Paper
• 2505.07916
• Published • 139
Fast Text-to-Audio Generation with Adversarial Post-Training
Paper
• 2505.08175
• Published • 26
Bring Reason to Vision: Understanding Perception and Reasoning through
Model Merging
Paper
• 2505.05464
• Published • 11
Aya Vision: Advancing the Frontier of Multilingual Multimodality
Paper
• 2505.08751
• Published • 14
SkillFormer: Unified Multi-View Video Understanding for Proficiency
Estimation
Paper
• 2505.08665
• Published • 5
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture,
Training and Dataset
Paper
• 2505.09568
• Published • 100
Insights into DeepSeek-V3: Scaling Challenges and Reflections on
Hardware for AI Architectures
Paper
• 2505.09343
• Published • 78
MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal
Mathematical Reasoning
Paper
• 2505.10557
• Published • 51
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
Paper
• 2505.04410
• Published • 44
WavReward: Spoken Dialogue Models With Generalist Reward Evaluators
Paper
• 2505.09558
• Published • 10
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
Paper
• 2505.09439
• Published • 10
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large
Video Language Models
Paper
• 2505.08455
• Published • 5
Understanding and Mitigating Toxicity in Image-Text Pretraining
Datasets: A Case Study on LLaVA
Paper
• 2505.06356
• Published • 3
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large
Reasoning Models
Paper
• 2505.10554
• Published • 119
OpenThinkIMG: Learning to Think with Images via Visual Tool
Reinforcement Learning
Paper
• 2505.08617
• Published • 42
WorldPM: Scaling Human Preference Modeling
Paper
• 2505.10527
• Published • 34
End-to-End Vision Tokenizer Tuning
Paper
• 2505.10562
• Published • 22
Exploring the Deep Fusion of Large Language Models and Diffusion
Transformers for Text-to-Image Synthesis
Paper
• 2505.10046
• Published • 9
AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection
Paper
• 2505.09926
• Published • 6
Paper
• 2505.09388
• Published • 342
MMLongBench: Benchmarking Long-Context Vision-Language Models
Effectively and Thoroughly
Paper
• 2505.10610
• Published • 56
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
Paper
• 2505.11049
• Published • 62
Visual Planning: Let's Think Only with Images
Paper
• 2505.11409
• Published • 57
Simple Semi-supervised Knowledge Distillation from Vision-Language
Models via Dual-Head Optimization
Paper
• 2505.07675
• Published • 21
Chain-of-Model Learning for Language Model
Paper
• 2505.11820
• Published • 121
AdaptThink: Reasoning Models Can Learn When to Think
Paper
• 2505.13417
• Published • 83
Model Merging in Pre-training of Large Language Models
Paper
• 2505.12082
• Published • 40
Through the Looking Glass: Common Sense Consistency Evaluation of Weird
Images
Paper
• 2505.07704
• Published • 29
Faster Video Diffusion with Trainable Sparse Attention
Paper
• 2505.13389
• Published • 38
ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and
Vision-Language Models
Paper
• 2505.13180
• Published • 13
VisionReasoner: Unified Visual Perception and Reasoning via
Reinforcement Learning
Paper
• 2505.12081
• Published • 18
R3: Robust Rubric-Agnostic Reward Models
Paper
• 2505.13388
• Published • 11
Efficient Speech Language Modeling via Energy Distance in Continuous
Latent Space
Paper
• 2505.13181
• Published • 9
A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation
through Low-Rank Clone
Paper
• 2505.12781
• Published • 2
Emerging Properties in Unified Multimodal Pretraining
Paper
• 2505.14683
• Published • 135
Paper
• 2505.14674
• Published • 37
Visual Agentic Reinforcement Fine-Tuning
Paper
• 2505.14246
• Published • 32
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via
Reinforcement Learning to Rank
Paper
• 2505.14460
• Published • 33
Think Only When You Need with Large Hybrid-Reasoning Models
Paper
• 2505.14631
• Published • 20
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with
Reinforcement Learning
Paper
• 2505.14677
• Published • 15
Hunyuan-Game: Industrial-grade Intelligent Game Creation Model
Paper
• 2505.14135
• Published • 16
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
Paper
• 2505.14640
• Published • 16
Two Experts Are All You Need for Steering Thinking: Reinforcing
Cognitive Effort in MoE Reasoning Models Without Additional Training
Paper
• 2505.14681
• Published • 10
Visual Instruction Bottleneck Tuning
Paper
• 2505.13946
• Published • 10
Not All Correct Answers Are Equal: Why Your Distillation Source Matters
Paper
• 2505.14464
• Published • 10
Lessons from Defending Gemini Against Indirect Prompt Injections
Paper
• 2505.14534
• Published • 8
The Hallucination Tax of Reinforcement Finetuning
Paper
• 2505.13988
• Published • 8
Incorporating brain-inspired mechanisms for multimodal learning in
artificial intelligence
Paper
• 2505.10176
• Published • 3
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Paper
• 2505.15277
• Published • 105
MMaDA: Multimodal Large Diffusion Language Models
Paper
• 2505.15809
• Published • 99
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement
Learning
Paper
• 2505.14231
• Published • 53
Diffusion vs. Autoregressive Language Models: A Text Embedding
Perspective
Paper
• 2505.15045
• Published • 56
Vid2World: Crafting Video Diffusion Models to Interactive World Models
Paper
• 2505.14357
• Published • 27
When to Continue Thinking: Adaptive Thinking Mode Switching for
Efficient Reasoning
Paper
• 2505.15400
• Published • 23
lmgame-Bench: How Good are LLMs at Playing Games?
Paper
• 2505.15146
• Published • 20
IA-T2I: Internet-Augmented Text-to-Image Generation
Paper
• 2505.15779
• Published • 14
Deliberation on Priors: Trustworthy Reasoning of Large Language Models
on Knowledge Graphs
Paper
• 2505.15210
• Published • 19
RLVR-World: Training World Models with Reinforcement Learning
Paper
• 2505.13934
• Published • 16
ConvSearch-R1: Enhancing Query Reformulation for Conversational Search
with Reasoning via Reinforcement Learning
Paper
• 2505.15776
• Published • 11
HumaniBench: A Human-Centric Framework for Large Multimodal Models
Evaluation
Paper
• 2505.11454
• Published • 5
QuickVideo: Real-Time Long Video Understanding with System Algorithm
Co-Design
Paper
• 2505.16175
• Published • 42
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Paper
• 2505.16933
• Published • 34
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation
with Reinforcement Learning
Paper
• 2505.17022
• Published • 27
Risk-Averse Reinforcement Learning with Itakura-Saito Loss
Paper
• 2505.16925
• Published • 26
Understanding Generative AI Capabilities in Everyday Image Editing Tasks
Paper
• 2505.16181
• Published • 24
Training-Free Efficient Video Generation via Dynamic Token Carving
Paper
• 2505.16864
• Published • 24
Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel
Decoding
Paper
• 2505.16990
• Published • 22
VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game
Quality Assurance
Paper
• 2505.15952
• Published • 20
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
Paper
• 2505.17018
• Published • 15
Backdoor Cleaning without External Guidance in MLLM Fine-tuning
Paper
• 2505.16916
• Published • 17
WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement
Learning
Paper
• 2505.16421
• Published • 19
LaViDa: A Large Diffusion Language Model for Multimodal Understanding
Paper
• 2505.16839
• Published • 14
GRIT: Teaching MLLMs to Think with Images
Paper
• 2505.15879
• Published • 13
Think or Not? Selective Reasoning via Reinforcement Learning for
Vision-Language Models
Paper
• 2505.16854
• Published • 11
OViP: Online Vision-Language Preference Learning
Paper
• 2505.15963
• Published • 9
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal
Large Language Models
Paper
• 2505.17015
• Published • 9
VLM-R³: Region Recognition, Reasoning, and Refinement for Enhanced
Multimodal Chain-of-Thought
Paper
• 2505.16192
• Published • 11
Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot
Manipulation Datasets
Paper
• 2505.15517
• Published • 7
How Do Large Vision-Language Models See Text in Image? Unveiling the
Distinctive Role of OCR Heads
Paper
• 2505.15865
• Published • 5
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture
Understanding
Paper
• 2505.14462
• Published • 4
One RL to See Them All: Visual Triple Unified Reinforcement Learning
Paper
• 2505.18129
• Published • 63
Teaching with Lies: Curriculum DPO on Synthetic Negatives for
Hallucination Detection
Paper
• 2505.17558
• Published • 15
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large
Language Models
Paper
• 2505.16211
• Published • 18
Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark
Study
Paper
• 2505.15389
• Published • 8
Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal
Large Language Models
Paper
• 2505.18536
• Published • 18
QwenLong-L1: Towards Long-Context Large Reasoning Models with
Reinforcement Learning
Paper
• 2505.17667
• Published • 89
Reasoning Model is Stubborn: Diagnosing Instruction Overriding in
Reasoning Models
Paper
• 2505.17225
• Published • 64
QwenLong-CPRS: Towards ∞-LLMs with Dynamic Context Optimization
Paper
• 2505.18092
• Published • 43
RBench-V: A Primary Assessment for Visual Reasoning Models with
Multi-modal Outputs
Paper
• 2505.16770
• Published • 12
Interactive Post-Training for Vision-Language-Action Models
Paper
• 2505.17016
• Published • 6
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language
Model via Reinforcement Learning
Paper
• 2505.13426
• Published • 13
Error Typing for Smarter Rewards: Improving Process Reward Models with
Error-Aware Hierarchical Supervision
Paper
• 2505.19706
• Published • 3
RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation
via Reinforcement Learning
Paper
• 2505.17540
• Published • 7
Shifting AI Efficiency From Model-Centric to Data-Centric Compression
Paper
• 2505.19147
• Published • 146
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual
Reasoning from Transit Maps
Paper
• 2505.18675
• Published • 28
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System
Collaboration
Paper
• 2505.20256
• Published • 19
REARANK: Reasoning Re-ranking Agent via Reinforcement Learning
Paper
• 2505.20046
• Published • 18
Hard Negative Contrastive Learning for Fine-Grained Geometric
Understanding in Large Multimodal Models
Paper
• 2505.20152
• Published • 11
Interleaved Reasoning for Large Language Models via Reinforcement
Learning
Paper
• 2505.19640
• Published • 15
InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer
Interaction
Paper
• 2505.10887
• Published • 10
STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Paper
• 2505.15804
• Published • 10
Jodi: Unification of Visual Generation and Understanding via Joint
Modeling
Paper
• 2505.19084
• Published • 20
Towards Holistic Evaluation of Large Audio-Language Models: A
Comprehensive Survey
Paper
• 2505.15957
• Published • 3
Seeing is Believing, but How Much? A Comprehensive Analysis of
Verbalized Calibration in Vision-Language Models
Paper
• 2505.20236
• Published • 3
Textual Steering Vectors Can Improve Visual Understanding in Multimodal
Large Language Models
Paper
• 2505.14071
• Published • 1
Paper2Poster: Towards Multimodal Poster Automation from Scientific
Papers
Paper
• 2505.21497
• Published • 110
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Paper
• 2505.21374
• Published • 29
MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in
Video Scenarios
Paper
• 2505.21333
• Published • 38
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
Paper
• 2505.21327
• Published • 83
MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks
Paper
• 2505.16459
• Published • 45
SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics
Reasoning
Paper
• 2505.19099
• Published • 7
Active-O3: Empowering Multimodal Large Language Models with Active
Perception via GRPO
Paper
• 2505.21457
• Published • 16
UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based
Mobile GUI Agents
Paper
• 2505.21496
• Published • 38
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in
Vision-Language Models
Paper
• 2505.21500
• Published • 13
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic
Scientific Workflows
Paper
• 2505.19897
• Published • 104
MLLMs are Deeply Affected by Modality Bias
Paper
• 2505.18657
• Published • 5
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Paper
• 2505.22453
• Published • 46
Advancing Multimodal Reasoning via Reinforcement Learning with Cold
Start
Paper
• 2505.22334
• Published • 36
The Entropy Mechanism of Reinforcement Learning for Reasoning Language
Models
Paper
• 2505.22617
• Published • 132
R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large
Model Token Routing
Paper
• 2505.21600
• Published • 71
Skywork Open Reasoner 1 Technical Report
Paper
• 2505.22312
• Published • 56
Sherlock: Self-Correcting Reasoning in Vision-Language Models
Paper
• 2505.22651
• Published • 47
Fostering Video Reasoning via Next-Event Prediction
Paper
• 2505.22457
• Published • 29
VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich
Information Understanding via Iterative Reasoning with Reinforcement
Learning
Paper
• 2505.22019
• Published • 12
RICO: Improving Accuracy and Completeness in Image Recaptioning via
Visual Reconstruction
Paper
• 2505.22613
• Published • 10
Zero-Shot Vision Encoder Grafting via LLM Surrogates
Paper
• 2505.22664
• Published • 7
MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal
Manga Understanding
Paper
• 2505.20298
• Published • 9
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial
Intelligence
Paper
• 2505.23747
• Published • 69
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Paper
• 2505.23762
• Published • 45
The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in
Learning to Reason
Paper
• 2505.22653
• Published • 43
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC
Videos
Paper
• 2505.23693
• Published • 53
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video
Reasoning?
Paper
• 2505.23359
• Published • 38
To Trust Or Not To Trust Your Vision-Language Model's Prediction
Paper
• 2505.23745
• Published • 4
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of
Pre-trained Multimodal Representation via Text Updates
Paper
• 2505.22943
• Published • 3
FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich
Document Understanding
Paper
• 2505.17330
• Published • 22
HoPE: Hybrid of Position Embedding for Length Generalization in
Vision-Language Models
Paper
• 2505.20444
• Published • 5
cadrille: Multi-modal CAD Reconstruction with Online Reinforcement
Learning
Paper
• 2505.22914
• Published • 39
Are Reasoning Models More Prone to Hallucination?
Paper
• 2505.23646
• Published • 24
Multi-Domain Explainability of Preferences
Paper
• 2505.20088
• Published • 20
REOrdering Patches Improves Vision Models
Paper
• 2505.23751
• Published • 15
Re-ttention: Ultra Sparse Visual Generation via Attention Statistical
Reshape
Paper
• 2505.22918
• Published • 6
Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint
Paper
• 2505.23759
• Published • 5
A Graph Perspective to Probe Structural Patterns of Knowledge in Large
Language Models
Paper
• 2505.19286
• Published • 3
Grounded Reinforcement Learning for Visual Reasoning
Paper
• 2505.23678
• Published • 2
Time Blindness: Why Video-Language Models Can't See What Humans Can?
Paper
• 2505.24867
• Published • 82
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
Paper
• 2505.24863
• Published • 98
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in
Large Language Models
Paper
• 2505.24864
• Published • 146
Large Language Models for Data Synthesis
Paper
• 2505.14752
• Published • 50
Don't Look Only Once: Towards Multimodal Interactive Reasoning with
Selective Visual Revisitation
Paper
• 2505.18842
• Published • 36
ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
Paper
• 2505.24862
• Published • 30
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
Paper
• 2505.24025
• Published • 28
MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement
Learning
Paper
• 2505.24871
• Published • 23
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and
Benchmarking Multimodal LLM Agents
Paper
• 2505.24878
• Published • 23
Vision Language Models are Biased
Paper
• 2505.23941
• Published • 23
More Thinking, Less Seeing? Assessing Amplified Hallucination in
Multimodal Reasoning Models
Paper
• 2505.21523
• Published • 13
Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual
Large Language Models
Paper
• 2505.20873
• Published • 9
ReasonGen-R1: CoT for Autoregressive Image generation models through SFT
and RL
Paper
• 2505.24875
• Published • 10
un²CLIP: Improving CLIP's Visual Detail Capturing Ability via
Inverting unCLIP
Paper
• 2505.24517
• Published • 5
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient
Robotics
Paper
• 2506.01844
• Published • 161
Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion
Models
Paper
• 2506.00996
• Published • 40
Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with
Jigsaw Puzzles
Paper
• 2505.23590
• Published • 25
LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon
Embodied Tasks
Paper
• 2506.00411
• Published • 32
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware
Reinforcement Learning
Paper
• 2506.01713
• Published • 48
EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation
with Large Multimodal Models
Paper
• 2506.01667
• Published • 21
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL
Paper
• 2505.23977
• Published • 10
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision
Geometry Priors
Paper
• 2505.24625
• Published • 9
OmniResponse: Online Multimodal Conversational Response Generation in
Dyadic Interactions
Paper
• 2505.21724
• Published • 5
Aligning VLM Assistants with Personalized Situated Cognition
Paper
• 2506.00930
• Published • 3
MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal
LLMs
Paper
• 2506.01674
• Published • 28
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
Paper
• 2506.02096
• Published • 53
VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in
Multi-Agent Environments
Paper
• 2506.02387
• Published • 58
UniWorld: High-Resolution Semantic Encoders for Unified Visual
Understanding and Generation
Paper
• 2506.03147
• Published • 59
CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning
Capabilities of VLMs
Paper
• 2505.24120
• Published • 50
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for
Vision Language Models
Paper
• 2506.03135
• Published • 40
Visual Embodied Brain: Let Multimodal Large Language Models See, Think,
and Control in Spaces
Paper
• 2506.00123
• Published • 35
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Paper
• 2506.03143
• Published • 54
FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens
Paper
• 2506.03096
• Published • 4
TimeHC-RL: Temporal-aware Hierarchical Cognitive Reinforcement Learning
for Enhancing LLMs' Social Intelligence
Paper
• 2505.24500
• Published • 12
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged
Reinforcement Learning
Paper
• 2506.04207
• Published • 48
Paper
• 2506.03569
• Published • 81
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in
Videos
Paper
• 2506.04141
• Published • 31
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video
Reasoning
Paper
• 2506.03525
• Published • 6
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language
Models for Robotics
Paper
• 2506.04308
• Published • 43
Qwen3 Embedding: Advancing Text Embedding and Reranking Through
Foundation Models
Paper
• 2506.05176
• Published • 83
EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an
Egocentric World?
Paper
• 2506.05287
• Published • 14
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
Paper
• 2506.05344
• Published • 17
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal
Contextual Fusion
Paper
• 2506.01111
• Published • 32
Is Extending Modality The Right Path Towards Omni-Modality?
Paper
• 2506.01872
• Published • 24
Reinforcement Pre-Training
Paper
• 2506.08007
• Published • 265
Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal
Learning
Paper
• 2506.06205
• Published • 30
Image Reconstruction as a Tool for Feature Analysis
Paper
• 2506.07803
• Published • 29
Bootstrapping World Models from Dynamics Models in Multimodal Foundation
Models
Paper
• 2506.06006
• Published • 15
Vision Transformers Don't Need Trained Registers
Paper
• 2506.08010
• Published • 22
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical
Understanding and Reasoning
Paper
• 2506.07044
• Published • 114
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for
Parameter-Efficient Video-Text Retrieval
Paper
• 2506.08887
• Published • 4
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand
Better
Paper
• 2506.09040
• Published • 34
Seedance 1.0: Exploring the Boundaries of Video Generation Models
Paper
• 2506.09113
• Published • 109
MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal
Large Language Models
Paper
• 2506.04688
• Published • 3
Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
Paper
• 2506.06395
• Published • 135
Hidden in plain sight: VLMs overlook their visual representations
Paper
• 2506.08008
• Published • 7
Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math
Reasoning
Paper
• 2506.09736
• Published • 9
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Paper
• 2506.10857
• Published • 30
Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable
Task Experts
Paper
• 2506.10357
• Published • 21
SAFE: Multitask Failure Detection for Vision-Language-Action Models
Paper
• 2506.09937
• Published • 9
Paper
• 2506.10910
• Published • 69
Ming-Omni: A Unified Multimodal Model for Perception and Generation
Paper
• 2506.09344
• Published • 33
VideoDeepResearch: Long Video Understanding With Agentic Tool Using
Paper
• 2506.10821
• Published • 19
UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal
Gaussian Splatting
Paper
• 2506.09952
• Published • 6
Aligned Novel View Image and Geometry Synthesis via Cross-modal
Attention Instillation
Paper
• 2506.11924
• Published • 35
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual
Perception in VLMs
Paper
• 2506.10128
• Published • 22
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning
Attention
Paper
• 2506.13585
• Published • 278
BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning
with Vision-Language Models
Paper
• 2506.07961
• Published • 12
Discrete Diffusion in Large Language and Multimodal Models: A Survey
Paper
• 2506.13759
• Published • 44
Stream-Omni: Simultaneous Multimodal Interactions with Large
Language-Vision-Speech Model
Paper
• 2506.13642
• Published • 28
VGR: Visual Grounded Reasoning
Paper
• 2506.11991
• Published • 21
AR-RAG: Autoregressive Retrieval Augmentation for Image Generation
Paper
• 2506.06962
• Published • 28
DoTA-RAG: Dynamic of Thought Aggregation RAG
Paper
• 2506.12571
• Published • 51
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
Paper
• 2506.13654
• Published • 44
Scientists' First Exam: Probing Cognitive Abilities of MLLM via
Perception, Understanding, and Reasoning
Paper
• 2506.10521
• Published • 74
LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
Paper
• 2506.14429
• Published • 44
EfficientVLA: Training-Free Acceleration and Compression for
Vision-Language-Action Models
Paper
• 2506.10100
• Published • 10
VideoMolmo: Spatio-Temporal Grounding Meets Pointing
Paper
• 2506.05336
• Published • 10
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark
for Financial LLM Evaluation
Paper
• 2506.14028
• Published • 94
Sekai: A Video Dataset towards World Exploration
Paper
• 2506.15675
• Published • 67
ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning
in LLMs
Paper
• 2506.15211
• Published • 39
GenRecal: Generation after Recalibration from Large to Small
Vision-Language Models
Paper
• 2506.15681
• Published • 43
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim
Verification
Paper
• 2506.15569
• Published • 12
FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal
Large Language Models
Paper
• 2506.14824
• Published • 8
CoMemo: LVLMs Need Image Context with Image Memory
Paper
• 2506.06279
• Published • 8
Show-o2: Improved Native Unified Multimodal Models
Paper
• 2506.15564
• Published • 31
Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal
Document Understanding
Paper
• 2506.16035
• Published • 89
PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and
Quantized Attention in Visual Generation Models
Paper
• 2506.16054
• Published • 60
Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with
Hybrid History Condition
Paper
• 2506.17201
• Published • 55
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual
Tokens
Paper
• 2506.17218
• Published • 29
UniFork: Exploring Modality Alignment for Unified Multimodal
Understanding and Generation
Paper
• 2506.17202
• Published • 10
Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with
Production-Ready PBR Material
Paper
• 2506.15442
• Published • 18
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video
Understanding
Paper
• 2506.15745
• Published • 14
MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert
Aggregation
Paper
• 2506.17113
• Published • 5
OmniGen2: Exploration to Advanced Multimodal Generation
Paper
• 2506.18871
• Published • 79
Vision as a Dialect: Unifying Visual Understanding and Generation via
Text-Aligned Representations
Paper
• 2506.18898
• Published • 35
From Intention to Execution: Probing the Generalization Boundaries of
Vision-Language-Action Models
Paper
• 2506.09930
• Published • 8
USAD: Universal Speech and Audio Representation via Distillation
Paper
• 2506.18843
• Published • 13
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality
Debiasing
Paper
• 2506.19848
• Published • 27
Unified Vision-Language-Action Model
Paper
• 2506.19850
• Published • 28
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal
Reasoning
Paper
• 2506.16141
• Published • 27
Phantom-Data: Towards a General Subject-Consistent Video Generation
Dataset
Paper
• 2506.18851
• Published • 29
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image
Generation
Paper
• 2506.18095
• Published • 67
OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
Paper
• 2506.20512
• Published • 49
MMSearch-R1: Incentivizing LMMs to Search
Paper
• 2506.20670
• Published • 65
WorldVLA: Towards Autoregressive Action World Model
Paper
• 2506.21539
• Published • 40
FaSTA*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient
Multi-turn Image Editing
Paper
• 2506.20911
• Published • 41
LLaVA-Scissor: Token Compression with Semantic Connected Components for
Video LLMs
Paper
• 2506.21862
• Published • 36
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
Paper
• 2506.21656
• Published • 16
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Paper
• 2506.22434
• Published • 10
BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
Paper
• 2506.17450
• Published • 64
ShotBench: Expert-Level Cinematic Understanding in Vision-Language
Models
Paper
• 2506.21356
• Published • 22
Audio-FLAN: A Preliminary Release
Paper
• 2502.16584
• Published • 36
Do Vision-Language Models Have Internal World Models? Towards an Atomic
Evaluation
Paper
• 2506.21876
• Published • 28
Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in
Inference-time Scaling?
Paper
• 2506.17417
• Published • 11
Paper
• 2506.23044
• Published • 63
Listener-Rewarded Thinking in VLMs for Image Preferences
Paper
• 2506.22832
• Published • 22
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable
Reinforcement Learning
Paper
• 2507.01006
• Published • 256
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional
Multimodal Embeddings
Paper
• 2506.23115
• Published • 36
MusiXQA: Advancing Visual Music Understanding in Multimodal Large
Language Models
Paper
• 2506.23009
• Published • 11
Kwai Keye-VL Technical Report
Paper
• 2507.01949
• Published • 133
A Survey on Vision-Language-Action Models: An Action Tokenization
Perspective
Paper
• 2507.01925
• Published • 39
LongAnimation: Long Animation Generation with Dynamic Global-Local
Memory
Paper
• 2507.01945
• Published • 75
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and
Future Frontiers
Paper
• 2506.23918
• Published • 90
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation
Models on Standard Computer Vision Tasks
Paper
• 2507.01955
• Published • 36
MemOS: A Memory OS for AI System
Paper
• 2507.03724
• Published • 167
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive
World Knowledge
Paper
• 2507.04447
• Published • 45
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning
Dataset
Paper
• 2507.03483
• Published • 24
ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code
Generation Evaluation
Paper
• 2507.04952
• Published • 11
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and
Visual Documents
Paper
• 2507.04590
• Published • 17
Perception-Aware Policy Optimization for Multimodal Reasoning
Paper
• 2507.06448
• Published • 48
4KAgent: Agentic Any Image to 4K Super-Resolution
Paper
• 2507.07105
• Published • 107
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
Paper
• 2507.07095
• Published • 56
Scaling RL to Long Videos
Paper
• 2507.07966
• Published • 161
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and
Methodology
Paper
• 2507.07999
• Published • 51
PyVision: Agentic Vision with Dynamic Tooling
Paper
• 2507.07998
• Published • 33
Multi-Granular Spatio-Temporal Token Merging for Training-Free
Acceleration of Video LLMs
Paper
• 2507.07990
• Published •