zerozeyi's Collections
VisionLM (updated)
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 25
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 12
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 39
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 20
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
Paper • 2402.05930 • Published • 38
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Paper • 2402.05935 • Published • 15
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling
Paper • 2402.06118 • Published • 13
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
Paper • 2402.07456 • Published • 41
PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
Paper • 2402.07872 • Published • 15
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models
Paper • 2402.07865 • Published • 12
World Model on Million-Length Video And Language With RingAttention
Paper • 2402.08268 • Published • 37
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter
Paper • 2402.10896 • Published • 15
FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models
Paper • 2402.10986 • Published • 77
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Paper • 2402.12226 • Published • 41
CoLLaVO: Crayon Large Language and Vision mOdel
Paper • 2402.11248 • Published • 20
Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning
Paper • 2402.11690 • Published • 8
VideoPrism: A Foundational Visual Encoder for Video Understanding
Paper • 2402.13217 • Published • 23
Video ReCap: Recursive Captioning of Hour-Long Videos
Paper • 2402.13250 • Published • 25
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper • 2402.13232 • Published • 13
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts
Paper • 2402.13220 • Published • 13
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models
Paper • 2402.13577 • Published • 8
PALO: A Polyglot Large Multimodal Model for 5B People
Paper • 2402.14818 • Published • 23
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper • 2402.14289 • Published • 19
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Paper • 2402.17177 • Published • 88
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Paper • 2402.19479 • Published • 32
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
Paper • 2403.01422 • Published • 26
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
Paper • 2403.01487 • Published • 14
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
Paper • 2403.02677 • Published • 16
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
Paper • 2403.02626 • Published • 9
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets
Paper • 2403.03194 • Published • 12
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
Paper • 2403.03003 • Published • 9
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • 2403.09611 • Published • 124
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper • 2403.07508 • Published • 74
Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
Paper • 2403.07750 • Published • 21
DragAnything: Motion Control for Anything using Entity Representation
Paper • 2403.07420 • Published • 13
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Paper • 2403.06764 • Published • 26
VideoMamba: State Space Model for Efficient Video Understanding
Paper • 2403.06977 • Published • 27
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Paper • 2403.05135 • Published • 42
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Paper • 2403.05530 • Published • 61
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper • 2403.05525 • Published • 39
VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models
Paper • 2403.05438 • Published • 18
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer
Paper • 2403.10301 • Published • 52
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Paper • 2403.10517 • Published • 32
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper • 2403.11703 • Published • 16
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Paper • 2403.11481 • Published • 12
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
Paper • 2403.12895 • Published • 31
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
Paper • 2403.12596 • Published • 9
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Paper • 2403.14624 • Published • 51
Can large language models explore in-context?
Paper • 2403.15371 • Published • 32
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Paper • 2403.15377 • Published • 22
SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series
Paper • 2403.15360 • Published • 11
VidLA: Video-Language Alignment at Scale
Paper • 2403.14870 • Published • 12
ViTAR: Vision Transformer with Any Resolution
Paper • 2403.18361 • Published • 52
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper • 2403.18814 • Published • 44
sDPO: Don't Use Your Data All at Once
Paper • 2403.19270 • Published • 40
TextCraftor: Your Text Encoder Can be Image Quality Controller
Paper • 2403.18978 • Published • 13
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models
Paper • 2403.20331 • Published • 14
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Paper • 2404.01197 • Published • 30
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
Paper • 2404.01258 • Published • 10
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Paper • 2404.03413 • Published • 25
LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models
Paper • 2404.03118 • Published • 23
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Paper • 2404.03653 • Published • 33
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Paper • 2404.05719 • Published • 81
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Paper • 2404.05726 • Published • 20
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
Paper • 2404.05674 • Published • 13
Koala: Key frame-conditioned long video-LLM
Paper • 2404.04346 • Published • 5
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Paper • 2404.06512 • Published • 29
Adapting LLaMA Decoder to Vision Transformer
Paper • 2404.06773 • Published • 17
BRAVE: Broadening the visual encoding of vision-language models
Paper • 2404.07204 • Published • 18
Transferable and Principled Efficiency for Open-Vocabulary Segmentation
Paper • 2404.07448 • Published • 11
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Paper • 2404.07973 • Published • 30
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
Paper • 2404.09990 • Published • 12
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
Paper • 2404.09204 • Published • 10
On Speculative Decoding for Multimodal Large Language Models
Paper • 2404.08856 • Published • 13
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
Paper • 2404.12387 • Published • 38
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 24
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Paper • 2404.14239 • Published • 8
A Multimodal Automated Interpretability Agent
Paper • 2404.14394 • Published • 20
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper • 2404.12803 • Published • 29
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Paper • 2404.13013 • Published • 30
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
Paper • 2404.15653 • Published • 26
Editable Image Elements for Controllable Synthesis
Paper • 2404.16029 • Published • 10
MoDE: CLIP Data Experts via Clustering
Paper • 2404.16030 • Published • 12
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
Paper • 2404.16790 • Published • 7
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Paper • 2404.16821 • Published • 55
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
Paper • 2404.16375 • Published • 16
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Paper • 2404.16994 • Published • 35
HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections
Paper • 2404.16845 • Published • 6
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models
Paper • 2404.17672 • Published • 18
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations
Paper • 2404.17521 • Published • 12
Automatic Creative Selection with Cross-Modal Matching
Paper • 2405.00029 • Published • 7
What matters when building vision-language models?
Paper • 2405.02246 • Published • 100
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
Paper • 2405.07990 • Published • 16
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding
Paper • 2405.08344 • Published • 12
Understanding the performance gap between online and offline alignment algorithms
Paper • 2405.08448 • Published • 14
SpeechVerse: A Large-scale Generalizable Audio Language Model
Paper • 2405.08295 • Published • 14
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models
Paper • 2405.08317 • Published • 9
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
Paper • 2405.09215 • Published • 18
LoRA Learns Less and Forgets Less
Paper • 2405.09673 • Published • 87
Many-Shot In-Context Learning in Multimodal Foundation Models
Paper • 2405.09798 • Published • 26
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper • 2405.09818 • Published • 126
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Paper • 2405.10300 • Published • 26
Toon3D: Seeing Cartoons from a New Perspective
Paper • 2405.10320 • Published • 19
Octo: An Open-Source Generalist Robot Policy
Paper • 2405.12213 • Published • 24
Imp: Highly Capable Large Multimodal Models for Mobile Devices
Paper • 2405.12107 • Published • 25
Your Transformer is Secretly Linear
Paper • 2405.12250 • Published • 149
Diffusion for World Modeling: Visual Details Matter in Atari
Paper • 2405.12399 • Published • 27
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
Paper • 2405.14129 • Published • 12
CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers
Paper • 2405.13195 • Published • 9
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 53
Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition
Paper • 2405.15216 • Published • 12
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 86
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 31
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
Paper • 2405.17428 • Published • 17
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Paper • 2405.15738 • Published • 43
Dense Connector for MLLMs
Paper • 2405.13800 • Published • 21
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
Paper • 2405.14598 • Published • 11
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Paper • 2405.20204 • Published • 34
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Paper • 2405.18669 • Published • 11
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Paper • 2405.20340 • Published • 19
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Paper • 2405.21075 • Published • 20
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
Paper • 2406.00888 • Published • 30
Parrot: Multilingual Visual Instruction Tuning
Paper • 2406.02539 • Published • 35
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM
Paper • 2406.02884 • Published • 15
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Paper • 2406.04325 • Published • 72
AgentGym: Evolving Large Language Model-based Agents across Diverse Environments
Paper • 2406.04151 • Published • 17
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
Paper • 2406.01014 • Published • 31
Vript: A Video Is Worth Thousands of Words
Paper • 2406.06040 • Published • 25
An Image is Worth 32 Tokens for Reconstruction and Generation
Paper • 2406.07550 • Published • 55
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Paper • 2406.06911 • Published • 10
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Paper • 2406.07476 • Published • 32
What If We Recaption Billions of Web Images with LLaMA-3?
Paper • 2406.08478 • Published • 39
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Paper • 2406.08407 • Published • 24
Needle In A Multimodal Haystack
Paper • 2406.07230 • Published • 52
mDPO: Conditional Preference Optimization for Multimodal Large Language Models
Paper • 2406.11839 • Published • 37
VideoLLM-online: Online Video Large Language Model for Streaming Video
Paper • 2406.11816 • Published • 22
TroL: Traversal of Layers for Large Language and Vision Models
Paper • 2406.12246 • Published • 34
VoCo-LLaMA: Towards Vision Compression with Large Language Models
Paper • 2406.12275 • Published • 29
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
Paper • 2406.12742 • Published • 14
Adversarial Attacks on Multimodal Agents
Paper • 2406.12814 • Published • 4
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models
Paper • 2406.11230 • Published • 33
Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models
Paper • 2406.12649 • Published • 15
Understanding Hallucinations in Diffusion Models through Mode Interpolation
Paper • 2406.09358 • Published • 4
CMC-Bench: Towards a New Paradigm of Visual Signal Compression
Paper • 2406.09356 • Published • 4
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Paper • 2406.09406 • Published • 13
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Paper • 2406.09403 • Published • 19
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Paper • 2406.09411 • Published • 18
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Paper • 2406.08707 • Published • 15
EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts
Paper • 2406.09162 • Published • 13
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Paper • 2406.08418 • Published • 28
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
Paper • 2406.08451 • Published • 23
Paper • 2406.04127 • Published • 37
NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing
Paper • 2406.06523 • Published • 50
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Paper • 2406.08487 • Published • 11
VCR: Visual Caption Restoration
Paper • 2406.06462 • Published • 10
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Paper • 2406.09415 • Published • 50
OpenVLA: An Open-Source Vision-Language-Action Model
Paper • 2406.09246 • Published • 36
DiTFastAttn: Attention Compression for Diffusion Transformer Models
Paper • 2406.08552 • Published • 23
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion
Paper • 2406.04338 • Published • 34
Hibou: A Family of Foundational Vision Transformers for Pathology
Paper • 2406.05074 • Published • 6
Make It Count: Text-to-Image Generation with an Accurate Number of Objects
Paper • 2406.10210 • Published • 76
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning
Paper • 2406.08973 • Published • 86
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
Paper • 2406.11833 • Published • 61
Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models
Paper • 2406.11831 • Published • 21
From Pixels to Prose: A Large Dataset of Dense Image Captions
Paper • 2406.10328 • Published • 17
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Paper • 2406.14544 • Published • 34
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
Paper • 2406.11069 • Published • 13
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
Paper • 2406.11271 • Published • 20
Paper • 2406.11775 • Published • 8
Unifying Multimodal Retrieval via Document Screenshot Embedding
Paper • 2406.11251 • Published • 9
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing
Paper • 2406.10601 • Published • 65
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
Paper • 2406.14515 • Published • 32
Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models
Paper • 2406.14035 • Published • 12
ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights
Paper • 2406.14596 • Published • 5
Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report
Paper • 2406.11403 • Published • 4
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models
Paper • 2406.16338 • Published • 25
Long Context Transfer from Language to Vision
Paper • 2406.16852 • Published • 32
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper • 2406.16860 • Published • 58
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
Paper • 2406.17770 • Published • 18
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Paper • 2406.15704 • Published • 5
Octo-planner: On-device Language Model for Planner-Action Agents
Paper • 2406.18082 • Published • 47
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Paper • 2406.18521 • Published • 28
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
Paper • 2406.15334 • Published • 8
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
Paper • 2406.17294 • Published • 10
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Paper • 2406.19389 • Published • 52
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Paper • 2406.18629 • Published • 41
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
Paper • 2406.18790 • Published • 33
Simulating Classroom Education with LLM-Empowered Agents
Paper • 2406.19226 • Published • 30
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models
Paper • 2406.10900 • Published • 11
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Paper • 2406.20095 • Published • 17
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Paper • 2406.20076 • Published • 8
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
Paper • 2406.17720 • Published • 7
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
Paper • 2407.01284 • Published • 75
ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning
Paper • 2406.19741 • Published • 59
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
Paper • 2407.00468 • Published • 34
ColPali: Efficient Document Retrieval with Vision Language Models
Paper • 2407.01449 • Published • 42
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
Paper • 2407.00114 • Published • 12
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper • 2407.02477 • Published • 21
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Paper • 2407.03320 • Published • 93
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper • 2407.02392 • Published • 21
Unveiling Encoder-Free Vision-Language Models
Paper • 2406.11832 • Published • 50
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
Paper • 2406.08085 • Published • 13
Granular Privacy Control for Geolocation with Vision Language Models
Paper • 2407.04952 • Published • 4
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
Paper • 2407.06135 • Published • 20
Multi-Object Hallucination in Vision-Language Models
Paper • 2407.06192 • Published • 9
Vision language models are blind
Paper • 2407.06581 • Published • 82
VIMI: Grounding Video Generation through Multi-modal Instruction
Paper • 2407.06304 • Published • 9
Video-to-Audio Generation with Hidden Alignment
Paper • 2407.07464 • Published • 16
Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge
Paper • 2407.03958 • Published • 18
Understanding Visual Feature Reliance through the Lens of Complexity
Paper • 2407.06076 • Published • 5
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Paper • 2407.06723 • Published • 10
PaliGemma: A versatile 3B VLM for transfer
Paper • 2407.07726 • Published • 68
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Paper • 2407.07895 • Published • 40
Do Vision and Language Models Share Concepts? A Vector Space Alignment Study
Paper • 2302.06555 • Published • 9
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
Paper • 2407.08303 • Published • 17
The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
Paper • 2407.08583 • Published • 10
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
Paper • 2407.07053 • Published • 42
E5-V: Universal Embeddings with Multimodal Large Language Models
Paper • 2407.12580 • Published • 39
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Paper • 2407.12679 • Published • 7
AUITestAgent: Automatic Requirements Oriented GUI Function Testing
Paper • 2407.09018 • Published • 5
ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter
Paper • 2407.11298 • Published • 5
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
Paper • 2407.12366 • Published • 4
Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study
Paper • 2406.07057 • Published • 15
EVLM: An Efficient Vision-Language Model for Visual Understanding
Paper • 2407.14177 • Published • 42
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
Paper • 2407.12594 • Published • 19
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Paper • 2407.15841 • Published • 40
VideoGameBunny: Towards vision assistants for video games
Paper • 2407.15295 • Published • 21
CGB-DM: Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model
Paper • 2407.15233 • Published • 6
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person
Paper • 2407.16224 • Published • 27
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence
Paper • 2407.16655 • Published • 29
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
Paper • 2407.16198 • Published • 13
VILA^2: VILA Augmented VILA
Paper • 2407.17453 • Published • 39
Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning
Paper • 2407.15815 • Published • 13
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
Paper • 2407.17490 • Published • 30
Efficient Inference of Vision Instruction-Following Models with Elastic Cache
Paper • 2407.18121 • Published • 16
VSSD: Vision Mamba with Non-Causal State Space Duality
Paper • 2407.18559 • Published • 19
Wolf: Captioning Everything with a World Summarization Framework
Paper • 2407.18908 • Published • 31
Diffusion Feedback Helps CLIP See Better
Paper • 2407.20171 • Published • 36
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks
Paper • 2407.19795 • Published • 11
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Paper • 2407.19985 • Published • 36
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Paper • 2407.21770 • Published • 22
Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent
Paper • 2407.21646 • Published • 18
ShieldGemma: Generative AI Content Moderation Based on Gemma
Paper • 2407.21772 • Published • 14
Open-Vocabulary Audio-Visual Semantic Segmentation
Paper • 2407.21721 • Published • 8
SAM 2: Segment Anything in Images and Videos
Paper • 2408.00714 • Published • 109
OmniParser for Pure Vision Based GUI Agent
Paper • 2408.00203 • Published • 24
Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey
Paper • 2407.21794 • Published • 5
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper • 2408.01800 • Published • 79
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Paper • 2408.02657 • Published • 33
Language Model Can Listen While Speaking
Paper • 2408.02622 • Published • 37
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
Paper • 2408.02210 • Published • 7
Operationalizing Contextual Integrity in Privacy-Conscious Assistants
Paper • 2408.02373 • Published • 4
LLaVA-OneVision: Easy Visual Task Transfer
Paper • 2408.03326 • Published • 59
Diffusion Models as Data Mining Tools
Paper • 2408.02752 • Published • 13
AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation
Paper • 2408.01708 • Published • 3
Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks
Paper • 2408.03615 • Published • 30
Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling
Paper • 2408.03695 • Published • 12
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond
Paper • 2408.03900 • Published • 9
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches
Paper • 2408.04567 • Published • 24
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
Paper • 2408.04594 • Published • 14
Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics
Paper • 2408.04631 • Published • 8
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper • 2408.05211 • Published • 47
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Paper • 2408.04840 • Published • 32
UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
Paper • 2408.04810 • Published • 22
ControlNeXt: Powerful and Efficient Control for Image and Video Generation
Paper • 2408.06070 • Published • 53
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Paper • 2408.06327 • Published • 16
UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization
Paper • 2408.05939 • Published • 13
Paper • 2408.07009 • Published • 61
Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models
Paper • 2408.06663 • Published • 15
Paper • 2408.05366 • Published • 11
Towards flexible perception with visual memory
Paper • 2408.08172 • Published • 20
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper • 2408.08872 • Published • 98
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations
Paper • 2408.08459 • Published • 45
D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning
Paper • 2408.08441 • Published • 7
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper • 2408.10188 • Published • 51
MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning
Paper • 2408.11001 • Published • 11
Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data
Paper • 2408.10119 • Published • 16
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Paper • 2408.11039 • Published • 58
NeCo: Improving DINOv2's spatial representations in 19 GPU hours with Patch Neighbor Consistency
Paper • 2408.11054 • Published • 12
Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model
Paper • 2408.10764 • Published • 8
Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos
Paper • 2408.10998 • Published • 8
MambaEVT: Event Stream based Visual Object Tracking using State Space Model
Paper • 2408.10487 • Published • 6
FocusLLM: Scaling LLM's Context by Parallel Decoding
Paper • 2408.11745 • Published • 23
TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models
Paper • 2408.11318 • Published • 55
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
Paper • 2408.11817 • Published • 8
FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting
Paper • 2408.11706 • Published • 6
TrackGo: A Flexible and Efficient Method for Controllable Video Generation
Paper • 2408.11475 • Published • 17
Out-of-Distribution Detection with Attention Head Masking for Multimodal Document Classification
Paper • 2408.11237 • Published • 5
Iterative Object Count Optimization for Text-to-image Diffusion Models
Paper • 2408.11721 • Published • 5
Sapiens: Foundation for Human Vision Models
Paper • 2408.12569 • Published • 89
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Paper • 2408.12528 • Published • 50
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
Paper • 2408.11878 • Published • 52
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
Paper • 2408.12590 • Published • 35
Scalable Autoregressive Image Generation with Mamba
Paper • 2408.12245 • Published • 25
Real-Time Video Generation with Pyramid Attention Broadcast
Paper • 2408.12588 • Published • 15
SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models
Paper • 2408.12114 • Published • 12
Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation
Paper • 2408.09787 • Published • 7
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 123
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Paper • 2408.13257 • Published • 25
CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
Paper • 2408.13239 • Published • 11
Foundation Models for Music: A Survey
Paper • 2408.14340 • Published • 43
LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!
Paper • 2408.13402 • Published • 18
TVG: A Training-free Transition Video Generation Method with Diffusion Models
Paper • 2408.13413 • Published • 14
BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline
Paper • 2408.15079 • Published • 52
Law of Vision Representation in MLLMs
Paper • 2408.16357 • Published • 92
CogVLM2: Visual Language Models for Image and Video Understanding
Paper • 2408.16500 • Published • 56
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Paper • 2408.16532 • Published • 47
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Paper • 2408.16725 • Published • 52
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters
Paper • 2408.17253 • Published • 37
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
Paper • 2408.09174 • Published • 51
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges
Paper • 2409.01071 • Published • 27
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
Paper • 2409.02095 • Published • 35
LinFusion: 1 GPU, 1 Minute, 16K Image
Paper • 2409.02097 • Published • 32
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Paper • 2409.02889 • Published • 55
Attention Heads of Large Language Models: A Survey
Paper • 2409.03752 • Published • 88
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation
Paper • 2409.04410 • Published • 23
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Paper • 2409.05840 • Published • 45
Towards a Unified View of Preference Learning for Large Language Models: A Survey
Paper • 2409.02795 • Published • 71
POINTS: Improving Your Vision-language Model with Affordable Strategies
Paper • 2409.04828 • Published • 22
Benchmarking Chinese Knowledge Rectification in Large Language Models
Paper • 2409.05806 • Published • 13
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Paper • 2409.06666 • Published • 55
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
Paper • 2409.06135 • Published • 14
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Paper • 2409.06820 • Published • 63
MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis
Paper • 2409.07129 • Published • 6
PiTe: Pixel-Temporal Alignment for Large Video-Language Model
Paper • 2409.07239 • Published • 11
Ferret: Federated Full-Parameter Tuning at Scale for Large Language Models
Paper • 2409.06277 • Published • 14
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
Paper • 2409.09269 • Published • 7
One missing piece in Vision and Language: A Survey on Comics Understanding
Paper • 2409.09502 • Published • 23
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 72
OmniGen: Unified Image Generation
Paper • 2409.11340 • Published • 108
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
Paper • 2409.11355 • Published • 28
OSV: One Step is Enough for High-Quality Image to Video Generation
Paper • 2409.11367 • Published • 13
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
Paper • 2409.03420 • Published • 26
InstantDrag: Improving Interactivity in Drag-based Image Editing
Paper • 2409.08857 • Published • 31
AudioBERT: Audio Knowledge Augmented Language Model
Paper • 2409.08199 • Published • 4
LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study
Paper • 2409.08554 • Published • 3
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper • 2409.12191 • Published • 74
Qwen2.5-Coder Technical Report
Paper • 2409.12186 • Published • 138
Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey
Paper • 2409.11564 • Published • 19
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
Paper • 2409.12139 • Published • 12
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
Paper • 2409.12961 • Published • 24
StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation
Paper • 2409.12576 • Published • 15
Imagine yourself: Tuning-Free Personalized Image Generation
Paper • 2409.13346 • Published • 68
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models
Paper • 2409.13592 • Published • 48
Portrait Video Editing Empowered by Multimodal Generative Priors
Paper • 2409.13591 • Published • 15
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
Paper • 2409.15278 • Published • 22
Phantom of Latent for Large Language and Vision Models
Paper • 2409.14713 • Published • 27
Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror Reflections
Paper • 2409.14677 • Published • 14
MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling
Paper • 2409.16160 • Published • 32
MonoFormer: One Transformer for Both Diffusion and Autoregression
Paper • 2409.16280 • Published • 17
Seeing Faces in Things: A Model and Dataset for Pareidolia
Paper • 2409.16143 • Published • 15
Attention Prompting on Image for Large Vision-Language Models
Paper • 2409.17143 • Published • 7
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 104
MIO: A Foundation Model on Multimodal Tokens
Paper • 2409.17692 • Published • 52
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Paper • 2409.20566 • Published • 52
Visual Question Decomposition on Multimodal Large Language Models
Paper • 2409.19339 • Published • 7
Loong: Generating Minute-level Long Videos with Autoregressive Language Models
Paper • 2410.02757 • Published • 36
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
Paper • 2410.02740 • Published • 52
LLaVA-Critic: Learning to Evaluate Multimodal Models
Paper • 2410.02712 • Published • 35
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
Paper • 2410.02762 • Published • 9
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos
Paper • 2410.02763 • Published • 7
Addition is All You Need for Energy-efficient Language Models
Paper • 2410.00907 • Published • 144
VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide
Paper • 2410.04364 • Published • 28
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
Paper • 2410.05243 • Published • 17
UniMuMo: Unified Text, Music and Motion Generation
Paper • 2410.04534 • Published • 18
TLDR: Token-Level Detective Reward Model for Large Vision Language Models
Paper • 2410.04734 • Published • 16
OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction
Paper • 2410.04932 • Published • 9
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation
Paper • 2410.01912 • Published • 13
ControlAR: Controllable Image Generation with Autoregressive Models
Paper • 2410.02705 • Published • 9
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
Paper • 2410.03290 • Published • 6
Aria: An Open Multimodal Native Mixture-of-Experts Model
Paper • 2410.05993 • Published • 107
Personalized Visual Instruction Tuning
Paper • 2410.07113 • Published • 69
Paper • 2410.07073 • Published • 62
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
Paper • 2410.07171 • Published • 41
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
Paper • 2410.07167 • Published • 37
Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning
Paper • 2410.06373 • Published • 35
Pyramidal Flow Matching for Efficient Video Generative Modeling
Paper • 2410.05954 • Published • 38
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
Paper • 2410.05363 • Published • 44
Story-Adapter: A Training-free Iterative Framework for Long Story Visualization
Paper • 2410.06244 • Published • 19
MM-Ego: Towards Building Egocentric Multimodal LLMs
Paper • 2410.07177 • Published • 20
TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation
Paper • 2410.05591 • Published • 13
Temporal Reasoning Transfer from Text to Video
Paper • 2410.06166 • Published • 12
MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents
Paper • 2410.03450 • Published • 36
Intriguing Properties of Large Language and Vision Models
Paper • 2410.04751 • Published • 16
Progressive Autoregressive Video Diffusion Models
Paper • 2410.08151 • Published • 15
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
Paper • 2410.05210 • Published • 10
Self-Boosting Large Language Models with Synthetic Preference Data
Paper • 2410.06961 • Published • 15
WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents
Paper • 2410.07484 • Published • 48
Agent S: An Open Agentic Framework that Uses Computers Like a Human
Paper • 2410.08164 • Published • 24
GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
Paper • 2410.06154 • Published • 16
Baichuan-Omni Technical Report
Paper • 2410.08565 • Published • 84
From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning
Paper • 2410.06456 • Published • 35
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
Paper • 2410.07133 • Published • 18
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
Paper • 2410.10139 • Published • 51
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Paper • 2410.10594 • Published • 24
MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
Paper • 2410.11779 • Published • 24
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions
Paper • 2410.10816 • Published • 19
Improving Long-Text Alignment for Text-to-Image Diffusion Models
Paper • 2410.11817 • Published • 14
OMCAT: Omni Context Aware Transformer
Paper • 2410.12109 • Published • 4
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI
Paper • 2410.11623 • Published • 46
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks
Paper • 2410.12381 • Published • 42
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Paper • 2410.12787 • Published • 30
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Paper • 2410.13848 • Published • 30
Harnessing Webpage UIs for Text-Rich Visual Understanding
Paper • 2410.13824 • Published • 29
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Paper • 2410.12705 • Published • 29
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens
Paper • 2410.13863 • Published • 36
MobA: A Two-Level Agent System for Efficient Mobile Task Automation
Paper • 2410.13757 • Published • 31
Roadmap towards Superhuman Speech Understanding using Large Language Models
Paper • 2410.13268 • Published • 33
Movie Gen: A Cast of Media Foundation Models
Paper • 2410.13720 • Published • 89
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control
Paper • 2410.13830 • Published • 23
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
Paper • 2410.13085 • Published • 20
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model
Paper • 2410.13639 • Published • 16
VidPanos: Generative Panoramic Videos from Casual Panning Videos
Paper • 2410.13832 • Published • 12
Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant
Paper • 2410.13360 • Published • 8
γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
Paper • 2410.13859 • Published • 7
Can MLLMs Understand the Deep Implication Behind Chinese Images?
Paper • 2410.13854 • Published • 10
FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model
Paper • 2410.13925 • Published • 22
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
Paper • 2410.11190 • Published • 20
SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation
Paper • 2410.14745 • Published • 45
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
Paper • 2410.16268 • Published • 65
Baichuan Alignment Technical Report
Paper • 2410.14940 • Published • 49
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper • 2410.13861 • Published • 52
Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment
Paper • 2410.09347 • Published • 4
AutoTrain: No-code training for state-of-the-art models
Paper • 2410.15735 • Published • 58
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Paper • 2410.16184 • Published • 23
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Paper • 2410.15316 • Published • 10
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Paper • 2410.17247 • Published • 45
Aligning Large Language Models via Self-Steering Optimization
Paper • 2410.17131 • Published • 21
Improve Vision Language Model Chain-of-thought Reasoning
Paper • 2410.16198 • Published • 22
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Paper • 2410.16267 • Published • 17
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Paper • 2410.17637 • Published • 34
Can Knowledge Editing Really Correct Hallucinations?
Paper • 2410.16251 • Published • 54
LOGO -- Long cOntext aliGnment via efficient preference Optimization
Paper • 2410.18533 • Published • 42
Distill Visual Chart Reasoning Ability from LLMs to MLLMs
Paper • 2410.18798 • Published • 19
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
Paper • 2410.18558 • Published • 18
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
Paper • 2410.17779 • Published • 7
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting
Paper • 2410.17856 • Published • 49
Continuous Speech Synthesis using per-token Latent Diffusion
Paper • 2410.16048 • Published • 29
Paper • 2410.21276 • Published • 82
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
Paper • 2410.21220 • Published • 10
CLEAR: Character Unlearning in Textual and Visual Modalities
Paper • 2410.18057 • Published • 200
Toxicity of the Commons: Curating Open-Source Pre-Training Data
Paper • 2410.22587 • Published • 9
ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
Paper • 2410.23287 • Published • 19
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper • 2410.23218 • Published • 46
Personalization of Large Language Models: A Survey
Paper • 2411.00027 • Published • 31
Randomized Autoregressive Visual Generation
Paper • 2411.00776 • Published • 17
Face Anonymization Made Simple
Paper • 2411.00762 • Published • 7
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
Paper • 2410.24024 • Published • 48
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
Paper • 2411.02337 • Published • 35
How Far is Video Generation from World Model: A Physical Law Perspective
Paper • 2411.02385 • Published • 33
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
Paper • 2411.02265 • Published • 24
Adaptive Caching for Faster Video Generation with Diffusion Transformers
Paper • 2411.02397 • Published • 23
AutoVFX: Physically Realistic Video Editing from Natural Language Instructions
Paper • 2411.02394 • Published • 17
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
Paper • 2411.02359 • Published • 12
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
Paper • 2411.03823 • Published • 43
Adaptive Length Image Tokenization via Recurrent Allocation
Paper • 2411.02393 • Published • 12
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
Paper • 2411.05003 • Published • 70
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
Paper • 2411.04709 • Published • 25
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding
Paper • 2411.04952 • Published • 28
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?
Paper • 2411.05000 • Published • 21
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Paper • 2411.04923 • Published • 20
Analyzing The Language of Visual Tokens
Paper • 2411.05001 • Published • 22
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Paper • 2411.04997 • Published • 37
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models
Paper • 2411.04097 • Published • 5
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
Paper • 2411.07199 • Published • 45
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
Paper • 2411.07140 • Published • 33
Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models
Paper • 2411.07126 • Published • 28
Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
Paper • 2411.07232 • Published • 62
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
Paper • 2411.07975 • Published • 27
Autoregressive Models in Vision: A Survey
Paper • 2411.05902 • Published • 16
MagicQuill: An Intelligent Interactive Image Editing System
Paper • 2411.09703 • Published • 57
Sharingan: Extract User Action Sequence from Desktop Recordings
Paper • 2411.08768 • Published • 10
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper • 2411.10440 • Published • 109
Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
Paper • 2411.06558 • Published • 34
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
Paper • 2411.10323 • Published • 31
Number it: Temporal Grounding Videos like Flipping Manga
Paper • 2411.10332 • Published • 13
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
Paper • 2411.10640 • Published • 44
Generative World Explorer
Paper • 2411.11844 • Published • 74
AnimateAnything: Consistent and Controllable Animation for Video Generation
Paper • 2411.10836 • Published • 23
SlimLM: An Efficient Small Language Model for On-Device Document Assistance
Paper • 2411.09944 • Published • 12
Adaptive Decoding via Latent Preference Optimization
Paper • 2411.09661 • Published • 10
StableV2V: Stablizing Shape Consistency in Video-to-Video Editing
Paper • 2411.11045 • Published • 11
RedPajama: an Open Dataset for Training Large Language Models
Paper • 2411.12372 • Published • 47
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
Paper • 2411.11909 • Published • 20
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
Paper • 2411.10818 • Published • 24
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
Paper • 2411.12044 • Published • 13
Continuous Speculative Decoding for Autoregressive Image Generation
Paper • 2411.11925 • Published • 15
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Paper • 2411.10442 • Published • 63
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper • 2411.14402 • Published • 41
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Paper • 2411.14432 • Published • 20
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
Paper • 2411.14982 • Published • 15
O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
Paper • 2411.16489 • Published • 40
One Diffusion to Generate Them All
Paper • 2411.16318 • Published • 26
DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
Paper • 2411.16657 • Published • 17
Factorized Visual Tokenization and Generation
Paper • 2411.16681 • Published • 17
TEXGen: a Generative Diffusion Model for Mesh Textures
Paper • 2411.14740 • Published • 15
ROICtrl: Boosting Instance Control for Visual Generation
Paper • 2411.17949 • Published • 82
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper • 2411.17465 • Published • 76
SketchAgent: Language-Driven Sequential Sketch Generation
Paper • 2411.17673 • Published • 18
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
Paper • 2411.17686 • Published • 18
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Paper • 2411.15296 • Published • 19
Large Language Model-Brained GUI Agents: A Survey
Paper • 2411.18279 • Published • 27
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
Paper • 2411.17991 • Published • 5
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Paper • 2411.18203 • Published • 30
On Domain-Specific Post-Training for Multimodal Large Language Models
Paper • 2411.19930 • Published • 24
Yi-Lightning Technical Report
Paper • 2412.01253 • Published • 23
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
Paper • 2412.01824 • Published • 65
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
Paper • 2412.00927 • Published • 26
Open-Sora Plan: Open-Source Large Video Generation Model
Paper • 2412.00131 • Published • 32
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
Paper • 2412.00174 • Published • 22
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information
Paper • 2412.00947 • Published • 7
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Paper • 2412.02611 • Published • 22
PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper • 2412.03555 • Published • 118
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
Paper • 2412.03069 • Published • 30
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
Paper • 2412.00493 • Published • 16
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
Paper • 2412.03565 • Published • 11
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper • 2412.04467 • Published • 103
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Paper • 2412.04424 • Published • 54
NVILA: Efficient Frontier Visual Language Models
Paper • 2412.04468 • Published • 53
Negative Token Merging: Image-based Adversarial Feature Guidance
Paper • 2412.01339 • Published • 21
Personalized Multimodal Large Language Models: A Survey
Paper • 2412.02142 • Published • 12
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Paper • 2412.01169 • Published • 10
p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
Paper • 2412.04449 • Published • 6
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
Paper • 2412.03704 • Published • 6
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
Paper • 2412.05271 • Published • 115
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Paper • 2412.05237 • Published • 44
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
Paper • 2412.04814 • Published • 44
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
Paper • 2412.04301 • Published • 32
CompCap: Improving Multimodal Large Language Models with Composite Captions
Paper • 2412.05243 • Published • 18
Mind the Time: Temporally-Controlled Multi-Event Video Generation
Paper • 2412.05263 • Published • 10
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Paper • 2412.04626 • Published • 10
Training Large Language Models to Reason in a Continuous Latent Space
Paper • 2412.06769 • Published • 56
Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation
Paper • 2412.06781 • Published • 18
Maya: An Instruction Finetuned Multilingual Multimodal Model
Paper • 2412.07112 • Published • 24
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Paper • 2412.04432 • Published • 13
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
Paper • 2412.05939 • Published • 12
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
Paper • 2412.07589 • Published • 45
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
Paper • 2412.03548 • Published • 16
POINTS1.5: Building a Vision-Language Model towards Real World Applications
Paper • 2412.08443 • Published • 38
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
Paper • 2412.08580 • Published • 43
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
Paper • 2412.07147 • Published • 5
StreamChat: Chatting with Streaming Video
Paper • 2412.08646 • Published • 17
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Paper • 2412.09596 • Published • 89
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
Paper • 2412.08737 • Published • 51
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Paper • 2412.09501 • Published • 43
Multimodal Latent Language Modeling with Next-Token Diffusion
Paper • 2412.08635 • Published • 39
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
Paper • 2412.09618 • Published • 21
VisionArena: 230K Real World User-VLM Conversations with Preference Labels
Paper • 2412.08687 • Published • 11
Arbitrary-steps Image Super-resolution via Diffusion Inversion
Paper • 2412.09013 • Published • 10
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper • 2412.10360 • Published • 130
GenEx: Generating an Explorable World
Paper • 2412.09624 • Published • 83
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
Paper • 2412.09283 • Published • 19
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
Paper • 2412.09428 • Published • 7
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
Paper • 2412.09604 • Published • 35
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper • 2412.09871 • Published • 68
BrushEdit: All-In-One Image Inpainting and Editing
Paper • 2412.10316 • Published • 33
VidTok: A Versatile and Open-Source Video Tokenizer
Paper • 2412.13061 • Published • 6
Paper • 2412.13501 • Published • 17
Progressive Multimodal Reasoning via Active Retrieval
Paper • 2412.14835 • Published • 53
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
Paper • 2412.14475 • Published • 46
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
Paper • 2412.14233 • Published • 5