Submitted by guyuchao 59 Long-Context Autoregressive Video Modeling with Next-Frame Prediction · 3 authors 2
Submitted by HongchengGao 26 Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation · 9 authors 4
Submitted by Row11n 26 CoMP: Continual Multimodal Pre-training for Vision Foundation Models · 5 authors 1
Submitted by phillipinseoul 25 Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing · 4 authors 4
Submitted by zichenwen 15 Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation · 10 authors 3
Submitted by richardxp888 13 MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding · 7 authors 2
Submitted by akhaliq 12 Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking · 8 authors 2
Submitted by akhaliq 8 ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning · 12 authors 2
Submitted by 3587jjh 6 Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models · 4 authors 1
Submitted by akhaliq 5 WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation · 8 authors 2
Submitted by Ningyu 4 LookAhead Tuning: Safer Language Models via Partial Answer Previews · 10 authors 2
Submitted by pranamanam 4 Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation · 4 authors 2
Submitted by BestWishYsh 3 FullDiT: Multi-Task Video Generative Foundation Model with Full Attention · 9 authors 2
Submitted by qth 3 Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation · 9 authors 2
Submitted by haoyuhsu 3 PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos · 6 authors 2
Submitted by wish44165 3 Strong Baseline: Multi-UAV Tracking via YOLOv12 with BoT-SORT-ReID · 1 author 3
Submitted by zhehuderek 3 When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making · 3 authors 2
Submitted by DmitryRyumin 2 FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images · 13 authors 2
Submitted by lx865712528 2 Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling · 4 authors 2
Submitted by CharlesChen2023 2 Frequency Dynamic Convolution for Dense Image Prediction · 5 authors 2
Submitted by mwmathis 2 LLaVAction: evaluating and training multi-modal large language models for action recognition · 4 authors 2
Submitted by wangyi111 2 Towards a Unified Copernicus Foundation Model for Earth Vision · 11 authors 2
Submitted by stojnvla 1 LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation · 4 authors 2
Submitted by rishitdagli 1 Can Vision-Language Models Answer Face to Face Questions in the Real-World? · 6 authors 2
Submitted by ikodoh 1 ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models · 7 authors 1
Submitted by akhaliq 1 FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement · 7 authors 2
Submitted by yaraalaa0 - Co-SemDepth: Fast Joint Semantic Segmentation and Depth Estimation on Aerial Images · 2 authors 2
Submitted by LUC1O - OpenCity3D: What do Vision-Language Models know about Urban Environments? · 5 authors 2
Submitted by gym890 - DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis · 7 authors 2