Daily Papers

by AK and the research community

Mar 26

Submitted by

guyuchao

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

·
3 authors

Submitted by

HongchengGao

Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation

·
9 authors

Submitted by

Row11n

CoMP: Continual Multimodal Pre-training for Vision Foundation Models

·
5 authors

Submitted by

phillipinseoul

Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing

·
4 authors

Submitted by

bfshi

Scaling Vision Pre-Training to 4K Resolution

·
11 authors

Submitted by

zichenwen

Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation

·
10 authors

Submitted by

richardxp888

MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

·
7 authors

Submitted by

akhaliq

Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking

·
8 authors

Submitted by

chuonghm

CoLLM: A Large Language Model for Composed Image Retrieval

·
8 authors

Submitted by

akhaliq

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

·
12 authors

Submitted by

3587jjh

Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models

·
4 authors

Submitted by

akhaliq

WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation

·
8 authors

Submitted by

Ningyu

LookAhead Tuning: Safer Language Models via Partial Answer Previews

·
10 authors

Submitted by

pranamanam

Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation

·
4 authors

Submitted by

BestWishYsh

FullDiT: Multi-Task Video Generative Foundation Model with Full Attention

·
9 authors

Submitted by

qth

Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

·
9 authors

Submitted by

shadowpa0327

xKV: Cross-Layer SVD for KV-Cache Compression

·
7 authors

Submitted by

haoyuhsu

PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos

·
6 authors

Submitted by

wish44165

Strong Baseline: Multi-UAV Tracking via YOLOv12 with BoT-SORT-ReID

·
1 author

Submitted by

zhehuderek

When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making

·
3 authors

Submitted by

tuvu

Efficient Model Development through Fine-tuning Transfer

·
5 authors

Submitted by

DmitryRyumin

FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images

·
13 authors

Submitted by

lx865712528

Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling

·
4 authors

Submitted by

CharlesChen2023

Frequency Dynamic Convolution for Dense Image Prediction

·
5 authors

Submitted by

mwmathis

LLaVAction: evaluating and training multi-modal large language models for action recognition

·
4 authors

Submitted by

wangyi111

Towards a Unified Copernicus Foundation Model for Earth Vision

·
11 authors

Submitted by

stojnvla

LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation

·
4 authors

Submitted by

rishitdagli

Can Vision-Language Models Answer Face to Face Questions in the Real-World?

·
6 authors

Submitted by

ikodoh

ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

·
7 authors

Submitted by

taeyeop

Any6D: Model-free 6D Pose Estimation of Novel Objects

·
6 authors

Submitted by

akhaliq

FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement

·
7 authors

Submitted by

yaraalaa0

Co-SemDepth: Fast Joint Semantic Segmentation and Depth Estimation on Aerial Images

·
2 authors

Submitted by

LUC1O

OpenCity3D: What do Vision-Language Models know about Urban Environments?

·
5 authors

Submitted by

gym890

DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis

·
7 authors
