LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale • Paper • 2504.16030 • Published 1 day ago • 16
Describe Anything: Detailed Localized Image and Video Captioning • Paper • 2504.16072 • Published 1 day ago • 42
AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis • Paper • 2504.13157 • Published 6 days ago • 19
Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking • Paper • 2504.09228 • Published 12 days ago • 4
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models • Paper • 2504.11468 • Published 14 days ago • 26
BlockGaussian: Efficient Large-Scale Scene Novel View Synthesis via Adaptive Block-Based Gaussian Splatting • Paper • 2504.09048 • Published 12 days ago • 7
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer • Paper • 2504.10462 • Published 9 days ago • 15
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models • Paper • 2504.10479 • Published 9 days ago • 239
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters • Paper • 2504.08791 • Published 17 days ago • 123
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model • Paper • 2504.07615 • Published 14 days ago • 30
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model • Paper • 2504.08685 • Published 13 days ago • 121
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning • Paper • 2504.07128 • Published 22 days ago • 82
WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments • Paper • 2504.03886 • Published 19 days ago • 10
SmolVLM: Redefining small and efficient multimodal models • Paper • 2504.05299 • Published 16 days ago • 170