Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs Paper • 2504.00072 • Published 17 days ago
UniTok: A Unified Tokenizer for Visual Generation and Understanding Paper • 2502.20321 • Published Feb 27
VideoAgent: Long-form Video Understanding with Large Language Model as Agent Paper • 2403.10517 • Published Mar 15, 2024
CoVR: Learning Composed Video Retrieval from Web Video Captions Paper • 2308.14746 • Published Aug 28, 2023
Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives Paper • 2307.05473 • Published Jul 11, 2023