Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding Paper • 2412.00493 • Published 18 days ago • 16
Byte Latent Transformer: Patches Scale Better Than Tokens Paper • 2412.09871 • Published 6 days ago • 55
SCBench: A KV Cache-Centric Analysis of Long-Context Methods Paper • 2412.10319 • Published 5 days ago • 8
Apollo: An Exploration of Video Understanding in Large Multimodal Models Paper • 2412.10360 • Published 5 days ago • 121
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation Paper • 2412.03069 • Published 15 days ago • 30
VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper • 2412.04467 • Published 13 days ago • 103
EXAONE 3.5: Series of Large Language Models for Real-world Use Cases Paper • 2412.04862 • Published 13 days ago • 46
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality Paper • 2405.21060 • Published May 31 • 63
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning Paper • 2405.12130 • Published May 20 • 46
FIFO-Diffusion: Generating Infinite Videos from Text without Training Paper • 2405.11473 • Published May 19 • 53