InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding Paper • 2403.15377 • Published Mar 22 • 16
Meta Llama 3 Collection • This collection hosts the transformers and original repos of the Meta Llama 3 and Llama Guard 2 releases • 5 items • Updated Apr 18 • 536
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments Paper • 2404.07972 • Published Apr 11 • 41
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models Paper • 2402.07865 • Published Feb 12 • 11
Qwen1.5 Collection • Qwen1.5 is the improved version of Qwen, the large language model series developed by Alibaba Cloud. • 55 items • Updated 12 days ago • 173
Controllable Human-Object Interaction Synthesis Paper • 2312.03913 • Published Dec 6, 2023 • 22
Dolphins: Multimodal Language Model for Driving Paper • 2312.00438 • Published Dec 1, 2023 • 12
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation Paper • 2311.18775 • Published Nov 30, 2023 • 6
Doppelgangers: Learning to Disambiguate Images of Similar Structures Paper • 2309.02420 • Published Sep 5, 2023 • 9
Emergence of Segmentation with Minimalistic White-Box Transformers Paper • 2308.16271 • Published Aug 30, 2023 • 13
Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation Paper • 2303.16456 • Published Mar 29, 2023 • 1
StableVideo: Text-driven Consistency-aware Diffusion Video Editing Paper • 2308.09592 • Published Aug 18, 2023 • 2
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding Paper • 2307.16449 • Published Jul 31, 2023 • 14
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World Paper • 2308.01907 • Published Aug 3, 2023 • 10
To Adapt or Not to Adapt? Real-Time Adaptation for Semantic Segmentation Paper • 2307.15063 • Published Jul 27, 2023 • 15
DreamTeacher: Pretraining Image Backbones with Deep Generative Models Paper • 2307.07487 • Published Jul 14, 2023 • 19