DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion Paper • 2309.12424 • Published Sep 21, 2023 • 11
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning Paper • 2309.15091 • Published Sep 26, 2023 • 32
End-to-End Speech Recognition Contextualization with Large Language Models Paper • 2309.10917 • Published Sep 19, 2023 • 9
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation Paper • 2307.06942 • Published Jul 13, 2023 • 22