Kevin16's Collections
Video Understanding
Vript: A Video Is Worth Thousands of Words
Paper • 2406.06040 • Published • 25
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Paper • 2406.04325 • Published • 72
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Paper • 2406.01574 • Published • 43
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Paper • 2405.21075 • Published • 20
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Paper • 2405.20340 • Published • 19
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 87
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Paper • 2404.16994 • Published • 35
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Paper • 2404.13013 • Published • 30
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
Paper • 2405.14129 • Published • 12
Video ReCap: Recursive Captioning of Hour-Long Videos
Paper • 2402.13250 • Published • 25
A Simple LLM Framework for Long-Range Video Question-Answering
Paper • 2312.17235 • Published
Retrieval-Augmented Egocentric Video Captioning
Paper • 2401.00789 • Published
Distilling Vision-Language Models on Millions of Videos
Paper • 2401.06129 • Published • 15
World Model on Million-Length Video And Language With RingAttention
Paper • 2402.08268 • Published • 37
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models
Paper • 2406.16338 • Published • 25
Long Context Transfer from Language to Vision
Paper • 2406.16852 • Published • 32
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper • 2406.16860 • Published • 59
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Paper • 2407.15841 • Published • 40
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Paper • 2407.15754 • Published • 20
PiTe: Pixel-Temporal Alignment for Large Video-Language Model
Paper • 2409.07239 • Published • 11
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
Paper • 2410.02740 • Published • 52
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Paper • 2412.09596 • Published • 92