VLM - a kevin1020 Collection

VLM

updated Oct 16

Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs

Paper • 2403.12596 • Published Mar 19 • 9
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

Paper • 2404.13013 • Published Apr 19 • 30
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

Paper • 2404.16994 • Published Apr 25 • 35
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

Paper • 2405.14129 • Published May 23 • 12
Dense Connector for MLLMs

Paper • 2405.13800 • Published May 22 • 22
Merlin: Empowering Multimodal LLMs with Foresight Minds

Paper • 2312.00589 • Published Nov 30, 2023 • 24
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

Paper • 2407.15754 • Published Jul 22 • 19
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Paper • 2407.15841 • Published Jul 22 • 40
Efficient Inference of Vision Instruction-Following Models with Elastic Cache

Paper • 2407.18121 • Published Jul 25 • 16
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

Paper • 2409.01071 • Published Sep 2 • 27
LongVLM: Efficient Long Video Understanding via Large Language Models

Paper • 2404.03384 • Published Apr 4
Visual Context Window Extension: A New Perspective for Long Video Understanding

Paper • 2409.20018 • Published Sep 30 • 9
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Paper • 2410.10594 • Published Oct 14 • 24