LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding • arXiv:2306.17107 • Published Jun 29, 2023
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities • arXiv:2308.12966 • Published Aug 24, 2023
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models • arXiv:2401.15947 • Published Jan 29, 2024
DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding • arXiv:2311.11810 • Published Nov 20, 2023
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding • arXiv:2210.03347 • Published Oct 7, 2022
Reading Order Matters: Information Extraction from Visually-rich Documents by Token Path Prediction • arXiv:2310.11016 • Published Oct 17, 2023
Nougat: Neural Optical Understanding for Academic Documents • arXiv:2308.13418 • Published Aug 25, 2023
MoAI: Mixture of All Intelligence for Large Language and Vision Models • arXiv:2403.07508 • Published Mar 12, 2024
ScreenAI: A Vision-Language Model for UI and Infographics Understanding • arXiv:2402.04615 • Published Feb 7, 2024