TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones Paper • 2312.16862 • Published Dec 28, 2023 • 28
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action Paper • 2312.17172 • Published Dec 28, 2023 • 24
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers Paper • 2401.01974 • Published Jan 3, 2024 • 4
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations Paper • 2401.01885 • Published Jan 3, 2024 • 26
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models Paper • 2401.01335 • Published Jan 2, 2024 • 61
Improving Text Embeddings with Large Language Models Paper • 2401.00368 • Published Dec 31, 2023 • 73
Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk Paper • 2401.05033 • Published Jan 10, 2024 • 14
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding Paper • 2401.04575 • Published Jan 9, 2024 • 14
Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers Paper • 2401.04695 • Published Jan 9, 2024 • 8
Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively Paper • 2401.02955 • Published Jan 5, 2024 • 16
Understanding LLMs: A Comprehensive Overview from Training to Inference Paper • 2401.02038 • Published Jan 4, 2024 • 59
StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis Paper • 2401.17093 • Published Jan 30, 2024 • 18
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model Paper • 2401.16420 • Published Jan 29, 2024 • 54
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models Paper • 2401.15947 • Published Jan 29, 2024 • 46
Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization Paper • 2401.15914 • Published Jan 29, 2024 • 7
MM-LLMs: Recent Advances in MultiModal Large Language Models Paper • 2401.13601 • Published Jan 24, 2024 • 41
Small Language Model Meets with Reinforced Vision Vocabulary Paper • 2401.12503 • Published Jan 23, 2024 • 30
Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment Paper • 2401.12474 • Published Jan 23, 2024 • 33
Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text Paper • 2401.12070 • Published Jan 22, 2024 • 41
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities Paper • 2401.12168 • Published Jan 22, 2024 • 21
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training Paper • 2403.09611 • Published Mar 14, 2024 • 122
DeepSeek-VL: Towards Real-World Vision-Language Understanding Paper • 2403.05525 • Published Mar 8, 2024 • 38
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference Paper • 2403.04132 • Published Mar 7, 2024 • 38
FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models Paper • 2402.10986 • Published Feb 16, 2024 • 73
Linear Transformers with Learnable Kernel Functions are Better In-Context Models Paper • 2402.10644 • Published Feb 16, 2024 • 74
TravelPlanner: A Benchmark for Real-World Planning with Language Agents Paper • 2402.01622 • Published Feb 2, 2024 • 30
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement Paper • 2403.15042 • Published Mar 22, 2024 • 24
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments Paper • 2404.07972 • Published Apr 11, 2024 • 41
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models Paper • 2404.07973 • Published Apr 11, 2024 • 28
BRAVE: Broadening the visual encoding of vision-language models Paper • 2404.07204 • Published Apr 10, 2024 • 14
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation Paper • 2404.14396 • Published Apr 22, 2024 • 17
PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation Paper • 2404.13026 • Published Apr 19, 2024 • 21
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models Paper • 2404.12387 • Published Apr 18, 2024 • 35
BLINK: Multimodal Large Language Models Can See but Not Perceive Paper • 2404.12390 • Published Apr 18, 2024 • 23
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation Paper • 2404.19752 • Published Apr 30, 2024 • 19
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension Paper • 2404.16790 • Published Apr 25, 2024 • 7
Many-Shot In-Context Learning in Multimodal Foundation Models Paper • 2405.09798 • Published May 2024 • 25