AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models Paper • 2406.10900 • Published Jun 16, 2024 • 5
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding Paper • 2406.19389 • Published Jun 27, 2024 • 47
Instruction Pre-Training: Language Models are Supervised Multitask Learners Paper • 2406.14491 • Published Jun 20, 2024 • 75
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 24, 2024 • 48
VoCo-LLaMA: Towards Vision Compression with Large Language Models Paper • 2406.12275 • Published Jun 18, 2024 • 28
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels Paper • 2406.09415 • Published Jun 13, 2024 • 47
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs Paper • 2406.07476 • Published Jun 11, 2024 • 30
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions Paper • 2406.04325 • Published Jun 6, 2024 • 69
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models Paper • 2405.15738 • Published May 24, 2024 • 43
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation Paper • 2404.19752 • Published Apr 30, 2024 • 20
Better & Faster Large Language Models via Multi-token Prediction Paper • 2404.19737 • Published Apr 30, 2024 • 65
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning Paper • 2404.16994 • Published Apr 25, 2024 • 33
Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding Paper • 2404.16710 • Published Apr 25, 2024 • 56
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework Paper • 2404.14619 • Published Apr 22, 2024 • 124
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation Paper • 2404.14396 • Published Apr 22, 2024 • 17
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published Apr 22, 2024 • 240
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length Paper • 2404.08801 • Published Apr 12, 2024 • 62
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models Paper • 2404.07973 • Published Apr 11, 2024 • 28
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention Paper • 2404.07143 • Published Apr 10, 2024 • 97
RULER: What's the Real Context Size of Your Long-Context Language Models? Paper • 2404.06654 • Published Apr 9, 2024 • 32
BRAVE: Broadening the visual encoding of vision-language models Paper • 2404.07204 • Published Apr 10, 2024 • 14
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance Paper • 2404.04125 • Published Apr 4, 2024 • 27
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens Paper • 2404.03413 • Published Apr 4, 2024 • 22
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction Paper • 2404.02905 • Published Apr 3, 2024 • 61
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models Paper • 2404.02258 • Published Apr 2, 2024 • 102
Bigger is not Always Better: Scaling Properties of Latent Diffusion Models Paper • 2404.01367 • Published Apr 1, 2024 • 19
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images Paper • 2403.11703 • Published Mar 18, 2024 • 13
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training Paper • 2403.09611 • Published Mar 14, 2024 • 123
Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings Paper • 2403.07750 • Published Mar 12, 2024 • 19
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models Paper • 2403.06764 • Published Mar 11, 2024 • 24
DeepSeek-VL: Towards Real-World Vision-Language Understanding Paper • 2403.05525 • Published Mar 8, 2024 • 39
Data Engineering for Scaling Language Models to 128K Context Paper • 2402.10171 • Published Feb 15, 2024 • 18
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models Paper • 2402.17177 • Published Feb 27, 2024 • 88
Sora Reference Papers Collection • A collection of all papers referenced in OpenAI's "Video generation models as world simulators" technical report • openai.com/sora • 30 items • Updated Feb 20, 2024 • 51
Animated Stickers: Bringing Stickers to Life with Video Diffusion Paper • 2402.06088 • Published Feb 8, 2024 • 9
Deconstructing Denoising Diffusion Models for Self-Supervised Learning Paper • 2401.14404 • Published Jan 25, 2024 • 16
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models Paper • 2401.06066 • Published Jan 11, 2024 • 36
Analyzing and Improving the Training Dynamics of Diffusion Models Paper • 2312.02696 • Published Dec 5, 2023 • 31
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions Paper • 2312.08578 • Published Dec 14, 2023 • 15
PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models Paper • 2312.13964 • Published Dec 21, 2023 • 16
Diffusion Model Alignment Using Direct Preference Optimization Paper • 2311.12908 • Published Nov 21, 2023 • 47
CodeFusion: A Pre-trained Diffusion Model for Code Generation Paper • 2310.17680 • Published Oct 26, 2023 • 68
Octopus: Embodied Vision-Language Programmer from Environmental Feedback Paper • 2310.08588 • Published Oct 12, 2023 • 32
Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion Paper • 2310.03502 • Published Oct 5, 2023 • 75
PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models Paper • 2309.05793 • Published Sep 11, 2023 • 50