- Stylus: Automatic Adapter Selection for Diffusion Models (arXiv:2404.18928, Apr 2024)
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (arXiv:2404.16821, Apr 2024)
- The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions (arXiv:2404.13208, Apr 2024)
- SpaceByte: Towards Deleting Tokenization from Large Language Modeling (arXiv:2404.14408, Apr 2024)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation (arXiv:2404.14396, Apr 2024)
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (arXiv:2404.14219, Apr 2024)
- TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models (arXiv:2404.09204, Apr 2024)
- On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation (arXiv:2404.08540, Apr 2024)
- BRAVE: Broadening the Visual Encoding of Vision-Language Models (arXiv:2404.07204, Apr 2024)
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (arXiv:2404.07972, Apr 2024)
- Audio Dialogues: Dialogues Dataset for Audio and Music Understanding (arXiv:2404.07616, Apr 2024)
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (arXiv:2404.06512, Apr 2024)
- Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models (arXiv:2404.06209, Apr 2024)
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding (arXiv:2404.05726, Apr 2024)
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (arXiv:2404.05719, Apr 2024)
- MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens (arXiv:2404.03413, Apr 2024)
- Mixture-of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models (arXiv:2404.02258, Apr 2024)
- Transformer-Lite: High-Efficiency Deployment of Large Language Models on Mobile Phone GPUs (arXiv:2403.20041, Mar 29, 2024)
- Gamba: Marry Gaussian Splatting with Mamba for Single-View 3D Reconstruction (arXiv:2403.18795, Mar 27, 2024)
- Fully-Fused Multi-Layer Perceptrons on Intel Data Center GPUs (arXiv:2403.17607, Mar 26, 2024)
- Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference (arXiv:2403.14520, Mar 21, 2024)
- Mora: Enabling Generalist Video Generation via a Multi-Agent Framework (arXiv:2403.13248, Mar 20, 2024)
- GiT: Towards Generalist Vision Transformer through Universal Language Interface (arXiv:2403.09394, Mar 14, 2024)
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-Free Document Understanding (arXiv:2403.12895, Mar 19, 2024)
- FeatUp: A Model-Agnostic Framework for Features at Any Resolution (arXiv:2403.10516, Mar 15, 2024)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (arXiv:2403.09611, Mar 14, 2024)
- LLaVA-UHD: An LMM Perceiving Any Aspect Ratio and High-Resolution Images (arXiv:2403.11703, Mar 18, 2024)
- Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring (arXiv:2403.09333, Mar 14, 2024)
- EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba (arXiv:2403.09977, Mar 15, 2024)
- SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents (arXiv:2403.08715, Mar 13, 2024)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models (arXiv:2403.06764, Mar 11, 2024)
- InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding (arXiv:2403.01487, Mar 3, 2024)
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (arXiv:2402.17764, Feb 27, 2024)
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases (arXiv:2402.14905, Feb 22, 2024)
- FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models (arXiv:2402.10986, Feb 16, 2024)
- PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-Language Adapter (arXiv:2402.10896, Feb 16, 2024)
- Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models (arXiv:2402.07033, Feb 10, 2024)
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-Modal Large Language Models (arXiv:2402.05935, Feb 8, 2024)
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss (arXiv:2402.05008, Feb 7, 2024)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models (arXiv:2402.03749, Feb 6, 2024)
- OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models (arXiv:2402.01739, Jan 29, 2024)