Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22 • 110
Wavelets Are All You Need for Autoregressive Image Generation Paper • 2406.19997 • Published Jun 28 • 28
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Paper • 2408.08872 • Published Aug 16 • 96
Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents Paper • 2408.07060 • Published Aug 13 • 39
Medical SAM 2: Segment medical images as video via Segment Anything Model 2 Paper • 2408.00874 • Published Aug 1 • 41
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference Paper • 2403.14520 • Published Mar 21 • 32
🍃 MINT-1T Collection Data for "MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens" • 13 items • Updated Jul 24 • 49
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model Paper • 2407.16198 • Published Jul 23 • 13
POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation Paper • 2407.14931 • Published Jul 20 • 20
TabReD: A Benchmark of Tabular Machine Learning in-the-Wild Paper • 2406.19380 • Published Jun 27 • 47
Make It Count: Text-to-Image Generation with an Accurate Number of Objects Paper • 2406.10210 • Published Jun 14 • 76
Aligning Diffusion Models with Noise-Conditioned Perception Paper • 2406.17636 • Published Jun 25 • 26
HARE: HumAn pRiors, a key to small language model Efficiency Paper • 2406.11410 • Published Jun 17 • 38
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing Paper • 2406.10601 • Published Jun 15 • 65
∇²DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials Paper • 2406.14347 • Published Jun 20 • 99
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks Paper • 2311.06242 • Published Nov 10, 2023 • 79
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack Paper • 2406.10149 • Published Jun 14 • 48
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning Paper • 2406.08973 • Published Jun 13 • 85
Vikhr: The Family of Open-Source Instruction-Tuned Large Language Models for Russian Paper • 2405.13929 • Published May 22 • 51
M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition Paper • 2401.11649 • Published Jan 22 • 3
MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation Paper • 2404.11565 • Published Apr 17 • 14
Article Introducing Idefics2: A Powerful 8B Vision-Language Model for the community • Published Apr 15 • 161
SigLIP Collection Contrastive (sigmoid) image-text models from https://arxiv.org/abs/2303.15343 • 8 items • Updated Jul 31 • 34
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD Paper • 2404.06512 • Published Apr 9 • 29
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models Paper • 2401.15947 • Published Jan 29 • 48
MM-LLMs: Recent Advances in MultiModal Large Language Models Paper • 2401.13601 • Published Jan 24 • 44
Mamba: Linear-Time Sequence Modeling with Selective State Spaces Paper • 2312.00752 • Published Dec 1, 2023 • 138
DocLLM: A layout-aware generative language model for multimodal document understanding Paper • 2401.00908 • Published Dec 31, 2023 • 178
FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline Paper • 2311.13073 • Published Nov 22, 2023 • 56