vlms - a cooleel Collection

Models
Datasets
Spaces
Posts
Docs
Enterprise
Pricing
Log In
Sign Up

cooleel 's Collections

RL

general

LLMs

Agent

vlms

DocAI

vlms

updated 15 days ago

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

Paper • 2410.16153 • Published Oct 21, 2024 • 44
AutoTrain: No-code training for state-of-the-art models

Paper • 2410.15735 • Published Oct 21, 2024 • 60
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

Paper • 2410.12787 • Published Oct 16, 2024 • 31
LEOPARD : A Vision Language Model For Text-Rich Multi-Image Tasks

Paper • 2410.01744 • Published Oct 2, 2024 • 26
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models

Paper • 2410.14059 • Published Oct 17, 2024 • 60
NVLM: Open Frontier-Class Multimodal LLMs

Paper • 2409.11402 • Published Sep 17, 2024 • 74
MIO: A Foundation Model on Multimodal Tokens

Paper • 2409.17692 • Published Sep 26, 2024 • 53
Emu3: Next-Token Prediction is All You Need

Paper • 2409.18869 • Published Sep 27, 2024 • 94
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 111
Analyzing The Language of Visual Tokens

Paper • 2411.05001 • Published Nov 7, 2024 • 24
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination

Paper • 2411.03823 • Published Nov 6, 2024 • 48
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration

Paper • 2410.02367 • Published Oct 3, 2024 • 48
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Paper • 2410.05160 • Published Oct 7, 2024 • 4
Pixtral 12B

Paper • 2410.07073 • Published Oct 9, 2024 • 65
LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models

Paper • 2410.09732 • Published Oct 13, 2024 • 55
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Paper • 2410.17247 • Published Oct 22, 2024 • 47
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration

Paper • 2411.10958 • Published Nov 17, 2024 • 55
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

Paper • 2411.10669 • Published Nov 16, 2024 • 10
Autoregressive Models in Vision: A Survey

Paper • 2411.05902 • Published Nov 8, 2024 • 18
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation

Paper • 2411.04997 • Published Nov 7, 2024 • 39
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Paper • 2411.04996 • Published Nov 7, 2024 • 51
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 148
Progressive Multimodal Reasoning via Active Retrieval

Paper • 2412.14835 • Published Dec 19, 2024 • 73
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

Paper • 2412.14475 • Published Dec 19, 2024 • 54
POINTS1.5: Building a Vision-Language Model towards Real World Applications

Paper • 2412.08443 • Published Dec 11, 2024 • 38
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

Paper • 2412.18619 • Published Dec 16, 2024 • 57
FastVLM: Efficient Vision Encoding for Vision Language Models

Paper • 2412.13303 • Published Dec 17, 2024 • 13
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

Paper • 2501.08828 • Published Jan 15 • 31
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

Paper • 2412.07626 • Published Dec 10, 2024 • 22
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation

Paper • 2412.09585 • Published Dec 12, 2024 • 11
Smaller Language Models Are Better Instruction Evolvers

Paper • 2412.11231 • Published Dec 15, 2024 • 28
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

Paper • 2502.09696 • Published Feb 13 • 41
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Paper • 2502.10391 • Published Feb 14 • 33
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs

Paper • 2502.17422 • Published Feb 24 • 7
Introducing Visual Perception Token into Multimodal Large Language Model

Paper • 2502.17425 • Published Feb 24 • 14
KV-Edit: Training-Free Image Editing for Precise Background Preservation

Paper • 2502.17363 • Published Feb 24 • 35
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

Paper • 2502.18411 • Published about 1 month ago • 71
MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge

Paper • 2502.19870 • Published 29 days ago • 4
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think

Paper • 2502.20172 • Published 29 days ago • 28
UniTok: A Unified Tokenizer for Visual Generation and Understanding

Paper • 2502.20321 • Published 29 days ago • 29
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models

Paper • 2502.16033 • Published Feb 22 • 17
VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues

Paper • 2502.12084 • Published Feb 17 • 29
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Paper • 2502.14786 • Published Feb 20 • 138
LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models

Paper • 2502.14834 • Published Feb 20 • 24
Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models

Paper • 2502.14191 • Published Feb 20 • 7
Scaling Pre-training to One Hundred Billion Data for Vision Language Models

Paper • 2502.07617 • Published Feb 11 • 29
Qwen2.5-VL Technical Report

Paper • 2502.13923 • Published Feb 19 • 175
Magma: A Foundation Model for Multimodal AI Agents

Paper • 2502.13130 • Published Feb 18 • 57
YOLOv12: Attention-Centric Real-Time Object Detectors

Paper • 2502.12524 • Published Feb 18 • 10
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

Paper • 2502.08826 • Published Feb 12 • 17
R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

Paper • 2502.20395 • Published 29 days ago • 45
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning

Paper • 2502.19634 • Published 29 days ago • 62
MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing

Paper • 2502.21291 • Published 28 days ago • 4
Tell me why: Visual foundation models as self-explainable classifiers

Paper • 2502.19577 • Published 30 days ago • 10
ABC: Achieving Better Control of Multimodal Embeddings using VLMs

Paper • 2503.00329 • Published 27 days ago • 18
Should VLMs be Pre-trained with Image Data?

Paper • 2503.07603 • Published 18 days ago • 3
DiffCLIP: Differential Attention Meets CLIP

Paper • 2503.06626 • Published 19 days ago • 4

Collection guide
Browse collections

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs