VLFM - a danielz01 Collection

danielz01 's Collections

VLFM

Agents

Image Generation

ViT

VLFM

updated May 30

Kosmos-2.5: A Multimodal Literate Model

Paper • 2309.11419 • Published Sep 20, 2023 • 50
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

Paper • 2311.05698 • Published Nov 9, 2023 • 9
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Paper • 2311.06242 • Published Nov 10, 2023 • 86
PolyMaX: General Dense Prediction with Mask Transformer

Paper • 2311.05770 • Published Nov 9, 2023 • 6
Learning Vision from Models Rivals Learning Vision from Data

Paper • 2312.17742 • Published Dec 28, 2023 • 15
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Paper • 2401.12168 • Published Jan 22 • 26
A Survey of Resource-efficient LLM and Multimodal Foundation Models

Paper • 2401.08092 • Published Jan 16 • 3
From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

Paper • 2401.15071 • Published Jan 26 • 35
Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization

Paper • 2401.15914 • Published Jan 29 • 7
MouSi: Poly-Visual-Expert Vision-Language Models

Paper • 2401.17221 • Published Jan 30 • 8
StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis

Paper • 2401.17093 • Published Jan 30 • 19
DataComp: In search of the next generation of multimodal datasets

Paper • 2304.14108 • Published Apr 27, 2023 • 2
Question Aware Vision Transformer for Multimodal Reasoning

Paper • 2402.05472 • Published Feb 8 • 8
Demystifying CLIP Data

Paper • 2309.16671 • Published Sep 28, 2023 • 20
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Paper • 2402.12226 • Published Feb 19 • 41
CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models

Paper • 2402.15021 • Published Feb 22 • 12
DeepSeek-VL: Towards Real-World Vision-Language Understanding

Paper • 2403.05525 • Published Mar 8 • 39
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding

Paper • 2403.01487 • Published Mar 3 • 14
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Paper • 2404.07973 • Published Apr 11 • 30
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

Paper • 2404.13013 • Published Apr 19 • 30
An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27 • 87
Dense Connector for MLLMs

Paper • 2405.13800 • Published May 22 • 22