- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters (Paper • 2402.04252 • Published • 25)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models (Paper • 2402.03749 • Published • 12)
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (Paper • 2402.04615 • Published • 39)
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss (Paper • 2402.05008 • Published • 20)
Collections
Collections including paper arxiv:2404.07204
Collection:
- NVLM: Open Frontier-Class Multimodal LLMs (Paper • 2409.11402 • Published • 72)
- BRAVE: Broadening the visual encoding of vision-language models (Paper • 2404.07204 • Published • 18)
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (Paper • 2403.18814 • Published • 44)
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (Paper • 2409.17146 • Published • 103)
Collection:
- ViTAR: Vision Transformer with Any Resolution (Paper • 2403.18361 • Published • 52)
- BRAVE: Broadening the visual encoding of vision-language models (Paper • 2404.07204 • Published • 18)
- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data (Paper • 2404.15653 • Published • 26)
- Chameleon: Mixed-Modal Early-Fusion Foundation Models (Paper • 2405.09818 • Published • 126)
Collection:
- CodecLM: Aligning Language Models with Tailored Synthetic Data (Paper • 2404.05875 • Published • 16)
- BRAVE: Broadening the visual encoding of vision-language models (Paper • 2404.07204 • Published • 18)
- ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback (Paper • 2404.07987 • Published • 47)
Collection:
- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (Paper • 2404.05961 • Published • 64)
- OmniFusion Technical Report (Paper • 2404.06212 • Published • 74)
- Adapting LLaMA Decoder to Vision Transformer (Paper • 2404.06773 • Published • 17)
- BRAVE: Broadening the visual encoding of vision-language models (Paper • 2404.07204 • Published • 18)
Collection:
- No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance (Paper • 2404.04125 • Published • 27)
- CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching (Paper • 2404.03653 • Published • 33)
- Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models (Paper • 2404.02747 • Published • 11)
- 3D Congealing: 3D-Aware Image Alignment in the Wild (Paper • 2404.02125 • Published • 7)
Collection:
- TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones (Paper • 2312.16862 • Published • 30)
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action (Paper • 2312.17172 • Published • 26)
- Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers (Paper • 2401.01974 • Published • 5)
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations (Paper • 2401.01885 • Published • 27)