EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 26
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 13
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 43
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 22
Collections including paper arxiv:2405.18669

OpenBuddy/openbuddy-codellama2-34b-v11.1-bf16
Text Generation • Updated • 3.07k • 11
mistralai/Codestral-22B-v0.1
Text Generation • Updated • 10.7k • 1.23k
deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
Text Generation • Updated • 300k • 406
Magpie-Align/Magpie-Qwen2.5-Coder-Pro-300K-v0.1
Viewer • Updated • 300k • 334 • 3

iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 16
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Paper • 2405.15574 • Published • 55
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 88
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 32

parler-tts/parler_tts_mini_v0.1
Text-to-Speech • Updated • 9.73k • 349
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models
Paper • 2405.08317 • Published • 13
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Paper • 2405.18669 • Published • 12
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Paper • 2406.02430 • Published • 34