-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 22 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 10 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 33 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 19
Collections
Discover the best community collections!
Collections including paper arxiv:2405.15738
-
Self-Rewarding Language Models
Paper • 2401.10020 • Published • 138 -
Orion-14B: Open-source Multilingual Large Language Models
Paper • 2401.12246 • Published • 10 -
MambaByte: Token-free Selective State Space Model
Paper • 2401.13660 • Published • 47 -
MM-LLMs: Recent Advances in MultiModal Large Language Models
Paper • 2401.13601 • Published • 42
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Paper • 2406.06525 • Published • 62 -
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
Paper • 2406.06469 • Published • 22 -
Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models
Paper • 2406.04271 • Published • 27 -
Block Transformer: Global-to-Local Language Modeling for Fast Inference
Paper • 2406.02657 • Published • 36
-
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Paper • 2311.17049 • Published -
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Paper • 2405.04434 • Published • 12 -
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Paper • 2303.17376 • Published -
Sigmoid Loss for Language Image Pre-Training
Paper • 2303.15343 • Published • 4
-
ViTAR: Vision Transformer with Any Resolution
Paper • 2403.18361 • Published • 49 -
BRAVE: Broadening the visual encoding of vision-language models
Paper • 2404.07204 • Published • 15 -
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
Paper • 2404.15653 • Published • 25 -
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper • 2405.09818 • Published • 117
-
RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization
Paper • 2403.00483 • Published • 9 -
OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on
Paper • 2403.01779 • Published • 26 -
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
Paper • 2401.11605 • Published • 19 -
FiT: Flexible Vision Transformer for Diffusion Model
Paper • 2402.12376 • Published • 48
-
World Model on Million-Length Video And Language With RingAttention
Paper • 2402.08268 • Published • 36 -
Improving Text Embeddings with Large Language Models
Paper • 2401.00368 • Published • 78 -
Chain-of-Thought Reasoning Without Prompting
Paper • 2402.10200 • Published • 92 -
FiT: Flexible Vision Transformer for Diffusion Model
Paper • 2402.12376 • Published • 48