Vision and language - a moegi161 Collection

updated Jun 5

No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

Paper • 2404.04125 • Published Apr 4 • 27
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

Paper • 2404.03653 • Published Apr 4 • 33
Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models

Paper • 2404.02747 • Published Apr 3 • 11
3D Congealing: 3D-Aware Image Alignment in the Wild

Paper • 2404.02125 • Published Apr 2 • 7
BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

Paper • 2404.04544 • Published Apr 6 • 20
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

Paper • 2404.07987 • Published Apr 11 • 47
BRAVE: Broadening the visual encoding of vision-language models

Paper • 2404.07204 • Published Apr 10 • 18
RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion

Paper • 2404.07199 • Published Apr 10 • 25
Learning to Route Among Specialized Experts for Zero-Shot Generalization

Paper • 2402.05859 • Published Feb 8 • 5
Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset

Paper • 2403.00587 • Published Mar 1
ReGround: Improving Textual and Spatial Grounding at No Cost

Paper • 2403.13589 • Published Mar 20
FlexCap: Generating Rich, Localized, and Flexible Captions in Images

Paper • 2403.12026 • Published Mar 18 • 1
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Paper • 2403.03206 • Published Mar 5 • 60
Editable Image Elements for Controllable Synthesis

Paper • 2404.16029 • Published Apr 24 • 10
Move Anything with Layered Scene Diffusion

Paper • 2404.07178 • Published Apr 10
Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling

Paper • 2405.21048 • Published May 31 • 12