Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
Abstract
Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. The conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt, with a conditioning mechanism added to the model. One key and thus far relatively unexplored property of flow matching is that, unlike diffusion models, it does not constrain the source distribution to be noise. In this paper, we therefore propose a paradigm shift and ask whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, obviating the need for both the noise distribution and the conditioning mechanism. We present CrossFlow, a simple and general framework for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data and introduce a method to enable classifier-free guidance. Surprisingly, for text-to-image generation, CrossFlow with a vanilla transformer and no cross-attention slightly outperforms standard flow matching; it also scales better with training steps and model size, and supports latent arithmetic that yields semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we show that CrossFlow matches or outperforms the state of the art on various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
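To make the core idea concrete, below is a minimal sketch (not the authors' code) of a cross-modal flow matching training step: instead of flowing from Gaussian noise to images, the model flows from text latents to image latents, so no conditioning mechanism or cross-attention is needed. The module names (`text_var_encoder`, `img_encoder`, `velocity_model`) and their signatures are illustrative assumptions, the classifier-free guidance indicator described in the paper is omitted, and a standard linear (rectified-flow-style) interpolation is assumed.

```python
import torch

def crossflow_training_step(velocity_model, text_var_encoder, img_encoder,
                            text_tokens, images):
    # Encode both modalities into latents of the same shape.
    # The text Variational Encoder is assumed to return a KL regularization
    # term alongside its sampled latent, as the abstract suggests.
    z_text, kl_loss = text_var_encoder(text_tokens)   # source sample x_0
    z_img = img_encoder(images)                       # target sample x_1

    # Sample t in [0, 1] and build the linear interpolation
    # x_t = (1 - t) * x_0 + t * x_1 used in flow matching.
    t = torch.rand(z_img.shape[0], device=z_img.device)
    t_ = t.view(-1, *([1] * (z_img.dim() - 1)))
    x_t = (1.0 - t_) * z_text + t_ * z_img

    # Regress the predicted velocity onto the constant target (x_1 - x_0);
    # the text enters only through the source distribution, not as a condition.
    v_pred = velocity_model(x_t, t)
    fm_loss = ((v_pred - (z_img - z_text)) ** 2).mean()

    # A weighting on the KL term would typically be applied in practice.
    return fm_loss + kl_loss
```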
Community
CrossFlow is a versatile and efficient framework that establishes direct mappings between two modalities using standard flow matching, without requiring additional conditioning.
Using a vanilla transformer, it achieves state-of-the-art performance across diverse tasks—including T2I, depth estimation, and image captioning—without relying on cross-attention or task-specific architectures.
Webpage and code: https://cross-flow.github.io/
Good work! I am wondering, since you are using an image VAE decoder, whether the performance improvement from the VE comes from distribution similarity, since both the text and image embeddings are regularized by a similar KL divergence.
If you changed the decoder from a VAE decoder to something else, or even used a VAE with a different prior, would you still expect the VE to be important for performance?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Zoomed In, Diffused Out: Towards Local Degradation-Aware Multi-Diffusion for Extreme Image Super-Resolution (2024)
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models (2024)
- Conditional Text-to-Image Generation with Reference Guidance (2024)
- CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis (2024)
- ZoomLDM: Latent Diffusion Model for multi-scale image generation (2024)
- The Silent Prompt: Initial Noise as Implicit Guidance for Goal-Driven Image Generation (2024)
- Self-Guidance: Boosting Flow and Diffusion Generation on Their Own (2024)