Abstract
aMUSEd, a masked image model with 10% of MUSE's parameters, generates images quickly and efficiently, requiring fewer inference steps and offering easier fine-tuning compared to latent diffusion.
We present aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. With 10% of MUSE's parameters, aMUSEd is focused on fast image generation. We believe MIM is under-explored compared to latent diffusion, the prevailing approach for text-to-image generation. Compared to latent diffusion, MIM requires fewer inference steps and is more interpretable. Additionally, MIM can be fine-tuned to learn additional styles with only a single image. We hope to encourage further exploration of MIM by demonstrating its effectiveness on large-scale text-to-image generation and releasing reproducible training code. We also release checkpoints for two models that directly produce images at 256×256 and 512×512 resolutions.
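The "fewer inference steps" claim comes from MIM's parallel decoding: generation starts from a fully masked token grid, and each step commits the predictions the model is most confident about while re-masking the rest, following a cosine unmasking schedule. The sketch below illustrates that loop in pure Python with a dummy stand-in for the model's predictions; the function name, sequence length, vocabulary size, and schedule details are illustrative assumptions, not the paper's actual implementation.

```python
import math
import random

MASK = -1  # sentinel for a still-masked token slot

def mim_generate(seq_len=16, num_steps=4, vocab_size=1024, seed=0):
    """Toy MUSE-style parallel decoding loop (illustrative only).

    Starts fully masked and, at each step, unmasks the highest-confidence
    predictions, keeping a cosine-scheduled fraction of tokens masked.
    """
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(num_steps):
        # Dummy "model": a (token, confidence) guess for each masked slot.
        preds = {i: (rng.randrange(vocab_size), rng.random())
                 for i, t in enumerate(tokens) if t == MASK}
        # Cosine schedule: fraction of the grid still masked after this step.
        frac = math.cos(math.pi / 2 * (step + 1) / num_steps)
        keep_masked = int(frac * seq_len)
        # Commit the most confident predictions; the rest stay masked.
        ranked = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[: len(preds) - keep_masked]:
            tokens[i] = preds[i][0]
    return tokens
```

Because several tokens are committed per step, the full grid is filled in a handful of steps (here 4), versus the dozens of denoising steps a typical latent diffusion sampler uses.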
Models: https://hf.co/amused
