Papers
arxiv:2306.00637

Wuerstchen: Efficient Pretraining of Text-to-Image Models

Published on Jun 1, 2023
· Featured in Daily Papers on Jun 2, 2023

Abstract

We introduce Wuerstchen, a novel technique for text-to-image synthesis that unites competitive performance with unprecedented cost-effectiveness and ease of training on constrained hardware. Building on recent advancements in machine learning, our approach uses latent diffusion at strong latent image compression rates, significantly reducing the computational burden typically associated with state-of-the-art models while preserving, if not enhancing, the quality of generated images. Wuerstchen also achieves notable speed improvements at inference time, making real-time applications more viable. A key advantage of our method is its modest training requirement of only 9,200 GPU hours, slashing the usual costs significantly without compromising end performance. In comparisons against the state of the art, the approach proved strongly competitive. This paper opens the door to a new line of research that prioritizes both performance and computational accessibility, democratizing the use of sophisticated AI technologies. Through Wuerstchen, we demonstrate a compelling stride forward in text-to-image synthesis and an innovative path for future research.
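The abstract's claim about strong latent compression can be made concrete with some back-of-the-envelope arithmetic: per-step diffusion cost scales roughly with the number of latent elements the denoiser must process. The sketch below is illustrative only; the 8x/4-channel figures roughly match a Stable-Diffusion-style VAE, while the 42x/16-channel figures are assumed values for a strongly compressed latent in the spirit of the paper's claim, not measured benchmarks.

```python
# Back-of-the-envelope: per-step diffusion cost scales roughly with the
# number of latent elements the denoiser processes.
def latent_elements(height, width, compression, channels):
    """Number of latent elements for an image at a given spatial
    compression factor and latent channel count."""
    return (height // compression) * (width // compression) * channels

# Stable-Diffusion-style VAE (assumed): 8x spatial compression, 4 channels.
sd = latent_elements(1024, 1024, 8, 4)            # 128 * 128 * 4 = 65_536
# Strongly compressed prior latent (assumed): 42x compression, 16 channels.
wuerstchen = latent_elements(1024, 1024, 42, 16)  # 24 * 24 * 16 = 9_216

print(sd, wuerstchen, round(sd / wuerstchen, 1))  # 65536 9216 7.1
```

Under these assumed numbers, each denoising step operates on roughly 7x fewer latent elements, which is the mechanism behind the reduced training and inference cost the abstract describes.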

Community

@dome272, on Figure 6:

You are showing inference times for different batch sizes. Two questions:

  1. Which hardware did you use (GPU / CPU)?
  2. How does this compare to SD? How much faster or slower is Wuerstchen than SD?

Also, do we really need 60 sampling steps for the prior? If we could get this down to something like 20, this model would be super fast.

Paper author

Also, do we really need 60 sampling steps for the prior? If we could get this down to something like 20, this model would be super fast.

We are running some experiments specifically focused on reducing the required number of sampling steps. We have already improved the speed of stage B (the upsampler) quite a bit, and we're seeing whether the same approach can reduce the number of sampling steps of stage C (the text2img prior) 🤞

Paper author

@dome272, on Figure 6:

You are showing inference times for different batch sizes. Two questions:

  1. Which hardware did you use (GPU / CPU)?
  2. How does this compare to SD? How much faster or slower is Wuerstchen than SD?

Hey Patrick,

  1. It's an A100.
  2. The speed is similar to SD's, but there is probably a lot of room for optimization that could make this model extremely fast. We are working on it!

An immediate way to do that could be using torch.compile() and token merging. I know the latter might degrade visual quality, but a smaller merging ratio doesn't hurt much.

Brilliant idea!
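For readers unfamiliar with the token merging idea mentioned above: it speeds up attention by fusing near-duplicate tokens so the model processes a shorter sequence. Below is a minimal, stdlib-only sketch of that core idea; the actual method (ToMe) uses an efficient bipartite soft-matching scheme rather than this O(n²) greedy loop, so treat every name here as illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two (non-zero) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_tokens(tokens, ratio=0.5):
    """Greedily average the most similar token pairs until the sequence
    has shrunk by `ratio` -- a toy version of token merging, NOT the
    real ToMe algorithm."""
    tokens = [list(t) for t in tokens]
    target = max(1, int(len(tokens) * (1 - ratio)))
    while len(tokens) > target:
        # Find the most similar pair of tokens (O(n^2) for clarity).
        best, bi, bj = -2.0, 0, 1
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                s = cosine(tokens[i], tokens[j])
                if s > best:
                    best, bi, bj = s, i, j
        # Replace the pair with its element-wise mean.
        merged = [(x + y) / 2 for x, y in zip(tokens[bi], tokens[bj])]
        tokens = [t for k, t in enumerate(tokens) if k not in (bi, bj)]
        tokens.append(merged)
    return tokens

toks = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]]
print(len(merge_tokens(toks, ratio=0.5)))  # 2
```

Applied inside a diffusion model's attention blocks, a modest merge ratio shortens the token sequence (and thus attention cost) with limited visual impact, which is the trade-off described in the comment above.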


Models citing this paper 5


Datasets citing this paper 0

No datasets link to this paper.


Spaces citing this paper 30

Collections including this paper 5