Papers
arxiv:2306.00637

Wuerstchen: Efficient Pretraining of Text-to-Image Models

Published on Jun 1, 2023
· Featured in Daily Papers on Jun 2, 2023

Abstract

We introduce Wuerstchen, a novel technique for text-to-image synthesis that unites competitive performance with unprecedented cost-effectiveness and ease of training on constrained hardware. Building on recent advancements in machine learning, our approach uses latent diffusion at strong latent image compression rates, significantly reducing the computational burden typically associated with state-of-the-art models while preserving, if not enhancing, the quality of generated images. Wuerstchen also achieves notable speed improvements at inference time, making real-time applications more viable. A key advantage of our method is its modest training requirement of only 9,200 GPU hours, slashing the usual costs significantly without compromising end performance. In comparisons against the state of the art, the approach proved strongly competitive. This paper opens the door to a new line of research that prioritizes both performance and computational accessibility, democratizing the use of sophisticated AI technologies. Through Wuerstchen, we demonstrate a compelling stride forward in text-to-image synthesis and an innovative path for future research.
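The abstract's claim about strong latent compression can be made concrete with some back-of-the-envelope arithmetic: per-step diffusion cost scales roughly with the number of latent elements the denoiser must process. The sketch below is illustrative only; the 8x/4-channel figures roughly match a Stable-Diffusion-style VAE, while the 42x/16-channel figures are assumed values for a strongly compressed latent in the spirit of the paper's claim, not measured benchmarks.

```python
# Back-of-the-envelope: per-step diffusion cost scales roughly with the
# number of latent elements the denoiser processes.
def latent_elements(height, width, compression, channels):
    """Number of latent elements for an image at a given spatial
    compression factor and latent channel count."""
    return (height // compression) * (width // compression) * channels

# Stable-Diffusion-style VAE (assumed): 8x spatial compression, 4 channels.
sd = latent_elements(1024, 1024, 8, 4)            # 128 * 128 * 4 = 65_536
# Strongly compressed prior latent (assumed): 42x compression, 16 channels.
wuerstchen = latent_elements(1024, 1024, 42, 16)  # 24 * 24 * 16 = 9_216

print(sd, wuerstchen, round(sd / wuerstchen, 1))  # 65536 9216 7.1
```

Under these assumed numbers, each denoising step operates on roughly 7x fewer latent elements, which is the mechanism behind the reduced training and inference cost the abstract describes.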

Community

@dome272, on Figure 6:

You are showing inference times for different batch sizes. Two questions:

  1. Which hardware did you use (GPU / CPU)?
  2. How does this compare to SD? How much faster or slower is Wuerstchen than SD?

Also, do we really need 60 sampling steps for the prior? If we could get this down to something like 20, this model would be super fast.

Paper author

Also, do we really need 60 sampling steps for the prior? If we could get this down to something like 20, this model would be super fast.

We are running some experiments specifically focused on reducing the required number of sampling steps. We have already improved the speed of stage B (the upsampler) quite a bit, and we're seeing whether the same approach can reduce the number of sampling steps of stage C (the text2img prior) 🤞

Paper author

@dome272, on Figure 6:

You are showing inference times for different batch sizes. Two questions:

  1. Which hardware did you use (GPU / CPU)?
  2. How does this compare to SD? How much faster or slower is Wuerstchen than SD?

Hey Patrick,

  1. It's an A100.
  2. The speed is similar to SD's, but there is probably a lot of room for optimization that could make this model extremely fast. We are working on it!

An immediate way to do that could be using torch.compile() and token merging. I know the latter might degrade visual quality, but a smaller merging ratio doesn't hurt much.

Brilliant idea!
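For readers unfamiliar with the token merging idea mentioned above: it speeds up attention by fusing near-duplicate tokens so the model processes a shorter sequence. Below is a minimal, stdlib-only sketch of that core idea; the actual method (ToMe) uses an efficient bipartite soft-matching scheme rather than this O(n²) greedy loop, so treat every name here as illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two (non-zero) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_tokens(tokens, ratio=0.5):
    """Greedily average the most similar token pairs until the sequence
    has shrunk by `ratio` -- a toy version of token merging, NOT the
    real ToMe algorithm."""
    tokens = [list(t) for t in tokens]
    target = max(1, int(len(tokens) * (1 - ratio)))
    while len(tokens) > target:
        # Find the most similar pair of tokens (O(n^2) for clarity).
        best, bi, bj = -2.0, 0, 1
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                s = cosine(tokens[i], tokens[j])
                if s > best:
                    best, bi, bj = s, i, j
        # Replace the pair with its element-wise mean.
        merged = [(x + y) / 2 for x, y in zip(tokens[bi], tokens[bj])]
        tokens = [t for k, t in enumerate(tokens) if k not in (bi, bj)]
        tokens.append(merged)
    return tokens

toks = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]]
print(len(merge_tokens(toks, ratio=0.5)))  # 2
```

Applied inside a diffusion model's attention blocks, a modest merge ratio shortens the token sequence (and thus attention cost) with limited visual impact, which is the trade-off described in the comment above.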


Models citing this paper 5


Datasets citing this paper 0

No datasets link to this paper.


Spaces citing this paper 30

Collections including this paper 5