A way of improving compositionality?

#62
by lgleznah - opened

Hello everybody!

I've had the chance to fiddle a little bit with Stable Diffusion these days, and let me state the obvious: it is amazing!

However, I've read the Limitations section, and one point caught my attention: "The model does not perform well on more difficult tasks which involve compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere”". Indeed, in all the compositional scenes I tried to generate, the results were noticeably worse than those obtained with simpler scenes.

That said, I think there might be a way of improving this. As stated in the official repo, the model is conditioned on the (non-pooled) text embeddings of a CLIP ViT-L/14 text encoder.
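For concreteness, here is a minimal sketch of that conditioning signal, pulled with the Hugging Face transformers library. "Non-pooled" means the full per-token hidden-state sequence, not the single pooled vector CLIP itself uses for classification:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "A red cube on top of a blue sphere"
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,  # 77 tokens for this checkpoint
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    output = text_encoder(input_ids=tokens.input_ids)

# Non-pooled embeddings: one 768-d vector per token position.
# Stable Diffusion cross-attends to this whole sequence.
print(output.last_hidden_state.shape)  # torch.Size([1, 77, 768])
# The pooled output collapses the sequence to a single vector; this is
# what CLIP uses for zero-shot classification, not what SD conditions on.
print(output.pooler_output.shape)      # torch.Size([1, 768])
```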

My main concern is that this CLIP model may not be the optimal choice for conditioning Stable Diffusion. CLIP was originally trained for zero-shot image classification by learning to pair images with their descriptions, an objective that is perhaps not the best fit for text-to-image generation.

I strongly suspect we could get better compositionality by training/fine-tuning this CLIP model on a dataset focused on complex images whose captions spell out all the object relationships. Alternatively, we could modify the CLIP model itself so that it more accurately reflects scene composition.

TL;DR: I believe vanilla CLIP is not optimal for text-to-image generation, since it was not trained to capture scene composition specifically. Training/fine-tuning or modifying CLIP could both be good approaches to tackle this problem; a rough sketch of the fine-tuning route is below.
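Here is what I mean, assuming a hypothetical dataset of (image, relation-heavy caption) pairs; `CompositionalPairs` and its path are placeholders I made up, not real classes or data. It reuses CLIP's own symmetric contrastive loss via `return_loss=True`:

```python
import torch
from torch.utils.data import DataLoader
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Hypothetical dataset: yields dicts with a PIL image and a caption that
# spells out object relationships, e.g.
# {"image": ..., "caption": "a red cube on top of a blue sphere"}
dataset = CompositionalPairs("path/to/compositional_data")

def collate(batch):
    return processor(
        text=[ex["caption"] for ex in batch],
        images=[ex["image"] for ex in batch],
        padding=True,
        return_tensors="pt",
    )

loader = DataLoader(dataset, batch_size=64, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

model.train()
for batch in loader:
    # return_loss=True computes the symmetric image<->text contrastive loss.
    loss = model(**batch, return_loss=True).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Keeping the learning rate very small (or freezing the vision tower entirely) would probably be important here, to avoid catastrophically forgetting what CLIP already knows while nudging it toward compositional captions.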

If anybody thinks this is a good approach and wants to give it a go, I will be more than willing to lend a helping hand. Unfortunately, we don't have that many resources in my laboratory (just around 5 GPUs...).

Cheers everybody! I hope my comments were helpful 😇

I'm not well-informed in these matters, but I think that's basically a large part of what they did with Imagen, with positive results.
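For reference, Imagen conditions on a frozen T5-XXL language-model text encoder rather than CLIP, and the authors reported that scaling the text encoder mattered more than scaling the diffusion model itself. A minimal sketch of extracting comparable per-token embeddings from a T5 encoder (using the small t5-small checkpoint here purely to keep the example light):

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

# t5-small stands in for T5-XXL, which is far too large for a quick demo.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

tokens = tokenizer("A red cube on top of a blue sphere", return_tensors="pt")
with torch.no_grad():
    embeddings = encoder(input_ids=tokens.input_ids).last_hidden_state

# Per-token embeddings, analogous to CLIP's non-pooled output.
print(embeddings.shape)  # torch.Size([1, seq_len, d_model])
```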

Reduced compositionality, along with related limitations such as failing to reproduce legible text from the prompt (a known, if unreliable, capability in not-necessarily-larger models of this kind), also reduces the opportunity for abuse of a publicly available model, for good or ill.
