How do I introduce a new subject without extra training in Stable Diffusion?

#9760
by offchan - opened

Suppose I have a dataset of 1,000 Pokémon: 10 images of Pikachu, 10 images of Bulbasaur, and so on.
I also have metadata specifying the exact name of each Pokémon, so I know which image is Pikachu and which is not.
I want to fine-tune a Stable Diffusion model to draw Pokémon with a prompt like "a drawing of [name]", where [name] is the name of the Pokémon I want to draw. It should then be fine to draw any Pokémon with a well-known name in the dataset. I should probably even be able to draw Donald Trump in the style of a Pokémon, because the base model already knows about Donald Trump.
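
For context, this first part is the standard text-to-image fine-tuning recipe. Here is a minimal sketch of one training step, assuming a diffusers-style setup (the batch layout and variable names are illustrative, not any official script):

```python
import torch
import torch.nn.functional as F

def training_step(vae, text_encoder, unet, noise_scheduler, batch):
    # Encode images into the VAE latent space (0.18215 is SD 1.x's latent scale).
    latents = vae.encode(batch["pixel_values"]).latent_dist.sample() * 0.18215
    # Sample noise and a random timestep for each image.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    # batch["input_ids"] holds tokenized prompts like "a drawing of Pikachu".
    encoder_hidden_states = text_encoder(batch["input_ids"])[0]
    # The UNet predicts the added noise, conditioned on the text embeddings.
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
    return F.mse_loss(noise_pred, noise)
```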

The problem is when I want to draw a completely new, made-up Pokémon whose name the model doesn't know. Let's say my Pokémon is called "Megachu", which is basically a thick Pikachu with a red body and wings.
I want to introduce the model to Megachu by drawing Megachu myself and showing the image to the model somehow. There are common ways of doing this, such as DreamBooth, Textual Inversion, and DreamArtist, but they all require me to train the model, which takes a long time and is costly.

So what I want is to somehow feed the model Pokémon embedding vectors, so that it knows how to draw any Pokémon based on its embedding instead of its name.
Given a new Pokémon like Megachu, I want to just run the Megachu image through an embedding-extraction process and feed the resulting embedding to the model so that it can draw my Megachu.
I think it should be roughly similar to the training process for face embeddings.

I am very new to the Stable Diffusion architecture in general. Please suggest a way:

  1. to train embedding vectors for Pokémon from the 1,000 images (preferably using the existing Stable Diffusion weights to assist the process, so that it can be accurate with a small amount of data). Which model layer should I modify? Should I just represent this vector as tokens?
  2. to extract the Pokémon embedding from any image. Should the model that produces the embedding be separate from the diffusion model itself? Where should I inject the embedding into the diffusion model? (See the sketch after this list for one possible injection point.)
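
On the injection question: in Stable Diffusion 1.x, the UNet cross-attends to the text-encoder output, a `(batch, 77, 768)` tensor passed as `encoder_hidden_states`. Below is a hypothetical sketch of swapping in a CLIP image embedding at that point; the projection layer is an assumption and would need to be trained, so this is a design sketch rather than a working recipe:

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModelWithProjection, CLIPImageProcessor

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Map the pooled image embedding (768-d for ViT-L/14) into a short sequence
# the UNet's cross-attention can read, mimicking the text-encoder output
# shape (batch, 77, 768). This projection is hypothetical and untrained.
project = nn.Linear(768, 77 * 768)

def embed(pil_image):
    inputs = processor(images=pil_image, return_tensors="pt")
    image_embeds = image_encoder(**inputs).image_embeds   # (1, 768)
    return project(image_embeds).view(1, 77, 768)         # (1, 77, 768)

# During denoising, this would be passed in place of the text embeddings:
# noise_pred = unet(noisy_latents, t, encoder_hidden_states=embed(img)).sample
```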

I tried Stable Diffusion Image Variations, and it doesn't preserve the character. For example, if I give it Megachu, it will change my Pokémon's color, the wing shape, or the body thickness.
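
For reference, the Image Variations model (lambdalabs/sd-image-variations-diffusers) can be invoked roughly like this; it conditions the UNet on CLIP image embeddings rather than text, which is consistent with the identity drift described above (file names here are placeholders):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImageVariationPipeline

pipe = StableDiffusionImageVariationPipeline.from_pretrained(
    "lambdalabs/sd-image-variations-diffusers", revision="v2.0"
)
pipe = pipe.to("cuda")

image = Image.open("megachu.png").convert("RGB")
out = pipe(image, guidance_scale=3.0, num_inference_steps=50)
out.images[0].save("megachu_variation.png")
```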

This is my answer after grokking the Stable Diffusion architecture: just apply extra image conditioning, similar to Stable Diffusion Image Variations, but you might want to fine-tune the CLIP image encoder further on your custom Pokémon dataset by removing the text component and training with a contrastive loss, similar to face-recognition models.
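
A minimal sketch of that contrastive fine-tuning step, assuming each batch pairs two images of the same Pokémon (an InfoNCE-style loss, as used in face recognition and SimCLR; the data loader is a placeholder):

```python
import torch
import torch.nn.functional as F
from transformers import CLIPVisionModelWithProjection

encoder = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14")
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-5)

def info_nce(anchor, positive, temperature=0.07):
    # Row i of `anchor` and row i of `positive` embed the same Pokémon.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.T / temperature   # (B, B) similarity matrix
    labels = torch.arange(len(anchor), device=anchor.device)  # diagonal = match
    return F.cross_entropy(logits, labels)

for pixel_a, pixel_b in loader:  # two preprocessed views of the same Pokémon
    emb_a = encoder(pixel_values=pixel_a).image_embeds
    emb_b = encoder(pixel_values=pixel_b).image_embeds
    loss = info_nce(emb_a, emb_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```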

offchan changed discussion status to closed
