Expanding on the synthetic inpainting training dataset

#6
by Wakeme - opened

Hello, I like the model, but I think it still misses a few things. The first is the color issue, which was already discussed here, and I am not sure what causes the color to drift. I've also found that SD1.5, SD2, and SDXL inpainting can work on a region with enough surrounding inductive bias, but it is very hard when I want to inpaint something completely new (I have to resort to guiding it with another ControlNet). This doesn't seem to happen with Photoshop's generative fill (which is odd to me).

So I have a hypothesis. Based on your synthetic data example, the mask is overlaid on top of the original image, and the model then needs to fill in the details. However, I think the dataset would be greatly improved if it also contained images superimposed with a transparent (cut-out) subject. To clarify, using your method alone instills a bias. For example, if a model only ever sees a cow on green grass and an empty grass field, it might be able to inpaint a cow onto an empty grass field, or remove the cow from the scene. However, if you ask it to replace the cow with a fish, it might not be able to do it, because it associates the green background with a cow, not a fish. That makes the image much harder to guide.

So I propose a simple synthetic data recipe: assume we have a bunch of random cut-out subjects; we can then superimpose them on top of background images. For example, if I have the following image:

| | Text | Image |
| --- | --- | --- |
| Background | beautiful cozy oasis with a view overlooking a cyberpunk oasis at night, foggy outside, detailed aesthetic, high resolution, photorealistic, dark, gloomy and moody | (background image) |
| Foreground | Muted portrait photography from the waist up titled "male possum in a forest", muted palette, soft colors, detailed, 8k | (foreground image) |
| Superimposed | beautiful cozy oasis with a view overlooking a cyberpunk oasis at night, foggy outside, detailed aesthetic, high resolution, photorealistic, dark, gloomy and moody in the background, with Muted portrait photography from the waist up titled "male possum in a forest", muted palette, soft colors, detailed, 8k in front | (superimposed image) |
| Mask | Ditto | (mask image) |

For a low-effort composite, a foreground object can be grabbed directly from something like the COCO dataset, or any other semantic / panoptic segmentation dataset, as in the sketch below.
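A minimal sketch of that compositing step, assuming a local COCO download and pycocotools; the category, file paths, and output layout are illustrative placeholders, not part of the model's actual training pipeline:

```python
# Grab a segmented subject from COCO and superimpose it on an unrelated background,
# producing the (background, superimposed, mask) triplet described above.
import numpy as np
from PIL import Image
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2017.json")        # assumed local COCO annotations
cat_ids = coco.getCatIds(catNms=["cow"])                   # example category
img_info = coco.loadImgs(coco.getImgIds(catIds=cat_ids))[0]
ann = coco.loadAnns(coco.getAnnIds(imgIds=img_info["id"], catIds=cat_ids))[0]

fg = Image.open(f"train2017/{img_info['file_name']}").convert("RGB")
subject_mask = coco.annToMask(ann)                         # H x W binary mask of the subject

bg = Image.open("backgrounds/cyberpunk_oasis.png").convert("RGB").resize(fg.size)

# Keep the subject's pixels, keep the background everywhere else.
mask3 = np.repeat(subject_mask[:, :, None], 3, axis=2).astype(bool)
superimposed = np.where(mask3, np.array(fg), np.array(bg))

bg.save("background.png")                                             # clean background
Image.fromarray(superimposed.astype(np.uint8)).save("superimposed.png")  # target with subject
Image.fromarray((subject_mask * 255).astype(np.uint8)).save("mask.png")  # inpainting mask
```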

Interesting idea, but I am afraid that just pasting one image onto another might not give great results if the model learns to do exactly the same thing.
Another thing about the color issue: I don't think it is related to the prompt, since when training the ControlNet we use samples both with and without a prompt. I made several tests, and even with 80% of the dataset used without a prompt, the colors are not any more stable. The tile ControlNet suffers from the same issue (see https://huggingface.co/TTPlanet/TTPLanet_SDXL_Controlnet_Tile_Realistic, where being able to change the colors is presented as a feature :-) ).
I made a V2 (uploaded in the v2 folder). It has been trained on much more data, real photos and synthetic data, with many more training steps, and the result is always the same: color fidelity is never fully respected (I tried to enforce it by changing the loss function and other things, without success).
I think the color issue is directly related to the ControlNet architecture, and we cannot expect it to behave otherwise.
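For anyone who wants to reproduce the color-drift observation, a quick way to quantify it is to compare per-channel statistics of the original image and the generated result outside the masked region (file names below are placeholders):

```python
import numpy as np
from PIL import Image

orig = np.asarray(Image.open("original.png").convert("RGB"), dtype=np.float32)
result = np.asarray(Image.open("inpainted.png").convert("RGB"), dtype=np.float32)
mask = np.asarray(Image.open("mask.png").convert("L")) > 127  # True inside the inpainted area

keep = ~mask  # pixels the model was supposed to leave untouched
for c, name in enumerate("RGB"):
    shift = result[..., c][keep].mean() - orig[..., c][keep].mean()
    print(f"{name} mean shift outside the mask: {shift:+.2f}")
```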
One interesting thing for future experiments is starting the training from a tile ControlNet rather than from scratch. It is much more efficient, since the ControlNet already knows how to transfer the image from the ControlNet input to the output and only has to learn the inpainting part. Learning starts to show results after a few hundred steps, against a few thousand when learning from scratch. A rough sketch of this warm start is below.
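A minimal sketch of that warm start with diffusers, assuming the tile checkpoint linked above is available in diffusers format (the repo ids and base model are assumptions, not the exact setup used here):

```python
from diffusers import ControlNetModel, UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)

# From scratch: the ControlNet encoder is copied from the UNet weights, and
# passing the conditioning image through to the output must still be learned.
controlnet_scratch = ControlNetModel.from_unet(unet)

# Warm start: reuse a tile ControlNet that already maps its input image to the
# output, so only the inpainting behaviour remains to be learned.
controlnet_warm = ControlNetModel.from_pretrained(
    "TTPlanet/TTPLanet_SDXL_Controlnet_Tile_Realistic"
)
```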
