
License: https://huggingface.co/spaces/CompVis/stable-diffusion-license

EveryDream Discord: https://discord.gg/nrvPgh94cC

New v5.1 model

The new version is trained from a basis of the RunwayML 1.5 ckpt. This fine tuning sheds the last remnant of the original DreamBooth paper's approach: regularization via generated images is dropped in favor of mixing in a scrape of LAION to protect the model's original qualities instead. 1636 training images and 1636 ground truth images from LAION were trained for 19009 steps at LR 4e-7.
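For illustration, here is a minimal sketch of how such a 50/50 mix of fine-tuning data and scraped ground truth data might be fed to a trainer. This is my own assumption of the setup, not code from this project: `CaptionedFolder`, the caption-.txt convention, and the folder paths are placeholders.

```python
# Minimal sketch (assumptions: PyTorch, one caption .txt next to each image;
# "CaptionedFolder" and the paths are hypothetical, not this repo's code).
from pathlib import Path
from PIL import Image
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class CaptionedFolder(Dataset):
    def __init__(self, root: str):
        self.paths = sorted(Path(root).glob("*.png"))
    def __len__(self) -> int:
        return len(self.paths)
    def __getitem__(self, i):
        path = self.paths[i]
        caption = path.with_suffix(".txt").read_text().strip()
        return Image.open(path).convert("RGB"), caption

# Fine-tune images and LAION "ground truth" images are concatenated and
# shuffled together, so every batch mixes new concepts with original data.
mixed = ConcatDataset([CaptionedFolder("data/ff7r"), CaptionedFolder("data/laion")])
loader = DataLoader(mixed, batch_size=4, shuffle=True, collate_fn=lambda batch: batch)
```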

Results here (warning: huge image files):

- general model test
- new characters test

There is some remaining impact on cartoon characters, but there is little "bleed" of the video game context into non-video-game subjects. A number of images also show improved cropping behavior over even the base Runway 1.5 file, which I attribute to careful cropping of both the training images and the ground truth images scraped from LAION.

Prior info on 4.1 model

This is a finetuning of the CompVis Stable Diffusion 1.4 ckpt. https://huggingface.co/CompVis/stable-diffusion

As an extension of the concept of "dreambooth" training, this fine tuning includes over a dozen concepts trained across over 1400 images, with an individual caption on each image.

Data was collected from in-game screenshots of Final Fantasy 7 Remake. Annotations were created using BLIP interrogation, then replacing generic pronouns such as "a man" with "cloud strife" and so forth as appropriate. Further, rather painstaking, efforts were made to hand-tune the annotations to approximately reflect the new concepts present in any given image, such as "in the slums district of midgar city" or "holding a buster sword", which BLIP cannot detect because they are unknown to it.
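As an illustration only (the actual replacements were hand-tuned, not automated), a simple post-processing pass over BLIP captions might look like the following; the replacement table is hypothetical:

```python
# Hedged sketch of the caption clean-up step: swap BLIP's generic pronouns
# for character names. Illustrative only; real edits were done by hand.
import re

REPLACEMENTS = {
    r"\ba man\b": "cloud strife",
    r"\ba woman\b": "tifa lockhart",
}

def specialize(blip_caption: str) -> str:
    out = blip_caption
    for pattern, name in REPLACEMENTS.items():
        out = re.sub(pattern, name, out)
    return out

print(specialize("a man holding a sword in a city street"))
# -> "cloud strife holding a sword in a city street"
```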

The data set includes over 1400 images: 120-140 each for the main characters (cloud, tifa, barret, aerith), 90 for jessie rasberry, smaller numbers for side characters, 98 for buildings/scenery/indoor locations, 119 with 2+ characters in frame, and a smaller "grab bag" of objects such as cloud's buster sword or the game's food trucks.

The novelty here is moving past dreambooth or textual inversion, which are generally limited to a single "class".

Example new classes are "cloud strife", "tifa lockhart", "aerith gainsborough", "the streets of midgar city business district", along with many other characters and objects in the game.

Training was performed using Kane Wallmann's fork of Xavier Xiao's original implementation, which uses the filename of each training image as its caption, rather than a blanket "class" word.

https://github.com/kanewallmann/Dreambooth-Stable-Diffusion
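As a rough sketch of that convention (the authoritative logic lives in the fork linked above; the file names and the trailing "(n)" duplicate-suffix handling below are my assumptions), reading a caption back out of a filename could look like:

```python
# Sketch of the filename-as-caption convention (hypothetical file names;
# stripping a "(n)" suffix is an assumed convention for duplicate captions).
from pathlib import Path

def caption_from_filename(path: Path) -> str:
    # "cloud strife holding a buster sword (2).png"
    #   -> "cloud strife holding a buster sword"
    stem = path.stem
    if stem.endswith(")") and "(" in stem:  # drop a "(n)" duplicate marker, if any
        stem = stem[: stem.rfind("(")]
    return stem.strip()

for p in sorted(Path("training_images").glob("*.png")):
    print(caption_from_filename(p))
```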

Regularization sets such as "man" and "woman" were used for the male and female characters, and additional regularization sets were created for cities, buildings, and groups of people.

The training set includes screenshots of groups of characters, and compared to prior attempts these additional group images improve the ability to create group images at inference time. An example training caption would be "cloud strife and barret wallace standing in a garden with a waterfall in the background". The propensity for inference to generate mixed or muddied characters is lowered, and swaps of clothing and hair color are reduced.

The model does not turn everything into a Final Fantasy video game. "photo of tom cruise" does not look like a video game, or like any character in the training set, but "tom cruise on the rooftops of the midgar city slums" will generate a video-game-render-like image. Likewise, "photo of wall street, nyc" will not look like a video game unless also prompted as "photo of wall street, nyc, in the style of midgar city". While there is visible "damage" to the original model, it has lost minimal subjective quality, unlike other fine-tuned models, for instance those tuned so extensively to produce "anime style" images that they lose the ability to generate content you would expect from the original 1.4 checkpoint.

This is a step away from textual inversion and dreambooth and towards full fine tuning. Further research will be undertaken to replace regularization with images sourced from the original LAION dataset.

Supplementary information here: https://gist.github.com/victorchall/67bc53472f86641aef1ebee1e154f5d1

My "group" regularization set here: https://github.com/victorchall/dreambooth-group-regularization

The group set is meant to mirror the sort of content in the group training set, and was generated with a few different captions.
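A set like this can be generated with the diffusers library. This is a hedged sketch, not the exact script used; the captions, counts, and model id below are placeholders:

```python
# Hypothetical regularization-set generator (diffusers); captions and
# image counts are illustrative, not the ones used for the linked set.
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

captions = [
    "photo of a group of people standing together",
    "a group of friends posing for a picture on a city street",
]

os.makedirs("group_reg", exist_ok=True)
for c, caption in enumerate(captions):
    for i in range(100):  # images per caption
        image = pipe(caption, num_inference_steps=30).images[0]
        image.save(f"group_reg/caption{c}_{i:04d}.png")
```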
