SD 1.5 Big G (alpha)

This is a Stable Diffusion 1.5 model, but it uses the CLIP Big G text encoder instead of the original CLIP-L text encoder. This release is just a knowledge-transfer pre-train with the goal of preserving the model's current knowledge: it was trained only with student/teacher training from my SD 1.5 fine-tune, Objective Reality v2. To realize the full potential of the much larger text encoder, it would need to be further fine-tuned on a large dataset.

Examples

Coming soon

Usage

With diffusers, you can use it like any other Stable Diffusion model.

from diffusers import StableDiffusionPipeline
import torch

model_id = "ostris/sd15-big-g-alpha"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")

It will not work out of the box with ComfyUI or Auto1111; it would need special code to load. If there is any interest in this model, I may work on compatibility. Overall, it won't be hard to add, since the only architecture changes are the text encoder and the cross-attention weights.
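
To see the architecture change concretely, you can compare the text encoder's hidden size with the UNet's cross-attention dimension. This is just an illustrative sketch: it assumes the checkpoint loads as a standard diffusers pipeline, and the printed values are what CLIP Big G (1280) versus CLIP-L (768) would imply.

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("ostris/sd15-big-g-alpha")

# CLIP Big G produces wider text embeddings than CLIP-L (1280 vs. 768),
# so the UNet's cross-attention projections must be sized to match.
print(pipe.text_encoder.config.hidden_size)   # 1280 for Big G, 768 for CLIP-L
print(pipe.unet.config.cross_attention_dim)   # should equal the hidden size above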

Alpha

This is just a pre-trained alpha. Some concepts did not seem to transfer, and it really needs proper training on a large dataset. Anyone is welcome to take this task on; I do not plan to at this time.

Why make this?

In the words of George Mallory, "Because it's there."

Training Method

As mentioned above, it was trained using student/teacher training only. This was an iterative process over the course of a few months, and I did not keep track of all of the exact numbers. The following are best estimates.

The cross-attention layers were trained for 1-2 million steps with a batch size of 8 on a single 4090 GPU. Then the full UNet was trained for around 100k steps with the same settings.
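
For anyone who wants to pick up the training, the general shape of a student/teacher setup looks something like the sketch below. This is not the exact training code (which was not published): it assumes the teacher is the Objective Reality v2 checkpoint (path hypothetical), that both models share the same VAE and noise schedule, and that the loss is a simple MSE between the teacher's and student's noise predictions.

import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

# Hypothetical paths; the teacher is the original CLIP-L fine-tune,
# the student is this Big G model.
teacher = StableDiffusionPipeline.from_pretrained("path/to/objective-reality-v2")
student = StableDiffusionPipeline.from_pretrained("ostris/sd15-big-g-alpha")
teacher.unet.requires_grad_(False)
student.text_encoder.requires_grad_(False)

# Phase 1: train only the cross-attention ("attn2") projections,
# the layers that consume the new, wider text embeddings.
params = [p for n, p in student.unet.named_parameters() if "attn2" in n]
optimizer = torch.optim.AdamW(params, lr=1e-5)

prompt = ["a photo of an astronaut riding a horse on mars"]

def embed(pipe):
    # Encode the prompt with each model's own tokenizer and text encoder.
    ids = pipe.tokenizer(prompt, padding="max_length", max_length=77,
                         truncation=True, return_tensors="pt").input_ids
    return pipe.text_encoder(ids)[0]

with torch.no_grad():
    teacher_emb = embed(teacher)
student_emb = embed(student)

# One noisy latent / timestep pair; a real loop would sample these
# from images (or pure noise) via the scheduler.
noisy_latents = torch.randn(1, 4, 64, 64)
timesteps = torch.randint(0, 1000, (1,))

with torch.no_grad():
    target = teacher.unet(noisy_latents, timesteps,
                          encoder_hidden_states=teacher_emb).sample
pred = student.unet(noisy_latents, timesteps,
                    encoder_hidden_states=student_emb).sample

# Match the student's prediction to the teacher's.
loss = F.mse_loss(pred, target)
loss.backward()
optimizer.step()

The second phase described above would be the same loop with params = student.unet.parameters() instead of just the attn2 subset.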
