## Stable Diffusion Pipeline

This is probably the best end-to-end semi-technical article:

And a detailed look at the diffusion process:

But this is a short look at the pipeline (sketched in code below):

1. Encoder / Conditioning
   Text (via a tokenizer) or an image (via a vision model) is mapped to a semantic map (e.g. the CLIP text encoder)
2. Sampler
   Generates the noise that is the starting point to map to content, and schedules how it gets resolved (e.g. k_lms)
3. Diffuser
   Creates latent content based on the resolved noise + the semantic map (e.g. the actual Stable Diffusion checkpoint)
4. Autoencoder
   Maps between latent and pixel space, i.e. actually turns latent vectors into images (in Stable Diffusion this is a VAE trained on an image database with perceptual and adversarial losses, not a standalone GAN)
5. Denoising
   Gets a meaningful image out of the noise: the noise predicted by the diffuser (e.g. the U-Net) is subtracted from the latents a little at each step
6. Loop and repeat
   From step #3, with cross-attention blending the semantic map into each step
7. Run additional models as needed
   - Upscale (e.g. ESRGAN)
   - Restore faces (e.g. GFPGAN or CodeFormer)
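For orientation, steps 1 through 6 are bundled into a single object by Hugging Face's `diffusers` library. A minimal sketch; the checkpoint name is just one commonly used SD 1.x model, and the prompt is a placeholder:

```python
import torch
from diffusers import StableDiffusionPipeline

# Loads every stage at once: tokenizer, CLIP text encoder,
# scheduler (sampler), U-Net (diffuser), and VAE (autoencoder).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Steps 1-6 above all run inside this single call.
image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```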
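To see the stages individually, here is a rough deconstruction using the same library's components. Classifier-free guidance, device placement, and the final tensor-to-PIL conversion are omitted to keep it short, so treat this as an illustrative sketch rather than a drop-in implementation:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel
from diffusers import AutoencoderKL, UNet2DConditionModel, LMSDiscreteScheduler

repo = "runwayml/stable-diffusion-v1-5"  # assumes the standard SD repo layout
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
scheduler = LMSDiscreteScheduler.from_pretrained(repo, subfolder="scheduler")  # k_lms

# 1. Encoder / Conditioning: prompt -> semantic map (CLIP text embeddings)
ids = tokenizer(["a photo of a cat"], padding="max_length",
                max_length=tokenizer.model_max_length,
                return_tensors="pt").input_ids
with torch.no_grad():
    text_emb = text_encoder(ids)[0]

# 2. Sampler: random starting noise in latent space
#    (64x64 latents decode to 512x512 pixels)
scheduler.set_timesteps(50)
latents = torch.randn(1, unet.config.in_channels, 64, 64) * scheduler.init_noise_sigma

# 3/5/6. The loop: the U-Net predicts the noise in the current latents
# (cross-attending to the text embeddings), and the scheduler removes
# a little of that noise each iteration.
for t in scheduler.timesteps:
    inp = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(inp, t, encoder_hidden_states=text_emb).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# 4. Autoencoder: decode latents back to pixel space.
# 0.18215 is SD's latent scaling factor; output is a [-1, 1] tensor
# that still needs rescaling/conversion before saving.
with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample
```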
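And a rough sketch of step 7's face restoration using the GFPGAN project (`pip install gfpgan`). The constructor and `enhance` arguments follow that project's README, but the exact signature varies across versions, so treat it as an assumption to verify against your installed copy:

```python
import cv2
from gfpgan import GFPGANer

# Weights (e.g. GFPGANv1.4.pth) are downloaded separately from the GFPGAN releases.
restorer = GFPGANer(model_path="GFPGANv1.4.pth", upscale=2)

img = cv2.imread("astronaut.png")
# enhance() returns (cropped_faces, restored_faces, restored_img);
# paste_back=True re-inserts the restored faces into the full image.
_, _, restored = restorer.enhance(img, paste_back=True)
cv2.imwrite("astronaut_restored.png", restored)
```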