# ldt-100k_images

## Overview

These are latent diffusion transformer models trained from scratch on 100k 256x256 images. The checkpoint `278k-full_state_dict.pth` was trained for about 500 epochs and is heavily overfit on the 100k training images.

The 300k- and 395k-step checkpoints were then trained further on a 600k-image Midjourney dataset, for 9.4 epochs (300k steps) and 50 epochs (395k steps) respectively, at a constant learning rate of 5e-5. This additional training on the MJ dataset took ~8 hours on an RTX 4090 with batch size 256.
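The epoch counts above follow directly from the step counts, the 278k-step starting checkpoint, and the stated batch size. A quick sanity check of that arithmetic (assuming exactly 600k images and batch size 256, as stated):

```python
# Epochs of a dataset covered between two optimizer-step counts,
# assuming one optimizer step per batch.
def epochs_covered(start_step, end_step, dataset_size, batch_size):
    steps_per_epoch = dataset_size / batch_size
    return (end_step - start_step) / steps_per_epoch

mj_images = 600_000
batch = 256

# Fine-tuning resumed from the 278k-step checkpoint.
print(round(epochs_covered(278_000, 300_000, mj_images, batch), 1))  # 9.4
print(round(epochs_covered(278_000, 395_000, mj_images, batch), 1))  # 49.9 (~50)
```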

The models use the same configuration as in the Google Colab notebooks below: embed_dim=512, n_layers=8, 30,507,328 total parameters (~30M).
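As a back-of-the-envelope check on that parameter count, a standard pre-LN transformer block with a 4x MLP expansion (an assumption; the exact architecture is in the linked repo) accounts for most of the ~30.5M parameters, with the remainder in embeddings, conditioning, and output layers:

```python
# Approximate parameters in one standard transformer block of width d,
# assuming biases, two layernorms, and a 4x feed-forward expansion.
def block_params(d):
    attn = 3 * d * d + 3 * d       # q, k, v projections
    attn += d * d + d              # attention output projection
    mlp = d * (4 * d) + 4 * d      # feed-forward up-projection
    mlp += (4 * d) * d + d         # feed-forward down-projection
    norms = 2 * (2 * d)            # two layernorms (weight + bias)
    return attn + mlp + norms

d, n_layers = 512, 8
core = n_layers * block_params(d)
print(core)  # 25219072 — the blocks alone; embeddings etc. make up
             # the rest of the quoted 30,507,328
```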

## Run the Models in Colab

https://colab.research.google.com/drive/10yORcKXT40DLvZSceOJ1Hi5z_p5r-bOs?usp=sharing

## Colab Training Notebook

https://colab.research.google.com/drive/1sKk0usxEF4bmdCDcNQJQNMt4l9qBOeAM?usp=sharing

## GitHub Repo

The original training code is available at: https://github.com/apapiu/transformer_latent_diffusion

## Datasets Used

https://huggingface.co/apapiu/small_ldt/tree/main

## Other

See this Reddit post by u/spring_m (huggingface.co/apapiu) for more information: https://www.reddit.com/r/MachineLearning/comments/198eiv1/p_small_latent_diffusion_transformer_from_scratch/