
LavenderFlow 5.6B v0


✅ Latent, MMDiT, muP, CFM, FSDP, recaptioned, 768x768, T5

✅ No strings attached, completely-open-every-step-of-the-way

✅ Not SoTA 😅 (but not bad considering it was trained by one grad student in under 3 weeks of total development.) Severely undertrained!

How to use

Once the model is trained with more compute, I will probably put more effort into making this more accessible, but for now, have a look at this notebook. You will need to download the latest checkpoint via

wget https://huggingface.co/cloneofsimo/lavenderflow-5.6B/resolve/highres-model_49153/ema1.pt

and load the checkpoint ema1.pt there.
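
If you want to poke at the weights directly outside the notebook, here is a minimal loading sketch; note that `MMDiT` below is a placeholder for the model class defined in the notebook, not a class this repo ships under that name.

```python
import torch

# download the checkpoint first:
#   wget https://huggingface.co/cloneofsimo/lavenderflow-5.6B/resolve/highres-model_49153/ema1.pt

ckpt = torch.load("ema1.pt", map_location="cpu")  # EMA weights

# `MMDiT` is a placeholder for the model class defined in the notebook; build it
# with the same 5.6B hyperparameters used for training, then load the weights:
# model = MMDiT(...)
# model.load_state_dict(ckpt)
# model.eval()
```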

No paper, no blog, not raising cash: who am I, and why did I do this?

I just do things for fun, no particular reason, but...

Foundation models seem to belong only to companies. I tell you, while it is tempting to say "you can't do this, you need a team of engineers and researchers to make billion-scale foundation models," it is not the case: one bored grad student can very easily pull this off. Now you know.

Warning: This is an extremely undertrained model

This is a very early preview of my attempt to recreate SD3, trained on a single node with a batch size of 128, for only ~550k steps. Don't expect SoTA SD3, Midjourney, or DALL-E 3 quality images from this model. I am only one person with one 8xH100 node; they have clusters of GPUs.

What did you do and how long did it take?

  1. My first step was to implement everything in torch and check that it works on MNIST and CIFAR-10. This took about 8 hours, including reading papers, implementing DiT, etc.
  2. Next was to scale up. I used muTransfer (basically, I set up muP) to find the optimal learning rate; to do this I scaled up to a 1B model on the imagenet.int8 dataset. Seeing the basin alignment, I scaled up to 5.6B parameters. (See the muP sketch after this list.)
  3. Then I moved to T2I training. I collected the CapFusion dataset, which is a subset of LAION-COCO.
  4. I deduplicated with SSCD embeddings and FAISS, using the clustering method described in SD3. (Sketch below.)
  5. I cached T5 (the new T5-Large from EleutherAI) and SDXL-VAE embeddings for all cropped 256x256 images. (Sketch below.)
  6. For efficient sharding and to make better use of NFS, I put everything in MosaicML StreamingDataset (MDS) format, which was easy to use. (Sketch below.)
  7. I rewrote my training code with DeepSpeed, using the ZeRO stage-2 algorithm, which does not shard the weights since the model is small enough. I got an MFU of about 45 to 60% depending on the stage. (Config sketch below.)
  8. GPU GO BRR. There were some misconfigurations along the way, so I had to restart 3 times. Training ran for about 550k total steps (3 days total).
  9. Finally, I went through the same process all over again for the ye-pop dataset, which is a cleaned and recaptioned version of the laion-pop dataset, this time at 768x768 resolution.
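
For step 2, the point of muTransfer is that under muP the optimal learning rate found on a narrow proxy model carries over to the wide model, so the expensive 5.6B run never needs its own LR sweep. A toy sketch with the `mup` package; the two-layer model and the widths here are illustrative, not the actual DiT:

```python
import torch.nn as nn
from mup import MuReadout, MuAdamW, set_base_shapes

def make_model(width):
    # toy stand-in for the DiT backbone: only the hidden width changes, and
    # the output layer is a MuReadout so muP output scaling is applied to it
    return nn.Sequential(nn.Linear(32, width), nn.ReLU(), MuReadout(width, 10))

def mup_model(width):
    model = make_model(width)
    # base/delta models tell mup which dimensions grow with width
    set_base_shapes(model, make_model(64), delta=make_model(128))
    return model

# sweep learning rates on a narrow proxy and keep the best one...
proxy_opt = MuAdamW(mup_model(256).parameters(), lr=1e-2)
# ...then reuse that same lr on the wide model: under muP the optimum transfers
big_opt = MuAdamW(mup_model(4096).parameters(), lr=1e-2)
```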
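
For step 4, a rough sketch of the dedup pass: coarse k-means on SSCD descriptors, then near-duplicate removal within each cluster. The file name, cluster count, and the 0.85 cosine threshold are illustrative, not the values actually used:

```python
import numpy as np
import faiss

# SSCD descriptors, one 512-d vector per image; L2-normalize so inner product = cosine
sscd = np.load("sscd_embeddings.npy").astype("float32")
faiss.normalize_L2(sscd)
d = sscd.shape[1]

# coarse k-means clustering, as in the SD3 dedup recipe
kmeans = faiss.Kmeans(d, 4096, niter=20, seed=0)
kmeans.train(sscd)
_, cluster = kmeans.index.search(sscd, 1)
cluster = cluster.ravel()

keep = np.ones(len(sscd), dtype=bool)
for c in np.unique(cluster):
    ids = np.where(cluster == c)[0]
    index = faiss.IndexFlatIP(d)
    index.add(sscd[ids])
    sims, nbrs = index.search(sscd[ids], 2)   # nbrs[:, 0] is the point itself
    # drop an image if it is near-identical to an earlier image in the same cluster
    dup = (sims[:, 1] > 0.85) & (ids[nbrs[:, 1]] < ids)
    keep[ids[dup]] = False

print(f"keeping {keep.sum()} / {len(sscd)} images")
```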
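
For step 5, caching the text and latent embeddings once means neither the T5 encoder nor the VAE has to sit in GPU memory during diffusion training. A sketch along these lines with transformers/diffusers; the `EleutherAI/pile-t5-large` ID is my guess at "the new T5 from EleutherAI", and the sequence length and batch shapes are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModel
from diffusers import AutoencoderKL

T5_ID = "EleutherAI/pile-t5-large"   # assumption: the exact checkpoint may differ
device = "cuda"

tok = AutoTokenizer.from_pretrained(T5_ID)
t5 = AutoModel.from_pretrained(T5_ID, torch_dtype=torch.bfloat16).to(device).eval()
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").to(device).eval()

@torch.no_grad()
def encode_batch(captions, pixels):
    # pixels: (B, 3, 256, 256) float32 in [-1, 1]
    ids = tok(captions, padding="max_length", max_length=77,
              truncation=True, return_tensors="pt").to(device)
    text_emb = t5.encoder(input_ids=ids.input_ids,
                          attention_mask=ids.attention_mask).last_hidden_state
    latents = vae.encode(pixels.to(device)).latent_dist.sample()
    latents = latents * vae.config.scaling_factor   # (B, 4, 32, 32)
    return text_emb.cpu(), latents.cpu()
```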
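
For step 6, the cached embeddings go into MDS shards so each rank streams its own deterministic slice instead of hitting NFS with millions of small files. A sketch with mosaicml-streaming; the column names and the dummy sample generator are illustrative:

```python
import numpy as np
from streaming import MDSWriter, StreamingDataset

columns = {"caption_emb": "bytes", "vae_latent": "bytes", "caption": "str"}

def iterate_cached_samples():
    # placeholder: yields (text_emb, vae_latent, caption) from the caching step above
    for _ in range(4):
        yield np.random.randn(77, 1024), np.random.randn(4, 32, 32), "a photo of a cat"

# write: one shard directory that StreamingDataset can later pull from disk or S3
with MDSWriter(out="mds/train", columns=columns, compression="zstd") as writer:
    for text_emb, latent, caption in iterate_cached_samples():
        writer.write({
            "caption_emb": text_emb.astype(np.float16).tobytes(),
            "vae_latent": latent.astype(np.float16).tobytes(),
            "caption": caption,
        })

# read: every rank/worker gets a deterministic, shuffled shard assignment
dataset = StreamingDataset(local="/tmp/mds-cache", remote="mds/train",
                           shuffle=True, batch_size=128)
```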
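
For step 7, ZeRO stage 2 shards the optimizer states and gradients across the 8 GPUs while every GPU keeps a full copy of the weights, which fits at 5.6B. A minimal config sketch; the batch size, learning rate, and other numbers are illustrative, not the exact run config:

```python
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 16,     # 8 GPUs x 16 = global batch 128
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                           # shard optimizer states + gradients only
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4, "weight_decay": 0.01}},
}

# `model` is the MMDiT module; deepspeed builds the sharded optimizer from the config
# model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
# loss = model_engine(batch)
# model_engine.backward(loss)
# model_engine.step()
```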

Acknowledgement

Thank you to all the authors of HDiT for a bunch of pro tips and encouragement, especially @birchlabs. Thanks to 서승현 (Seunghyun Seo), my coworker at @naver, who also shares pro tips and studies DeepSpeed with me. Thanks to @imbue.ai, who provided most of the compute for this research project; thanks to them, this could be done completely free of cost! Thanks to the people at @fal.ai who provided compute and tips, especially @isidentical and @burkaygur!

What's the future?

@fal.ai has decided to provide more compute for this project in the future, so we decided to collaborate on something much bigger and much better. You will see the upcoming version of this project under a different name. In the near future, I also hope to work on other modalities, but there is a high chance I will do that at Naver, without much of it being open sourced.
