---
license: mit
datasets:
- ylecun/mnist
pipeline_tag: unconditional-image-generation
---

# DigitDreamer

**DigitDreamer** is a Latent Diffusion Model (LDM) trained with rectified flow and designed to generate high-fidelity MNIST digits. It combines an F-16 autoencoder (non-KL) with a DiT diffusion model. The autoencoder is trained with an additional GAN loss to improve reconstruction quality.

## Model Overview

DigitDreamer consists of two main components:

1. **Autoencoder**: Compresses MNIST digits into a latent representation and reconstructs them. It is trained with both a reconstruction loss and a GAN loss for sharper, more realistic detail.
2. **DiT Diffusion Model**: Generates digits in the autoencoder's latent space. Operating on compressed latents rather than pixels keeps the model efficient while preserving image fidelity.

The autoencoder's downsampling and upsampling layers are inspired by the _space-to-channel_ and _channel-to-space_ operations described in [Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models](https://arxiv.org/abs/2410.10733).

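
The two rearrangement operations can be sketched in a few lines of NumPy. This is a minimal illustration of the generic operations only, not DigitDreamer's actual layers: the function names, the block size `r = 2`, and the `(C, H, W)` layout are assumptions made for the example, and how the real autoencoder wires these operations into its blocks is not shown here.

```python
import numpy as np

def space_to_channel(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Downsample (C, H, W) -> (C*r*r, H/r, W/r) by moving each r x r block into channels."""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0, "H and W must be divisible by r"
    x = x.reshape(c, h // r, r, w // r, r)   # split H and W into (block index, offset)
    x = x.transpose(0, 2, 4, 1, 3)           # (C, r, r, H/r, W/r)
    return x.reshape(c * r * r, h // r, w // r)

def channel_to_space(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Upsample (C*r*r, H, W) -> (C, H*r, W*r); exact inverse of space_to_channel."""
    crr, h, w = x.shape
    assert crr % (r * r) == 0, "channel count must be divisible by r*r"
    c = crr // (r * r)
    x = x.reshape(c, r, r, h, w)             # (C, r, r, H, W)
    x = x.transpose(0, 3, 1, 4, 2)           # (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

# A single-channel 28x28 image (the MNIST resolution) as a stand-in input.
image = np.arange(28 * 28, dtype=np.float32).reshape(1, 28, 28)
packed = space_to_channel(image)             # shape (4, 14, 14)
restored = channel_to_space(packed)          # shape (1, 28, 28)
```

Both operations only move values around, so the round trip is lossless; any actual compression has to come from the learned layers around them.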
## Training Pipeline

### Autoencoder Training

- **Dataset**: The autoencoder was first trained for 20 epochs on an augmented MNIST dataset to expose it to a wide range of digit variations, then fine-tuned for 2 epochs on the original MNIST dataset to sharpen the reconstructions.
- **Losses**: The autoencoder was optimized with a combination of reconstruction loss and GAN loss, which yields more realistic and detailed reconstructions.

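
As a rough illustration of how a reconstruction term and a GAN term can be combined on the autoencoder side, here is a minimal NumPy sketch. Everything specific in it is an assumption, since this card does not state the details: the L1 reconstruction loss, the non-saturating GAN formulation, the `gan_weight` value, and the `d_logits_fake` input (a hypothetical discriminator's logits on the reconstructions).

```python
import numpy as np

def autoencoder_loss(x, x_rec, d_logits_fake, gan_weight=0.1):
    """Return (total, reconstruction, adversarial) loss for the autoencoder.

    x, x_rec:       original and reconstructed images, same shape
    d_logits_fake:  hypothetical discriminator logits for x_rec
    gan_weight:     assumed weighting of the adversarial term
    """
    # L1 reconstruction loss (assumed; MSE would be an equally plausible choice).
    rec = np.mean(np.abs(x - x_rec))
    # Non-saturating generator loss: -log(sigmoid(logit)) == softplus(-logit).
    adv = np.mean(np.logaddexp(0.0, -d_logits_fake))
    return rec + gan_weight * adv, rec, adv

# Toy usage with random stand-ins for a batch of images and discriminator logits.
rng = np.random.default_rng(0)
x = rng.uniform(size=(8, 1, 28, 28))
x_rec = np.clip(x + 0.05 * rng.standard_normal(x.shape), 0.0, 1.0)
total, rec, adv = autoencoder_loss(x, x_rec, d_logits_fake=rng.standard_normal(8))
```

The adversarial term shrinks as the discriminator scores reconstructions as more realistic, which is what pushes the decoder toward sharper output than a reconstruction loss alone.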
### DiT Diffusion Training

- **Dataset**: The DiT model was trained for 10 epochs on latent representations produced by the autoencoder, learning to generate coherent digit structures in latent space.
- **Architecture**: A standard DiT architecture, scaled down for efficiency while maintaining output quality.

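
For readers unfamiliar with rectified flow, the training objective can be sketched as follows. This is a generic NumPy illustration, not the project's training code: `velocity_fn` is a hypothetical stand-in for the DiT model, the latent shape `(4, 2, 2)` is made up, and the time convention (t = 0 is data, t = 1 is Gaussian noise) and uniform timestep sampling are assumptions.

```python
import numpy as np

def rectified_flow_loss(velocity_fn, x0, rng):
    """Monte Carlo estimate of the rectified flow loss for a batch of latents x0.

    Assumed convention: t = 0 is data, t = 1 is Gaussian noise.
    velocity_fn(x_t, t) is a hypothetical model returning an array shaped like x_t.
    """
    b = x0.shape[0]
    t = rng.uniform(size=b)                        # one timestep per sample
    t_exp = t.reshape(b, *([1] * (x0.ndim - 1)))   # broadcast over latent dims
    noise = rng.standard_normal(x0.shape)
    x_t = (1.0 - t_exp) * x0 + t_exp * noise       # point on the straight line data -> noise
    target = noise - x0                            # constant velocity along that line
    pred = velocity_fn(x_t, t)
    return np.mean((pred - target) ** 2)

# Toy usage: a "model" that always predicts zero velocity.
rng = np.random.default_rng(0)
latents = rng.standard_normal((8, 4, 2, 2))
loss = rectified_flow_loss(lambda x_t, t: np.zeros_like(x_t), latents, rng)
```

Because the interpolation path is a straight line, the regression target does not depend on `t`; the model only has to learn which direction leads from data to noise at each point.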
## Results

### Reconstruction Quality

The autoencoder reconstructs digits with high fidelity, retaining the essential features of the originals with few artifacts.

![Reconstruction](assets/reconstruction.png)

### Generated Samples

The DiT model generates realistic and varied latents, which the autoencoder decodes into MNIST digits.

![Generated Sample 1](assets/samples1.gif)

![Generated Sample 2](assets/samples2.gif)

![Generated Sample 3](assets/samples3.gif)

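
Sampling from a rectified flow model amounts to integrating the learned velocity field from noise to data. The sketch below uses plain Euler steps in NumPy; it is an assumption-laden illustration rather than the project's sampler. `velocity_fn` is a hypothetical stand-in for the trained DiT, the latent shape and step count are made up, and time is assumed to run from t = 1 (noise) to t = 0 (data).

```python
import numpy as np

def euler_sample(velocity_fn, shape, steps, rng):
    """Integrate dz/dt = v(z, t) from t = 1 (noise) down to t = 0 (data) with Euler steps.

    velocity_fn(z, t) is a hypothetical model; t has shape (batch,).
    """
    z = rng.standard_normal(shape)         # start from pure Gaussian noise
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = np.full(shape[0], i / steps)   # current time for every sample in the batch
        z = z - dt * velocity_fn(z, t)     # step toward the data end of the path
    return z

# Toy usage: a velocity field whose straight-line paths all end at one fixed latent.
target = np.ones((4, 2, 2))

def toy_velocity(z, t):
    return (z - target) / t.reshape(-1, 1, 1, 1)

latents = euler_sample(toy_velocity, shape=(8, 4, 2, 2), steps=10,
                       rng=np.random.default_rng(0))
```

In a latent diffusion setup, the resulting latents would then be passed through the autoencoder's decoder to obtain images.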
## References

- _Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models_ [arXiv:2410.10733](https://arxiv.org/abs/2410.10733)
- _High-Resolution Image Synthesis with Latent Diffusion Models_ [arXiv:2112.10752](https://arxiv.org/abs/2112.10752)
- _Scalable Diffusion Models with Transformers_ [arXiv:2212.09748](https://arxiv.org/abs/2212.09748)
- _minRF_ [GitHub](https://github.com/cloneofsimo/minRF)