---
license: other
datasets:
- Mitsua/vroid-image-dataset-lite
pipeline_tag: text-to-image
---

# Model Card for VRoid Diffusion

This is a latent text-to-image diffusion model that demonstrates how U-Net training affects the generated images.

- Text Encoder is from [OpenCLIP ViT-H/14](https://github.com/mlfoundations/open_clip), MIT License, Training Data: LAION-2B
- VAE is from [Mitsua Diffusion One](https://huggingface.co/Mitsua/mitsua-diffusion-one), Mitsua Open RAIL-M License, Training Data: Public Domain/CC0 + Licensed
- U-Net is trained from scratch on the full version of [VRoid Image Dataset Lite](https://huggingface.co/datasets/Mitsua/vroid-image-dataset-lite) with some modifications.
- VRoid is a trademark or registered trademark of Pixiv Inc. in Japan and other regions.

## Model Details

- `vroid_diffusion_test.safetensors`
  - Base variant.
- `vroid_diffusion_test_invert_red_blue.safetensors`
  - `red` and `blue` in the captions are swapped.
  - `pink` and `skyblue` in the captions are swapped.
- `vroid_diffusion_test_monochrome.safetensors`
  - All training images are converted to grayscale.

## Model Variant

- [VRoid Diffusion Unconditional](https://huggingface.co/Mitsua/vroid-diffusion-test-unconditional)
  - An unconditional image generator without CLIP.

### Model Description

- **Developed by:** Abstract Engine.
- **License:** Mitsua Open RAIL-M License.

## Uses

### Direct Use

Text-to-image generation for research and educational purposes.

### Out-of-Scope Use

Any deployed use case of the model.

## Training Details

- Trained resolution: 256x256
- Batch size: 48
- Steps: 45k
- LR: 1e-5 with 1000 warmup steps

### Training Data

We use the full version of [VRoid Image Dataset Lite](https://huggingface.co/datasets/Mitsua/vroid-image-dataset-lite) with some modifications.
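
## Usage Example

A minimal inference sketch using the diffusers `StableDiffusionPipeline` is shown below. The repo id `Mitsua/vroid-diffusion-test`, the prompt, and the availability of diffusers-format weights are assumptions for illustration; adapt them to the checkpoint variant you actually load.

```python
# Hypothetical usage sketch: assumes diffusers-compatible weights are published
# under the repo id "Mitsua/vroid-diffusion-test" (adjust to the variant you test).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "Mitsua/vroid-diffusion-test",  # hypothetical repo id
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe(
    "a girl with red hair and blue eyes",  # example caption in the dataset's style
    height=256, width=256,                 # the U-Net was trained at 256x256
    num_inference_steps=30,
).images[0]
image.save("vroid_sample.png")
```

Because the U-Net was trained only at 256x256, generating at other resolutions is likely to degrade output quality.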