---
title: Demo Stable Diffusion v1.4
emoji: 🤗
colorFrom: yellow
colorTo: orange
sdk: gradio
app_file: gradio_app.py
pinned: false
---

# Task 1: Choosing a model

# Chosen model: Stable Diffusion text-to-image fine-tuning

The `train_text_to_image.py` script shows how to fine-tune the Stable Diffusion model on your own dataset.

### How to install the code requirements

First, clone the repo, create a conda environment from the `env.yaml` file, and activate it:

```bash
git clone https://github.com/hoangkimthuc/diffusers.git
cd diffusers/examples/text_to_image
conda env create -f env.yaml
conda activate stable_diffusion
```

Before running the scripts, make sure to install the library's training dependencies.

**Important**

To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date, as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:

```bash
cd diffusers
pip install .
```

Then change into the `diffusers/examples/text_to_image` folder and run:

```bash
pip install -r requirements.txt
```

And initialize an [🤗 Accelerate](https://github.com/huggingface/accelerate/) environment with:

```bash
accelerate config
```

### Steps to run the training

You need to accept the model license before downloading or using the weights. In this example we'll use model version `v1-4`, so you'll need to visit [its card](https://huggingface.co/CompVis/stable-diffusion-v1-4), read the license, and tick the checkbox if you agree. You have to be a registered user on the 🤗 Hugging Face Hub, and you'll also need an access token for the code to work. For more information on access tokens, please refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).

Run the following command to authenticate your token:

```bash
huggingface-cli login
```

If you have already cloned the repo, then you won't need to go through these steps.
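If you prefer to authenticate from Python rather than the CLI, the `huggingface_hub` library provides an equivalent `login()` helper. A minimal sketch (illustrative only, not part of the training script):

```python
from huggingface_hub import login

# Prompts for your Hugging Face access token and caches it locally,
# equivalent to running `huggingface-cli login` in a terminal.
# You can also pass the token directly, e.g. login(token="hf_...").
login()
```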
#### Hardware

With `gradient_checkpointing` and `mixed_precision`, it should be possible to fine-tune the model on a single 24GB GPU. For a higher `batch_size` and faster training, it's better to use GPUs with more than 30GB of memory.

**Note: Change the `resolution` to 768 if you are using the [stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) 768x768 model.**

Launch the training with:

```bash
bash train.sh
```

### Sample input/output after training

Once the training is finished, the model will be saved in the `output_dir` specified in the command; in this example it's `sd-pokemon-model`. To load the fine-tuned model for inference, just pass that path to `StableDiffusionPipeline`:

```python
import torch
from diffusers import StableDiffusionPipeline

model_path = "sd-pokemon-model"
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe.to("cuda")

image = pipe(prompt="yoda").images[0]
image.save("yoda-pokemon.png")
```

The output for the prompt "yoda" is saved in the `yoda-pokemon.png` image file.

### Name and link to the training dataset

Dataset name: pokemon-blip-captions

Dataset link: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions

### The number of model parameters to determine the model's complexity

Note: the CLIPTextModel (text conditioning model) and the AutoencoderKL (the VAE that decodes latents into images) are frozen; only the UNet (the diffusion model) is trained.

The number of trainable parameters in the script is 859,520,964.

To get this number, you can insert a `breakpoint()` call at line 813 of the `train_text_to_image.py` file and then run `train.sh`. Once the pdb session stops at that line, check the number of parameters with `p unet.num_parameters()`.

### The model evaluation metric (CLIP score)

The CLIP score measures how well the generated images match their prompts.

Validation prompts used to calculate the CLIP scores:

```python
prompts = [
    "a photo of an astronaut riding a horse on mars",
    "A high tech solarpunk utopia in the Amazon rainforest",
    "A pikachu fine dining with a view to the Eiffel Tower",
    "A mecha robot in a favela in expressionist style",
    "an insect robot preparing a delicious meal",
    "A small cabin on top of a snowy mountain in the style of Disney, artstation",
]
```

To calculate the CLIP score for the above prompts, run:

```bash
python metrics.py
```

### Link to the trained model

https://drive.google.com/file/d/1xzVUO0nZn-0oaJgHOWjrYKHmGUlsoJ1g/view?usp=sharing

### Modifications made to the original code

- Add metrics and gradio_app scripts
- Remove redundant code
- Add a training bash script
- Improve the README
- Add a conda `env.yaml` file and more dependencies for the web app

# Task 2: Using the model in a web application

To create the web application, the fine-tuned model is served through a Gradio interface defined in `gradio_app.py` (the Space's `app_file`); a minimal sketch of such an app is shown below.
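The sketch below is illustrative only and assumes the fine-tuned weights live in `sd-pokemon-model`; the interface layout and function names are assumptions, so see `gradio_app.py` for the actual implementation.

```python
import gradio as gr
import torch
from diffusers import StableDiffusionPipeline

# Path to the fine-tuned weights (assumed to match the training output_dir).
model_path = "sd-pokemon-model"
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe.to("cuda")

def generate(prompt: str):
    # Run a single text-to-image generation and return a PIL image.
    return pipe(prompt=prompt).images[0]

demo = gr.Interface(
    fn=generate,
    inputs=gr.Textbox(label="Prompt"),
    outputs=gr.Image(type="pil", label="Generated image"),
    title="Demo Stable Diffusion v1.4",
)

if __name__ == "__main__":
    demo.launch()
```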