# Task 1: Choosing model
# Chosen model: Stable Diffusion text-to-image fine-tuning
The `train_text_to_image.py` script shows how to fine-tune a Stable Diffusion model on your own dataset.
### How to install the code requirements
First, clone the repo, then create a conda env from the `env.yaml` file and activate it:
```bash
git clone https://github.com/hoangkimthuc/diffusers.git
cd diffusers/examples/text_to_image
conda env create -f env.yaml
conda activate stable_diffusion
```
Before running the scripts, make sure to install the library's training dependencies:
**Important**
To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
```bash
cd diffusers
pip install .
```
Then cd into the `diffusers/examples/text_to_image` folder and run:
```bash
pip install -r requirements.txt
```
And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:
```bash
accelerate config
```
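As a quick sanity check (a minimal sketch, not part of the repo), you can verify that the core libraries import and that a GPU is visible:
```python
import torch
import diffusers

# Print library versions and confirm CUDA is available for training
print(f"diffusers version: {diffusers.__version__}")
print(f"torch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```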
### Steps to run the training.
You need to accept the model license before downloading or using the weights. In this example we'll use model version `v1-4`, so you'll need to visit [its card](https://huggingface.co/CompVis/stable-diffusion-v1-4), read the license and tick the checkbox if you agree.
You have to be a registered user in 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work. For more information on access tokens, please refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).
Run the following command to authenticate with your token:
```bash
huggingface-cli login
```
If you have already cloned the repo, then you won't need to go through these steps.
<br>
#### Hardware
With `gradient_checkpointing` and `mixed_precision` it should be possible to fine-tune the model on a single 24 GB GPU. For a higher `batch_size` and faster training it's better to use GPUs with more than 30 GB of memory.
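If you are unsure how much memory your GPU has, a quick check like the following sketch (assuming a CUDA-capable machine) will tell you:
```python
import torch

# Report the total memory of the first CUDA device in GiB
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA device visible")
```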
**___Note: Change the `resolution` to 768 if you are using the [stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) 768x768 model.___**
To launch the fine-tuning, run the provided training script:
```bash
bash train.sh
```
### Sample input/output after training
Once the training is finished, the model will be saved in the `output_dir` specified in the command; in this example it's `sd-pokemon-model`. To load the fine-tuned model for inference, just pass that path to `StableDiffusionPipeline`:
```python
import torch
from diffusers import StableDiffusionPipeline

model_path = "sd-pokemon-model"
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe.to("cuda")
image = pipe(prompt="yoda").images[0]
image.save("yoda-pokemon.png")
```
The output with the prompt "yoda" is saved in the `yoda-pokemon.png` image file.
### Name and link to the training dataset.
Dataset name: pokemon-blip-captions
Dataset link: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions
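For a quick look at the data, here is a minimal sketch (assuming the 🤗 `datasets` library from the training requirements is installed):
```python
from datasets import load_dataset

# Each example pairs a Pokémon image with a BLIP-generated caption
dataset = load_dataset("lambdalabs/pokemon-blip-captions", split="train")
print(dataset)             # columns: image, text
print(dataset[0]["text"])  # a short caption describing the first Pokémon
```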
### The number of model parameters to determine the model’s complexity.
Note: the CLIPTextModel (text conditioning model) and the AutoencoderKL (the VAE that decodes latents into images) are frozen; only the UNet (the diffusion model) is trained.
The number of trainable parameters in the script: 859_520_964
To get this number, you can set a breakpoint by calling `breakpoint()` at line 813 of the `train_text_to_image.py` file and then run `train.sh`. Once the pdb session stops at that line, you can check the model's parameter count with `p unet.num_parameters()`.
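Alternatively, the same count can be reproduced without a debugger; the sketch below (assuming access to the `v1-4` weights) loads only the UNet and prints its parameter count:
```python
from diffusers import UNet2DConditionModel

# Load only the UNet, i.e. the trainable part; the text encoder and VAE stay frozen
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
print(unet.num_parameters())  # 859_520_964
```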
### The model evaluation metric (CLIP score)
CLIP score measures how well the generated images match their text prompts; higher scores indicate better image-text alignment.
Validation prompts used to calculate the CLIP score:
```python
prompts = [
    "a photo of an astronaut riding a horse on mars",
    "A high tech solarpunk utopia in the Amazon rainforest",
    "A pikachu fine dining with a view to the Eiffel Tower",
    "A mecha robot in a favela in expressionist style",
    "an insect robot preparing a delicious meal",
    "A small cabin on top of a snowy mountain in the style of Disney, artstation",
]
```
To calculate the CLIP score for the above prompts, run:
```bash
python metrics.py
```
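The exact contents of `metrics.py` are not reproduced here, but the core of the computation can be sketched as follows (assuming `torchmetrics` is installed and `images` is a NumPy batch of generated images in `[0, 1]`):
```python
from functools import partial

import torch
from torchmetrics.functional.multimodal import clip_score

# CLIP score between generated images and their prompts, using a CLIP ViT-B/16 backbone
clip_score_fn = partial(clip_score, model_name_or_path="openai/clip-vit-base-patch16")

def calculate_clip_score(images, prompts):
    # images: NumPy array of shape (N, H, W, C) with float values in [0, 1]
    images_int = (images * 255).astype("uint8")
    score = clip_score_fn(torch.from_numpy(images_int).permute(0, 3, 1, 2), prompts).detach()
    return round(float(score), 4)
```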
### Link to the trained model
https://drive.google.com/file/d/1xzVUO0nZn-0oaJgHOWjrYKHmGUlsoJ1g/view?usp=sharing
### Modifications made to the original code
- Add the metrics and gradio_app scripts
- Remove redundant code
- Add a training bash script
- Improve the README
- Add a conda `env.yaml` file and additional dependencies for the web app
# Task 2: Using the model in a web application
To create