<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# InstructPix2Pix
[InstructPix2Pix](https://huggingface.co/papers/2211.09800) is a method for fine-tuning text-conditioned diffusion models so that they can follow an edit instruction for an input image. Models fine-tuned with this method take the following as inputs:
<p align="center">
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/edit-instruction.png" alt="instructpix2pix-inputs" width=600/>
</p>
์ถœ๋ ฅ์€ ์ž…๋ ฅ ์ด๋ฏธ์ง€์— ํŽธ์ง‘ ์ง€์‹œ๊ฐ€ ๋ฐ˜์˜๋œ "์ˆ˜์ •๋œ" ์ด๋ฏธ์ง€์ž…๋‹ˆ๋‹ค:
<p align="center">
<img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/output-gs%407-igs%401-steps%4050.png" alt="instructpix2pix-output" width=600/>
</p>
`train_instruct_pix2pix.py` ์Šคํฌ๋ฆฝํŠธ([์—ฌ๊ธฐ](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix.py)์—์„œ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.)๋Š” ํ•™์Šต ์ ˆ์ฐจ๋ฅผ ์„ค๋ช…ํ•˜๊ณ  Stable Diffusion์— ์ ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.
*** `train_instruct_pix2pix.py`๋Š” [์›๋ž˜ ๊ตฌํ˜„](https://github.com/timothybrooks/instruct-pix2pix)์— ์ถฉ์‹คํ•˜๋ฉด์„œ InstructPix2Pix ํ•™์Šต ์ ˆ์ฐจ๋ฅผ ๊ตฌํ˜„ํ•˜๊ณ  ์žˆ์ง€๋งŒ, [์†Œ๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples)์—์„œ๋งŒ ํ…Œ์ŠคํŠธ๋ฅผ ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ์ตœ์ข… ๊ฒฐ๊ณผ์— ์˜ํ–ฅ์„ ๋ผ์น  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋” ๋‚˜์€ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•ด, ๋” ํฐ ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋” ๊ธธ๊ฒŒ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค. [์—ฌ๊ธฐ](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered)์—์„œ InstructPix2Pix ํ•™์Šต์„ ์œ„ํ•ด ํฐ ๋ฐ์ดํ„ฐ์…‹์„ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
***
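Before launching a run, it can help to look at what a training pair contains. Below is a minimal sketch using 🤗 Datasets; the column names `input_image`, `edit_prompt`, and `edited_image` match the training script's defaults, but verify them against the dataset card for the dataset you actually use.
```python
from datasets import load_dataset

# Load the small dataset used in this guide.
dataset = load_dataset("fusing/instructpix2pix-1000-samples", split="train")
print(dataset)

# Inspect one training pair: an original image, an edit instruction, and the edited image.
sample = dataset[0]
print(sample["edit_prompt"])
sample["input_image"].save("sample_input.png")
sample["edited_image"].save("sample_edited.png")
```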
## PyTorch๋กœ ๋กœ์ปฌ์—์„œ ์‹คํ–‰ํ•˜๊ธฐ
### ์ข…์†์„ฑ(dependencies) ์„ค์น˜ํ•˜๊ธฐ
์ด ์Šคํฌ๋ฆฝํŠธ๋ฅผ ์‹คํ–‰ํ•˜๊ธฐ ์ „์—, ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ ํ•™์Šต ์ข…์†์„ฑ์„ ์„ค์น˜ํ•˜์„ธ์š”:
**์ค‘์š”**
์ตœ์‹  ๋ฒ„์ „์˜ ์˜ˆ์ œ ์Šคํฌ๋ฆฝํŠธ๋ฅผ ์„ฑ๊ณต์ ์œผ๋กœ ์‹คํ–‰ํ•˜๊ธฐ ์œ„ํ•ด, **์›๋ณธ์œผ๋กœ๋ถ€ํ„ฐ ์„ค์น˜**ํ•˜๋Š” ๊ฒƒ๊ณผ ์˜ˆ์ œ ์Šคํฌ๋ฆฝํŠธ๋ฅผ ์ž์ฃผ ์—…๋ฐ์ดํŠธํ•˜๊ณ  ์˜ˆ์ œ๋ณ„ ์š”๊ตฌ์‚ฌํ•ญ์„ ์„ค์น˜ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ตœ์‹  ์ƒํƒœ๋กœ ์œ ์ง€ํ•˜๋Š” ๊ฒƒ์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด, ์ƒˆ๋กœ์šด ๊ฐ€์ƒ ํ™˜๊ฒฝ์—์„œ ๋‹ค์Œ ์Šคํ…์„ ์‹คํ–‰ํ•˜์„ธ์š”:
```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install -e .
```
cd ๋ช…๋ น์–ด๋กœ ์˜ˆ์ œ ํด๋”๋กœ ์ด๋™ํ•˜์„ธ์š”.
```bash
cd examples/instruct_pix2pix
```
์ด์ œ ์‹คํ–‰ํ•˜์„ธ์š”.
```bash
pip install -r requirements.txt
```
๊ทธ๋ฆฌ๊ณ  [๐Ÿค—Accelerate](https://github.com/huggingface/accelerate/) ํ™˜๊ฒฝ์—์„œ ์ดˆ๊ธฐํ™”ํ•˜์„ธ์š”:
```bash
accelerate config
```
ํ˜น์€ ํ™˜๊ฒฝ์— ๋Œ€ํ•œ ์งˆ๋ฌธ ์—†์ด ๊ธฐ๋ณธ์ ์ธ accelerate ๊ตฌ์„ฑ์„ ์‚ฌ์šฉํ•˜๋ ค๋ฉด ๋‹ค์Œ์„ ์‹คํ–‰ํ•˜์„ธ์š”.
```bash
accelerate config default
```
ํ˜น์€ ์‚ฌ์šฉ ์ค‘์ธ ํ™˜๊ฒฝ์ด notebook๊ณผ ๊ฐ™์€ ๋Œ€ํ™”ํ˜• ์‰˜์€ ์ง€์›ํ•˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ๋Š” ๋‹ค์Œ ์ ˆ์ฐจ๋ฅผ ๋”ฐ๋ผ์ฃผ์„ธ์š”.
```python
from accelerate.utils import write_basic_config
write_basic_config()
```
### ์˜ˆ์‹œ
์ด์ „์— ์–ธ๊ธ‰ํ–ˆ๋“ฏ์ด, ํ•™์Šต์„ ์œ„ํ•ด [์ž‘์€ ๋ฐ์ดํ„ฐ์…‹](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples)์„ ์‚ฌ์šฉํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๊ทธ ๋ฐ์ดํ„ฐ์…‹์€ InstructPix2Pix ๋…ผ๋ฌธ์—์„œ ์‚ฌ์šฉ๋œ [์›๋ž˜์˜ ๋ฐ์ดํ„ฐ์…‹](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered)๋ณด๋‹ค ์ž‘์€ ๋ฒ„์ „์ž…๋‹ˆ๋‹ค. ์ž์‹ ์˜ ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด, [ํ•™์Šต์„ ์œ„ํ•œ ๋ฐ์ดํ„ฐ์…‹ ๋งŒ๋“ค๊ธฐ](create_dataset) ๊ฐ€์ด๋“œ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.
`MODEL_NAME` ํ™˜๊ฒฝ ๋ณ€์ˆ˜(ํ—ˆ๋ธŒ ๋ชจ๋ธ ๋ ˆํฌ์ง€ํ† ๋ฆฌ ๋˜๋Š” ๋ชจ๋ธ ๊ฐ€์ค‘์น˜๊ฐ€ ํฌํ•จ๋œ ํด๋” ๊ฒฝ๋กœ)๋ฅผ ์ง€์ •ํ•˜๊ณ  [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) ์ธ์ˆ˜์— ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค. `DATASET_ID`์— ๋ฐ์ดํ„ฐ์…‹ ์ด๋ฆ„์„ ์ง€์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:
```bash
export MODEL_NAME="stable-diffusion-v1-5/stable-diffusion-v1-5"
export DATASET_ID="fusing/instructpix2pix-1000-samples"
```
์ง€๊ธˆ, ํ•™์Šต์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šคํฌ๋ฆฝํŠธ๋Š” ๋ ˆํฌ์ง€ํ† ๋ฆฌ์˜ ํ•˜์œ„ ํด๋”์˜ ๋ชจ๋“  ๊ตฌ์„ฑ์š”์†Œ(`feature_extractor`, `scheduler`, `text_encoder`, `unet` ๋“ฑ)๋ฅผ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.
```bash
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_ID \
--enable_xformers_memory_efficient_attention \
--resolution=256 --random_flip \
--train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
--max_train_steps=15000 \
--checkpointing_steps=5000 --checkpoints_total_limit=1 \
--learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
--conditioning_dropout_prob=0.05 \
--mixed_precision=fp16 \
--seed=42 \
--push_to_hub
```
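Because the script stores the full pipeline layout described above, individual components can later be reloaded from their subfolders. A minimal sketch (the repository id is a placeholder for your own trained model):
```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline, UNet2DConditionModel

model_id = "your-username/instruct-pix2pix-model"  # placeholder for your trained repository

# Load only the fine-tuned UNet from its subfolder...
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet", torch_dtype=torch.float16)

# ...or load the whole pipeline, which picks up every saved component
# (feature_extractor, scheduler, text_encoder, unet, vae, ...).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
```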
์ถ”๊ฐ€์ ์œผ๋กœ, ๊ฐ€์ค‘์น˜์™€ ๋ฐ”์ด์–ด์Šค๋ฅผ ํ•™์Šต ๊ณผ์ •์— ๋ชจ๋‹ˆํ„ฐ๋งํ•˜์—ฌ ๊ฒ€์ฆ ์ถ”๋ก ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค. `report_to="wandb"`์™€ ์ด ๊ธฐ๋Šฅ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:
```bash
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_ID \
--enable_xformers_memory_efficient_attention \
--resolution=256 --random_flip \
--train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
--max_train_steps=15000 \
--checkpointing_steps=5000 --checkpoints_total_limit=1 \
--learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
--conditioning_dropout_prob=0.05 \
--mixed_precision=fp16 \
--val_image_url="https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png" \
--validation_prompt="make the mountains snowy" \
--seed=42 \
--report_to=wandb \
--push_to_hub
```
๋ชจ๋ธ ๋””๋ฒ„๊น…์— ์œ ์šฉํ•œ ์ด ํ‰๊ฐ€ ๋ฐฉ๋ฒ• ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด `wandb`๋ฅผ ์„ค์น˜ํ•˜๋Š” ๊ฒƒ์„ ์ฃผ๋ชฉํ•ด์ฃผ์„ธ์š”. `pip install wandb`๋กœ ์‹คํ–‰ํ•ด `wandb`๋ฅผ ์„ค์น˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
[์—ฌ๊ธฐ](https://wandb.ai/sayakpaul/instruct-pix2pix/runs/ctr3kovq), ๋ช‡ ๊ฐ€์ง€ ํ‰๊ฐ€ ๋ฐฉ๋ฒ•๊ณผ ํ•™์Šต ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํฌํ•จํ•˜๋Š” ์˜ˆ์‹œ๋ฅผ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
***์ฐธ๊ณ : ์›๋ณธ ๋…ผ๋ฌธ์—์„œ, ์ €์ž๋“ค์€ 256x256 ์ด๋ฏธ์ง€ ํ•ด์ƒ๋„๋กœ ํ•™์Šตํ•œ ๋ชจ๋ธ๋กœ 512x512์™€ ๊ฐ™์€ ๋” ํฐ ํ•ด์ƒ๋„๋กœ ์ž˜ ์ผ๋ฐ˜ํ™”๋˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ํ•™์Šต์— ์‚ฌ์šฉํ•œ ํฐ ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ–ˆ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.***
## ๋‹ค์ˆ˜์˜ GPU๋กœ ํ•™์Šตํ•˜๊ธฐ
`accelerate`๋Š” ์›ํ™œํ•œ ๋‹ค์ˆ˜์˜ GPU๋กœ ํ•™์Šต์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. `accelerate`๋กœ ๋ถ„์‚ฐ ํ•™์Šต์„ ์‹คํ–‰ํ•˜๋Š” [์—ฌ๊ธฐ](https://huggingface.co/docs/accelerate/basic_tutorials/launch) ์„ค๋ช…์„ ๋”ฐ๋ผ ํ•ด ์ฃผ์‹œ๊ธฐ ๋ฐ”๋ž๋‹ˆ๋‹ค. ์˜ˆ์‹œ์˜ ๋ช…๋ น์–ด ์ž…๋‹ˆ๋‹ค:
```bash
accelerate launch --mixed_precision="fp16" --multi_gpu train_instruct_pix2pix.py \
--pretrained_model_name_or_path=stable-diffusion-v1-5/stable-diffusion-v1-5 \
--dataset_name=sayakpaul/instructpix2pix-1000-samples \
--use_ema \
--enable_xformers_memory_efficient_attention \
--resolution=512 --random_flip \
--train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
--max_train_steps=15000 \
--checkpointing_steps=5000 --checkpoints_total_limit=1 \
--learning_rate=5e-05 --lr_warmup_steps=0 \
--conditioning_dropout_prob=0.05 \
--mixed_precision=fp16 \
--seed=42 \
--push_to_hub
```
## ์ถ”๋ก ํ•˜๊ธฐ
์ผ๋‹จ ํ•™์Šต์ด ์™„๋ฃŒ๋˜๋ฉด, ์ถ”๋ก  ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:
```python
import PIL.Image
import PIL.ImageOps
import requests
import torch

from diffusers import StableDiffusionInstructPix2PixPipeline

model_id = "your_model_id"  # <- replace this with your model id
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
generator = torch.Generator("cuda").manual_seed(0)

url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/test_pix2pix_4.png"


def download_image(url):
    # Download the image, honor any EXIF rotation, and convert it to RGB.
    image = PIL.Image.open(requests.get(url, stream=True).raw)
    image = PIL.ImageOps.exif_transpose(image)
    image = image.convert("RGB")
    return image


image = download_image(url)
prompt = "wipe out the lake"
num_inference_steps = 20
image_guidance_scale = 1.5
guidance_scale = 10

edited_image = pipe(
    prompt,
    image=image,
    num_inference_steps=num_inference_steps,
    image_guidance_scale=image_guidance_scale,
    guidance_scale=guidance_scale,
    generator=generator,
).images[0]
edited_image.save("edited_image.png")
```
ํ•™์Šต ์Šคํฌ๋ฆฝํŠธ๋ฅผ ์‚ฌ์šฉํ•ด ์–ป์€ ์˜ˆ์‹œ์˜ ๋ชจ๋ธ ๋ ˆํฌ์ง€ํ† ๋ฆฌ๋Š” ์—ฌ๊ธฐ [sayakpaul/instruct-pix2pix](https://huggingface.co/sayakpaul/instruct-pix2pix)์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
์„ฑ๋Šฅ์„ ์œ„ํ•œ ์†๋„์™€ ํ’ˆ์งˆ์„ ์ œ์–ดํ•˜๊ธฐ ์œ„ํ•ด ์„ธ ๊ฐ€์ง€ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ข‹์Šต๋‹ˆ๋‹ค:
* `num_inference_steps`
* `image_guidance_scale`
* `guidance_scale`
ํŠนํžˆ, `image_guidance_scale`์™€ `guidance_scale`๋Š” ์ƒ์„ฑ๋œ("์ˆ˜์ •๋œ") ์ด๋ฏธ์ง€์—์„œ ํฐ ์˜ํ–ฅ์„ ๋ฏธ์น  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.([์—ฌ๊ธฐ](https://twitter.com/RisingSayak/status/1628392199196151808?s=20)์˜ˆ์‹œ๋ฅผ ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š”.)
๋งŒ์•ฝ InstructPix2Pix ํ•™์Šต ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•ด ๋ช‡ ๊ฐ€์ง€ ํฅ๋ฏธ๋กœ์šด ๋ฐฉ๋ฒ•์„ ์ฐพ๊ณ  ์žˆ๋‹ค๋ฉด, ์ด ๋ธ”๋กœ๊ทธ ๊ฒŒ์‹œ๋ฌผ[Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd)์„ ํ™•์ธํ•ด์ฃผ์„ธ์š”.