---
license: openrail++
datasets:
- friedrichor/PhotoChat_120_square_HQ
language:
- en
tags:
- stable-diffusion
- text-to-image
---

This `friedrichor/stable-diffusion-2-1-realistic` model was fine-tuned from [stable-diffusion-2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1) on [friedrichor/PhotoChat_120_square_HQ](https://huggingface.co/datasets/friedrichor/PhotoChat_120_square_HQ).

This model was not trained solely for text-to-image tasks; it is part of the *Tiger* model (currently not open-source, under submission) for Multimodal Dialogue Response Generation.

# Model Details

- **Model type:** Diffusion-based text-to-image generation model
- **Language(s):** English
- **License:** [CreativeML Open RAIL++-M License](https://huggingface.co/stabilityai/stable-diffusion-2/blob/main/LICENSE-MODEL)
- **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses a fixed, pretrained text encoder ([OpenCLIP-ViT/H](https://github.com/mlfoundations/open_clip)).

## Dataset
[friedrichor/PhotoChat_120_square_HQ](https://huggingface.co/datasets/friedrichor/PhotoChat_120_square_HQ) was used for fine-tuning Stable Diffusion v2.1.  

The dataset contains 120 image-text pairs.

Images were manually screened from the [PhotoChat](https://aclanthology.org/2021.acl-long.479/) dataset, cropped to squares, and upscaled with `Gigapixel` to improve quality.  
Image captions were generated with [BLIP-2](https://arxiv.org/abs/2301.12597).
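
As an illustration of the captioning step, a minimal sketch along the following lines could reproduce it with 🤗 Transformers. The `Salesforce/blip2-opt-2.7b` checkpoint, the input file name, and the generation settings are assumptions, not the exact setup used for this dataset.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumption: the Salesforce/blip2-opt-2.7b checkpoint; the captions for this dataset
# may have been produced with a different BLIP-2 variant or generation settings.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda:0")

image = Image.open("photo.jpg").convert("RGB")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt").to("cuda:0", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```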

## How to fine-tune

See [friedrichor/Text-to-Image-Summary/fine-tune/text2image](https://github.com/friedrichor/Text-to-Image-Summary/tree/main/fine-tune/text2image) or the [Hugging Face diffusers text-to-image example](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image).

# Simple usage example

Using the [🤗 Diffusers library](https://github.com/huggingface/diffusers):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the fine-tuned pipeline and move it to the GPU
device = "cuda:0"
pipe = StableDiffusionPipeline.from_pretrained("friedrichor/stable-diffusion-2-1-realistic", torch_dtype=torch.float32)
pipe.to(device)

prompt = "a woman in a red and gold costume with feathers on her head"
# Prompt template suffix for realistic photographs of people (see "Prompt template" below)
extra_prompt = ", facing the camera, photograph, highly detailed face, depth of field, moody light, style by Yasmin Albatoul, Harry Fayt, centered, extremely detailed, Nikon D850, award winning photography"
negative_prompt = "cartoon, anime, ugly, (aged, white beard, black skin, wrinkle:1.1), (bad proportions, unnatural feature, incongruous feature:1.4), (blurry, un-sharp, fuzzy, un-detailed skin:1.2), (facial contortion, poorly drawn face, deformed iris, deformed pupils:1.3), (mutated hands and fingers:1.5), disconnected hands, disconnected limbs"

# Fix the random seed for reproducible results
generator = torch.Generator(device=device).manual_seed(42)
image = pipe(prompt + extra_prompt,
             negative_prompt=negative_prompt,
             height=768, width=768,
             num_inference_steps=20,
             guidance_scale=7.5,
             generator=generator).images[0]
image.save("image.png")
```
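
If GPU memory is limited, the pipeline can usually also be loaded in half precision. The sketch below is not part of the original example and assumes fp16 inference is acceptable for this checkpoint.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumption: fp16 inference is acceptable for this checkpoint; results may differ slightly from fp32.
pipe = StableDiffusionPipeline.from_pretrained(
    "friedrichor/stable-diffusion-2-1-realistic", torch_dtype=torch.float16
).to("cuda:0")
pipe.enable_attention_slicing()  # trades a little speed for lower peak memory
```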

## Prompt template

**Applying a prompt template is helpful for improving image quality.**

If you want to generate real-world images containing humans, you can try the following prompt template.

`
{{caption}}, facing the camera, photograph, highly detailed face, depth of field, moody light, style by Yasmin Albatoul, Harry Fayt, centered, extremely detailed, Nikon D850, award winning photography
`
<hr>

If you want to generate real-world images without humans, you can try the following prompt template.

`
{{caption}}, depth of field. bokeh. soft light. by Yasmin Albatoul, Harry Fayt. centered. extremely detailed. Nikon D850, (35mm|50mm|85mm). award winning photography.
`
<hr>

For more prompt templates, see [Dalabad/stable-diffusion-prompt-templates](https://github.com/Dalabad/stable-diffusion-prompt-templates), [r/StableDiffusion](https://www.reddit.com/r/StableDiffusion/), etc.  
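
As a minimal sketch of how a template is applied (the helper function below is hypothetical, not part of the model or its API), a caption simply replaces the `{{caption}}` placeholder before being passed to the pipeline:

```python
# Hypothetical helper, not part of the model: fill the {{caption}} placeholder of a template.
REALISTIC_HUMAN_TEMPLATE = (
    "{{caption}}, facing the camera, photograph, highly detailed face, depth of field, "
    "moody light, style by Yasmin Albatoul, Harry Fayt, centered, extremely detailed, "
    "Nikon D850, award winning photography"
)

def apply_template(caption: str, template: str = REALISTIC_HUMAN_TEMPLATE) -> str:
    return template.replace("{{caption}}", caption)

prompt = apply_template("a woman in a red and gold costume with feathers on her head")
# prompt can now be passed to pipe(prompt, negative_prompt=..., ...) as in the example above
```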

## Negative prompt

**Applying a negative prompt is also helpful for improving image quality.**

For example,

`
cartoon, anime, ugly, (aged, white beard, black skin, wrinkle:1.1), (bad proportions, unnatural feature, incongruous feature:1.4), (blurry, un-sharp, fuzzy, un-detailed skin:1.2), (facial contortion, poorly drawn face, deformed iris, deformed pupils:1.3), (mutated hands and fingers:1.5), disconnected hands, disconnected limbs
`

# Hosted inference API

You can use the **Hosted inference API** on the right by entering a prompt.  

For example,   

`a woman in a red and gold costume with feathers on her head, facing the camera, photograph, highly detailed face, depth of field, moody light, style by Yasmin Albatoul, Harry Fayt, centered, extremely detailed, Nikon D850, award winning photography`