---
license: openrail++
datasets:
- friedrichor/PhotoChat_120_square_HQ
language:
- en
tags:
- stable-diffusion
- text-to-image
---

The `friedrichor/stable-diffusion-2-1-realistic` model was fine-tuned from [stable-diffusion-2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1) on [friedrichor/PhotoChat_120_square_HQ](https://huggingface.co/datasets/friedrichor/PhotoChat_120_square_HQ).

This model was not trained solely for text-to-image tasks; it is one component of the *Tiger* model for Multimodal Dialogue Response Generation (not yet open-sourced, currently under submission).

# Model Details

- **Model type:** Diffusion-based text-to-image generation model
- **Language(s):** English
- **License:** [CreativeML Open RAIL++-M License](https://huggingface.co/stabilityai/stable-diffusion-2/blob/main/LICENSE-MODEL)
- **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses a fixed, pretrained text encoder ([OpenCLIP-ViT/H](https://github.com/mlfoundations/open_clip)).

## Dataset
[friedrichor/PhotoChat_120_square_HQ](https://huggingface.co/datasets/friedrichor/PhotoChat_120_square_HQ) was used for fine-tuning Stable Diffusion v2.1.  

The dataset contains 120 image-text pairs.  

Images were manually screened from the [PhotoChat](https://aclanthology.org/2021.acl-long.479/) dataset, cropped to square, and enhanced with `Gigapixel` to improve quality.  
Image captions were generated by [BLIP-2](https://arxiv.org/abs/2301.12597).

## How to fine-tune

See [friedrichor/Text-to-Image-Summary/fine-tune/text2image](https://github.com/friedrichor/Text-to-Image-Summary/tree/main/fine-tune/text2image) or the [Hugging Face diffusers text-to-image example](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image).

# Simple use example

Using the [🤗 Diffusers library](https://github.com/huggingface/diffusers):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the fine-tuned weights and move the pipeline to the GPU.
device = "cuda:0"
pipe = StableDiffusionPipeline.from_pretrained("friedrichor/stable-diffusion-2-1-realistic", torch_dtype=torch.float32)
pipe.to(device)

# Caption, prompt template suffix, and negative prompt (see the sections below).
prompt = "a woman in a red and gold costume with feathers on her head"
extra_prompt = ", facing the camera, photograph, highly detailed face, depth of field, moody light, style by Yasmin Albatoul, Harry Fayt, centered, extremely detailed, Nikon D850, award winning photography"
negative_prompt = "cartoon, anime, ugly, (aged, white beard, black skin, wrinkle:1.1), (bad proportions, unnatural feature, incongruous feature:1.4), (blurry, un-sharp, fuzzy, un-detailed skin:1.2), (facial contortion, poorly drawn face, deformed iris, deformed pupils:1.3), (mutated hands and fingers:1.5), disconnected hands, disconnected limbs"

# Fix the random seed so the generated image is reproducible.
generator = torch.Generator(device=device).manual_seed(42)
image = pipe(prompt + extra_prompt,
             negative_prompt=negative_prompt,
             height=768, width=768,
             num_inference_steps=20,
             guidance_scale=7.5,
             generator=generator).images[0]
image.save("image.png")
```

## Prompt template

**Applying a prompt template helps improve image quality.**  

To generate realistic images of people, you can try the following prompt template.  

`
{{caption}}, facing the camera, photograph, highly detailed face, depth of field, moody light, style by Yasmin Albatoul, Harry Fayt, centered, extremely detailed, Nikon D850, award winning photography
`
<hr>

To generate realistic images without people, you can try the following prompt template.  

`
{{caption}}, depth of field. bokeh. soft light. by Yasmin Albatoul, Harry Fayt. centered. extremely detailed. Nikon D850, (35mm|50mm|85mm). award winning photography.
`
<hr>

For more prompt templates, see [Dalabad/stable-diffusion-prompt-templates](https://github.com/Dalabad/stable-diffusion-prompt-templates), [r/StableDiffusion](https://www.reddit.com/r/StableDiffusion/), etc.  
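
The two templates above can be applied programmatically. The following is a minimal sketch, not part of the original model code: `build_prompt`, `HUMAN_TEMPLATE`, `SCENE_TEMPLATE`, and the `lens` parameter are hypothetical names introduced here for illustration. It assumes that `(35mm|50mm|85mm)` in the second template means "pick one focal length"; if no choice is given, the text is left exactly as written in the template.

```python
# Templates copied from this model card; {caption} marks where the caption goes.
HUMAN_TEMPLATE = (
    "{caption}, facing the camera, photograph, highly detailed face, "
    "depth of field, moody light, style by Yasmin Albatoul, Harry Fayt, "
    "centered, extremely detailed, Nikon D850, award winning photography"
)
SCENE_TEMPLATE = (
    "{caption}, depth of field. bokeh. soft light. by Yasmin Albatoul, "
    "Harry Fayt. centered. extremely detailed. Nikon D850, {lens}. "
    "award winning photography."
)
# Assumption: this alternation means "choose one"; it is kept verbatim by default.
LENS_OPTIONS = "(35mm|50mm|85mm)"


def build_prompt(caption, with_human=True, lens=None):
    """Fill the caption into one of the two prompt templates (hypothetical helper)."""
    # Drop surrounding whitespace and trailing punctuation so the separator is not doubled.
    caption = caption.strip().rstrip(",.")
    if with_human:
        return HUMAN_TEMPLATE.format(caption=caption)
    return SCENE_TEMPLATE.format(caption=caption, lens=lens or LENS_OPTIONS)
```

For example, `build_prompt("a woman in a red and gold costume with feathers on her head")` reproduces the full prompt used in the usage example, and `build_prompt("a bowl of fruit on a table", with_human=False, lens="50mm")` fills the second template with a single focal length.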

## Negative prompt

**Applying a negative prompt also helps improve image quality.**  

For example,

`
cartoon, anime, ugly, (aged, white beard, black skin, wrinkle:1.1), (bad proportions, unnatural feature, incongruous feature:1.4), (blurry, un-sharp, fuzzy, un-detailed skin:1.2), (facial contortion, poorly drawn face, deformed iris, deformed pupils:1.3), (mutated hands and fingers:1.5), disconnected hands, disconnected limbs
`
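
The `(text:1.3)` weighting syntax in this negative prompt comes from Stable Diffusion web UIs. To my knowledge, the stock `StableDiffusionPipeline` in Diffusers does not interpret these weights and passes the parentheses and numbers to the tokenizer as ordinary text. If you prefer a plain-text negative prompt, the sketch below removes the weights; `strip_prompt_weights` is a hypothetical helper introduced here, and it does not handle nested parentheses.

```python
import re

# Matches groups such as "(aged, wrinkle:1.1)" and captures the text before the colon.
_WEIGHTED_GROUP = re.compile(r"\(([^():]+):\s*\d+(?:\.\d+)?\)")


def strip_prompt_weights(prompt):
    """Replace each "(text:weight)" group with its plain text (hypothetical helper)."""
    return _WEIGHTED_GROUP.sub(lambda m: m.group(1).strip(), prompt)
```

The result can be passed as `negative_prompt` in the same way as the original string. Whether the weighted or the plain form gives better images with this model is not something the sketch tests.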

# Hosted inference API

You can try the model with the **Hosted inference API** widget on the right by entering a prompt.  

For example,   

`a woman in a red and gold costume with feathers on her head, facing the camera, photograph, highly detailed face, depth of field, moody light, style by Yasmin Albatoul, Harry Fayt, centered, extremely detailed, Nikon D850, award winning photography`