ehsanakh committed
Commit: b4433b4
1 Parent(s): 5d6fdc5

Create README.md

Files changed (1)
  1. README.md +101 -0
README.md ADDED
@@ -0,0 +1,101 @@
---
license: other
license_name: playground-v2-community
license_link: https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic/blob/main/LICENSE.md
tags:
- text-to-image
- playground
---
# Playground v2 – 1024px Aesthetic Model

This repository contains a model that generates highly aesthetic images at a resolution of 1024x1024. You can use the model with Hugging Face 🧨 Diffusers.

<!-- Insert teaser images here -->

**Playground v2** is a diffusion-based text-to-image generative model. The model was trained from scratch by the research team at [Playground](https://playground.com).

Playground v2’s images are favored 2.5 times more than those produced by Stable Diffusion XL, according to Playground’s [user study](#user-study).

We are thrilled to release all intermediate checkpoints at different training stages, including evaluation metrics, to the community. We hope this will foster more foundation model research in pixels.

Lastly, we introduce a new benchmark, [MJHQ-30K](#mjhq-30k-benchmark), for automatic evaluation of a model’s aesthetic quality.

### Model Description

- **Developed by:** [Playground](https://playground.com)
- **Model type:** Diffusion-based text-to-image generative model
- **License:** [Playground v2 Community License](https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic/blob/main/LICENSE.md)
- **Model Description:** This model generates images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses two fixed, pre-trained text encoders ([OpenCLIP-ViT/G](https://github.com/mlfoundations/open_clip) and [CLIP-ViT/L](https://github.com/openai/CLIP/tree/main)). It follows the same architecture as [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) (see the component sketch below).

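Because the checkpoint loads through Diffusers' standard multi-folder layout (as in the usage example in the next section), the component list above can be read straight from the repository's `model_index.json` without downloading any weights. The snippet below is a minimal sketch of that check, assuming only the repo id already used in this card; expect entries such as `text_encoder`/`text_encoder_2` for the two CLIP text encoders, plus the UNet, VAE, and scheduler.

```python
from diffusers import DiffusionPipeline

# Fetch only model_index.json (no weights) and list the pipeline components.
config = DiffusionPipeline.load_config("playgroundai/playground-v2-1024px-aesthetic")

for name, value in config.items():
    if isinstance(value, list):  # component entries are [library, class_name] pairs
        print(f"{name}: {value[1]}")
```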
### Using the model with 🧨 Diffusers

Install diffusers >= 0.19.0 along with the other required dependencies:

```
pip install "diffusers>=0.19.0" invisible_watermark transformers accelerate safetensors
```

To use the model, run:

```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "playgroundai/playground-v2-1024px-aesthetic",
    torch_dtype=torch.float16,
    use_safetensors=True,
    add_watermarker=False,
    variant="fp16"
)
pipe.to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt=prompt).images[0]
```
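
If the full fp16 pipeline does not fit in GPU memory, Diffusers' model CPU offloading is a common alternative to `pipe.to("cuda")`. The sketch below shows that variant plus saving the result; `num_inference_steps`, `guidance_scale`, and the output filename are illustrative choices, not tuned recommendations from Playground.

```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "playgroundai/playground-v2-1024px-aesthetic",
    torch_dtype=torch.float16,
    use_safetensors=True,
    add_watermarker=False,
    variant="fp16"
)

# Instead of pipe.to("cuda"): keep submodules on the CPU and move each one to
# the GPU only while it runs (requires `accelerate`, installed above).
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    num_inference_steps=50,   # standard pipeline arguments; these values are illustrative
    guidance_scale=5.0,
).images[0]

# The pipeline returns PIL images; write the first one to disk.
image.save("astronaut_jungle.png")
```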

### User Study

According to user studies conducted by Playground, involving over 2,600 prompts and thousands of users, the images generated by Playground v2 are favored 2.5 times more than those produced by [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0).

We report user preference metrics on [PartiPrompts](https://github.com/google-research/parti), following standard practice, and on an internal prompt dataset curated by the Playground team. The “Internal 1K” prompt dataset is diverse and covers various categories and tasks.

During the user study, we instruct users to evaluate image pairs based on both (1) their aesthetic preference and (2) the image-text alignment.

### MJHQ-30K Benchmark

We introduce a new benchmark, [MJHQ-30K](https://huggingface.co/datasets/playgroundai/MJHQ30K), for automatic evaluation of a model’s aesthetic quality. The benchmark computes FID on a high-quality dataset to gauge aesthetic quality.

We curate the high-quality dataset from Midjourney with 10 common categories, each with 3K samples. Following common practice, we use aesthetic score and CLIP score to ensure high image quality and strong image-text alignment. Furthermore, we take extra care to make the data diverse within each category.

For Playground v2, we report both the overall FID and per-category FID. (All FID metrics are computed at resolution 256x256.)
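
For reference, the FID side of the benchmark can be reproduced with an off-the-shelf FID implementation. The sketch below uses the `clean-fid` package (`pip install clean-fid`) and hypothetical folder names; it only illustrates the 256x256 protocol and is not Playground’s exact evaluation code.

```python
from pathlib import Path

from PIL import Image
from cleanfid import fid  # pip install clean-fid


def resize_folder(src: str, dst: str, size: int = 256) -> str:
    """Resize every image in `src` to size x size and write PNGs to `dst`."""
    Path(dst).mkdir(parents=True, exist_ok=True)
    for path in Path(src).iterdir():
        if path.suffix.lower() in {".png", ".jpg", ".jpeg", ".webp"}:
            Image.open(path).convert("RGB").resize((size, size)).save(Path(dst) / f"{path.stem}.png")
    return dst


# Hypothetical folders: MJHQ-30K reference images and one folder of model generations.
ref_256 = resize_folder("mjhq30k_reference_images", "mjhq30k_reference_256")
gen_256 = resize_folder("model_generations", "model_generations_256")

# Overall FID; per-category FID repeats this over each category sub-folder.
print("FID:", fid.compute_fid(ref_256, gen_256))
```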

We release this benchmark to the public and encourage the community to adopt it for benchmarking their models’ aesthetic quality.

### Base Models for all resolutions

| Model | FID | CLIP Score |
| --- | --- | --- |
| *SDXL-1-0-refiner | 13.04 | 32.62 |
| playground-v2-256px-base | 9.83 | 31.90 |
| playground-v2-512px-base | 9.55 | 32.08 |
| playground-v2-1024px-base | 9.97 | 31.90 |

Apart from playground-v2-1024px-aesthetic, we release all intermediate checkpoints at different training stages to the community in order to foster foundation model research in pixels. Here, we report the FID score and CLIP score on the MSCOCO14 evaluation set for reference purposes. (Note that our reported numbers may differ from the numbers reported in SDXL’s published results, as our prompt list may be different.)
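
CLIP score, used both as a curation filter for MJHQ-30K and as an evaluation metric in the table above, is typically computed as the CLIP image-text similarity averaged over prompt/image pairs. The sketch below shows one common way to do this with `torchmetrics`; it is not Playground’s evaluation code, and the CLIP backbone is an assumption (absolute scores depend on the backbone and preprocessing).

```python
import torch
from torchmetrics.multimodal.clip_score import CLIPScore  # pip install "torchmetrics[multimodal]"

# Backbone choice is an assumption; different CLIP models give different absolute scores.
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

# `images` are uint8 tensors of shape (N, 3, H, W); `prompts` are the matching captions.
# Random data here just keeps the sketch self-contained and runnable.
images = torch.randint(0, 255, (2, 3, 256, 256), dtype=torch.uint8)
prompts = ["Astronaut in a jungle, cold color palette", "a red sports car on a mountain road"]

clip_score.update(images, prompts)
print("CLIP score:", clip_score.compute().item())
```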