---

license: apache-2.0
language:
  - zh
  - en
base_model:
  - THUDM/glm-4-9b
pipeline_tag: text-to-image
library_name: diffusers
---


# CogView4-6B

<p style="text-align: center;">
  <div align="center">
  <img src=https://github.com/THUDM/CogView4/raw/main/resources/logo.svg width="50%"/>
  </div>
  <p align="center">
  <a href="https://huggingface.co/spaces/THUDM-HF-SPACE/CogView4">πŸ€— Space | </a> 
  <a href="https://github.com/THUDM/CogView4">🌐 Github </a> | 
  <a href="https://arxiv.org/pdf/2403.05121">πŸ“œ arxiv </a>
</p>

![img](https://raw.githubusercontent.com/THUDM/CogView4/refs/heads/main/resources/showcase.png)

## Inference Requirements and Model Introduction

+ Resolution: Width and height must each be between `512px` and `2048px` and divisible by `32`, and the total pixel count (width × height) must not exceed `2^21` pixels. A small validation sketch follows this list.
+ Precision: BF16 / FP32 (FP16 is not supported because it overflows and produces completely black images).
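
The following is a minimal sketch of how these constraints could be checked before calling the pipeline; the helper `validate_resolution` is a hypothetical convenience function, not part of the diffusers API.

```python
# Hypothetical helper to pre-check the CogView4-6B resolution constraints listed above.
MAX_PIXELS = 2 ** 21  # 2,097,152 pixels

def validate_resolution(width: int, height: int) -> None:
    for name, value in (("width", width), ("height", height)):
        if not 512 <= value <= 2048:
            raise ValueError(f"{name} must be between 512 and 2048, got {value}")
        if value % 32 != 0:
            raise ValueError(f"{name} must be divisible by 32, got {value}")
    if width * height > MAX_PIXELS:
        raise ValueError(f"width * height must not exceed {MAX_PIXELS}, got {width * height}")

validate_resolution(1024, 1024)    # passes
# validate_resolution(1000, 1000)  # raises: 1000 is not divisible by 32
```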

Memory usage measured with `BF16` precision and a batch size of 4 is shown in the table below:

| Resolution  | enable_model_cpu_offload OFF | enable_model_cpu_offload ON | enable_model_cpu_offload ON<br>Text Encoder 4-bit |
|-------------|------------------------------|-----------------------------|----------------------------------------------------|
| 512 * 512   | 33 GB                        | 20 GB                       | 13 GB                                              |
| 1280 * 720  | 35 GB                        | 20 GB                       | 13 GB                                              |
| 1024 * 1024 | 35 GB                        | 20 GB                       | 13 GB                                              |
| 1920 * 1280 | 39 GB                        | 20 GB                       | 14 GB                                              |
| 2048 * 2048 | 43 GB                        | 21 GB                       | 14 GB                                              |
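
The last column above combines CPU offload with a 4-bit quantized text encoder. The sketch below shows one possible way to set this up, assuming `bitsandbytes` is installed and that the GLM text encoder can be loaded from the repo's `text_encoder` subfolder via `transformers`; the exact loading recipe is an assumption, not an official instruction.

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig
from diffusers import CogView4Pipeline

# Assumed layout: the GLM text encoder lives in the "text_encoder" subfolder of the repo.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
text_encoder = AutoModel.from_pretrained(
    "THUDM/CogView4-6B",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = CogView4Pipeline.from_pretrained(
    "THUDM/CogView4-6B",
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keep only the active sub-module on the GPU
```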



## Quick Start



First, ensure you install the `diffusers` library from source.



```shell
# Option 1: install directly from GitHub
pip install git+https://github.com/huggingface/diffusers.git

# Option 2: install a local clone in editable mode
git clone https://github.com/huggingface/diffusers.git
cd diffusers
pip install -e .
```
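
As an optional sanity check (not part of the original instructions), you can verify that the installed build already exposes the CogView4 pipeline:

```python
# Raises ImportError on diffusers releases that predate CogView4 support.
import diffusers
from diffusers import CogView4Pipeline

print(diffusers.__version__)
```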



Then, run the following code:



```python
import torch
from diffusers import CogView4Pipeline

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)

# Enable these options to reduce GPU memory usage
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."
image = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview4.png")
```
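
Continuing from the snippet above, one way to approximate the batch size of 4 used for the memory measurements (a sketch, not from the original card) is to request several images per prompt and save each result:

```python
# Reuses `pipe` and `prompt` from the previous snippet; file names are arbitrary.
images = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    num_images_per_prompt=4,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images

for i, img in enumerate(images):
    img.save(f"cogview4_{i}.png")
```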



## Model Metrics



We evaluated the model on multiple benchmarks and obtained the following scores:



### DPG-Bench



| Model           | Overall   | Global    | Entity    | Attribute | Relation  | Other     |
|-----------------|-----------|-----------|-----------|-----------|-----------|-----------|
| SDXL            | 74.65     | 83.27     | 82.43     | 80.91     | 86.76     | 80.41     |
| PixArt-alpha    | 71.11     | 74.97     | 79.32     | 78.60     | 82.57     | 76.96     |
| SD3-Medium      | 84.08     | 87.90     | **91.01** | 88.83     | 80.70     | 88.68     |
| DALL-E 3        | 83.50     | **90.97** | 89.61     | 88.39     | 90.58     | 89.83     |
| Flux.1-dev      | 83.79     | 85.80     | 86.79     | 89.98     | 90.04     | **89.90** |
| Janus-Pro-7B    | 84.19     | 86.90     | 88.90     | 89.40     | 89.32     | 89.48     |
| **CogView4-6B** | **85.13** | 83.85     | 90.35     | **91.17** | **91.14** | 87.29     |



### GenEval



| Model           | Overall  | Single Obj. | Two Obj. | Counting | Colors   | Position | Color attribution |
|-----------------|----------|-------------|----------|----------|----------|----------|-------------------|
| SDXL            | 0.55     | 0.98        | 0.74     | 0.39     | 0.85     | 0.15     | 0.23              |
| PixArt-alpha    | 0.48     | 0.98        | 0.50     | 0.44     | 0.80     | 0.08     | 0.07              |
| SD3-Medium      | 0.74     | **0.99**    | **0.94** | 0.72     | 0.89     | 0.33     | 0.60              |
| DALL-E 3        | 0.67     | 0.96        | 0.87     | 0.47     | 0.83     | 0.43     | 0.45              |
| Flux.1-dev      | 0.66     | 0.98        | 0.79     | **0.73** | 0.77     | 0.22     | 0.45              |
| Janus-Pro-7B    | **0.80** | **0.99**    | 0.89     | 0.59     | **0.90** | **0.79** | **0.66**          |
| **CogView4-6B** | 0.73     | **0.99**    | 0.86     | 0.66     | 0.79     | 0.48     | 0.58              |



### T2I-CompBench



| Model           | Color      | Shape      | Texture    | 2D-Spatial | 3D-Spatial | Numeracy   | Non-spatial (CLIP) | Complex 3-in-1 |
|-----------------|------------|------------|------------|------------|------------|------------|--------------------|----------------|
| SDXL            | 0.5879     | 0.4687     | 0.5299     | 0.2133     | 0.3566     | 0.4988     | 0.3119             | 0.3237         |
| PixArt-alpha    | 0.6690     | 0.4927     | 0.6477     | 0.2064     | 0.3901     | 0.5058     | **0.3197**         | 0.3433         |
| SD3-Medium      | **0.8132** | 0.5885     | **0.7334** | **0.3200** | **0.4084** | 0.6174     | 0.3140             | 0.3771         |
| DALL-E 3        | 0.7785     | **0.6205** | 0.7036     | 0.2865     | 0.3744     | 0.5880     | 0.3003             | 0.3773         |
| Flux.1-dev      | 0.7572     | 0.5066     | 0.6300     | 0.2700     | 0.3992     | 0.6165     | 0.3065             | 0.3628         |
| Janus-Pro-7B    | 0.5145     | 0.3323     | 0.4069     | 0.1566     | 0.2753     | 0.4406     | 0.3137             | 0.3806         |
| **CogView4-6B** | 0.7786     | 0.5880     | 0.6983     | 0.3075     | 0.3708     | **0.6626** | 0.3056             | **0.3869**     |



## Chinese Text Accuracy Evaluation



| Model           | Precision  | Recall     | F1 Score   | Pick@4     |
|-----------------|------------|------------|------------|------------|
| Kolors          | 0.6094     | 0.1886     | 0.2880     | 0.1633     |
| **CogView4-6B** | **0.6969** | **0.5532** | **0.6168** | **0.3265** |



## Citation



🌟 If you find our work helpful, please consider citing our paper and starring the repository.



```
@article{zheng2024cogview3,
  title={Cogview3: Finer and faster text-to-image generation via relay diffusion},
  author={Zheng, Wendi and Teng, Jiayan and Yang, Zhuoyi and Wang, Weihan and Chen, Jidong and Gu, Xiaotao and Dong, Yuxiao and Ding, Ming and Tang, Jie},
  journal={arXiv preprint arXiv:2403.05121},
  year={2024}
}
```



## License



This model is released under the [Apache 2.0 License](LICENSE).