rain1011 committed
Commit
38f9920
1 Parent(s): 3661d0a

Update README.md

Files changed (1)
  1. README.md +399 -0
README.md CHANGED
@@ -1,3 +1,402 @@
---
license: llama2
pipeline_tag: text-to-image
---

# LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

This is the latest version (LaVITv2) of the multi-modal large language model **LaVIT**.

In this version, we further improve LaVIT's image generation capability: the **aesthetics** and **prompt alignment** of the generated images are improved, and the **probability of watermarks** is greatly reduced. The improvements are summarized as follows:
* Using LaVIT to generate better synthetic captions for the noisy Laion-Aesthetic data (like DALL-E 3).
* Adding high-aesthetic training images from the open-source JourneyDB dataset.
* Using the 20M synthetic Laion-Aesthetic data and the 4.2M JourneyDB data to further fine-tune the LLM for 8K steps.

[[`arXiv`](https://arxiv.org/abs/2309.04669)] [[`BibTeX`](#Citing)]

## Setup

### Requirements

The code in this repo is tested with PyTorch 1.13.1 and CUDA 11.7.
You should first install and configure the PyTorch environment (including torch and torchvision), and then install the remaining requirements with the following commands:

```shell
git clone https://github.com/jy0205/LaVIT.git
cd LaVIT
pip install -r requirements.txt
```
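
If you want to confirm that your environment matches the tested setup, the short check below (a convenience sketch, not part of the repo) prints the installed PyTorch and CUDA versions:

```python
# Sanity-check sketch: the repo is tested with PyTorch 1.13.1 and CUDA 11.7.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA version (build):", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
```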

* (Optional) We recommend using memory-efficient attention by installing xFormers following the instructions [here](https://huggingface.co/docs/diffusers/main/en/optimization/xformers). You can then set the argument `use_xformers=True` in the `build_model` function to save GPU memory and speed up inference.

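If you are unsure whether xFormers is installed, a simple import probe (again just a sketch, not part of the repo) can decide the value you later pass as `use_xformers` to `build_model` in the examples below:

```python
# Sketch: enable xFormers-based attention only when the package is importable.
try:
    import xformers  # noqa: F401  (imported only to probe availability)
    use_xformers = True
except ImportError:
    use_xformers = False

print("xFormers available:", use_xformers)
# Pass the flag later, e.g. build_model(..., use_xformers=use_xformers)
```
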
### Model Zoo
We release the LaVIT weights built upon [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) as the large language model.
> Note: Due to the license restrictions of LLaMA-1, we cannot publish its weights. Thus, we release the weights of LaVIT based on Llama-2.

The latest pre-trained weights of LaVIT are available on Hugging Face [here](https://huggingface.co/rain1011/LaVIT-7B-v2) and take around 25GB of disk space. We strongly recommend downloading and using the latest version of LaVIT.

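If you prefer to fetch the checkpoint ahead of time (the `build_model` call in the usage examples below can also download it automatically), a minimal sketch with `huggingface_hub` looks like this; the local directory is a placeholder you should adapt:

```python
# Sketch: pre-download the LaVIT-7B-v2 checkpoint (takes around 25GB of disk space).
from huggingface_hub import snapshot_download

model_path = "/path/LaVIT_weight"  # placeholder directory, same as used in the examples below
snapshot_download(repo_id="rain1011/LaVIT-7B-v2", local_dir=model_path)
```
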
LaVIT achieves state-of-the-art performance on various multi-modal downstream tasks. The detailed quantitative results are shown below:

#### Zero-shot Multi-modal Understanding

<table>
  <thead align="center">
    <tr>
      <th rowspan="2">Model</th>
      <th colspan="3">Image Captioning</th>
      <th colspan="4">Visual Question Answering</th>
    </tr>
    <tr>
      <th>COCO</th><th>NoCaps</th><th>Flickr30K</th><th>VQAv2</th><th>OK-VQA</th><th>GQA</th><th>VizWiz</th>
    </tr>
  </thead>
  <tbody align="center">
    <tr><td>Flamingo-3B</td><td>73.0</td><td>-</td><td>60.6</td><td>49.2</td><td>41.2</td><td>-</td><td>28.9</td></tr>
    <tr><td>Flamingo-9B</td><td>79.4</td><td>-</td><td>61.5</td><td>51.8</td><td>44.7</td><td>-</td><td>28.8</td></tr>
    <tr><td>OpenFlamingo-9B</td><td>79.5</td><td>-</td><td>59.5</td><td>52.7</td><td>37.8</td><td>-</td><td>27.5</td></tr>
    <tr><td>MetaLM</td><td>82.2</td><td>-</td><td>43.4</td><td>41.1</td><td>11.4</td><td>-</td><td>-</td></tr>
    <tr><td>Kosmos-1</td><td>84.7</td><td>-</td><td>67.1</td><td>51.0</td><td>-</td><td>-</td><td>29.2</td></tr>
    <tr><td>Kosmos-2</td><td>-</td><td>-</td><td>80.5</td><td>51.1</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td>BLIP-2 (Vicuna-7B)</td><td>-</td><td>107.5</td><td>74.9</td><td>-</td><td>-</td><td>41.3</td><td>25.3</td></tr>
    <tr><td>BLIP-2 (Vicuna-13B)</td><td>-</td><td>103.9</td><td>71.6</td><td>-</td><td>-</td><td>32.3</td><td>19.6</td></tr>
    <tr><td>CM3Leon-7B</td><td>61.6</td><td>-</td><td>-</td><td>47.6</td><td>-</td><td>-</td><td>37.6</td></tr>
    <tr><td>Emu (LLaMA-1-13B)</td><td>112.4</td><td>-</td><td>-</td><td>52.0</td><td>38.2</td><td>-</td><td>34.2</td></tr>
    <tr><td>LaVIT (LLaMA-1-7B)</td><td>134.0</td><td><b>114.2</b></td><td>83.0</td><td>66.0</td><td>54.6</td><td>46.8</td><td>38.5</td></tr>
    <tr><td>LaVIT (LLaMA-2-7B)</td><td><b>134.6</b></td><td>113.1</td><td><b>83.2</b></td><td><b>68.2</b></td><td><b>55.7</b></td><td><b>48.0</b></td><td><b>45.3</b></td></tr>
  </tbody>
</table>

#### Zero-shot Text-to-Image Generation

<table>
  <thead>
    <tr>
      <th>Method</th><th>Model</th><th>Model type</th><th>FID</th>
    </tr>
  </thead>
  <tbody align="center">
    <tr><td rowspan="9">Text2Image Specialist</td><td>DALL-E</td><td>Autoregressive</td><td>28.0</td></tr>
    <tr><td>CogView</td><td>Autoregressive</td><td>27.1</td></tr>
    <tr><td>StableDiffusion</td><td>Diffusion</td><td>12.6</td></tr>
    <tr><td>GLIDE</td><td>Diffusion</td><td>12.2</td></tr>
    <tr><td>DALL-E 2</td><td>Diffusion</td><td>10.4</td></tr>
    <tr><td>Make-A-Scene</td><td>Autoregressive</td><td>11.8</td></tr>
    <tr><td>MUSE-7.6B</td><td>Non-Autoregressive</td><td>7.9</td></tr>
    <tr><td>Imagen-3.4B</td><td>Diffusion</td><td>7.3</td></tr>
    <tr><td>Parti-20B</td><td>Autoregressive</td><td><b>7.2</b></td></tr>
    <tr><td rowspan="5">Multimodal Large Language Model</td><td>GILL (OPT-6.7B)</td><td>LLM</td><td>12.2</td></tr>
    <tr><td>Emu (LLaMA-1-13B)</td><td>LLM</td><td>11.7</td></tr>
    <tr><td>CM3Leon-7B</td><td>LLM</td><td>10.8</td></tr>
    <tr><td>LaVIT (LLaMA-1-7B)</td><td>LLM</td><td>7.4</td></tr>
    <tr><td>LaVIT (LLaMA-2-7B)</td><td>LLM</td><td><b>7.2</b></td></tr>
  </tbody>
</table>

## Usage
LaVIT can serve as a multi-modal generalist that performs both multi-modal comprehension and generation. Below, we provide some examples; only a few lines of code are needed to use **LaVIT** for inference. We also provide detailed examples in the following Jupyter notebooks that show how to interact with LaVIT:

* `understanding.ipynb`: examples for multi-modal understanding.
* `text2image_synthesis.ipynb`: examples for text-to-image generation.
* `multimodal_synthesis.ipynb`: examples for image synthesis with multi-modal prompts.

### Multi-modal Understanding

```python
import os
import random
import torch
import torch.nn as nn
from models import build_model
from PIL import Image

seed = 1234
random.seed(seed)
torch.manual_seed(seed)

# The local directory where the LaVIT pre-trained weights are saved;
# the checkpoint is downloaded from Hugging Face automatically if it is not found there
model_path = '/path/LaVIT_weight'

# Use bfloat16 during inference
model_dtype = 'bf16'    # Or set to 'fp16' to enable float16 inference

# Run inference on GPU 0
device_id = 0
torch.cuda.set_device(device_id)
device = torch.device('cuda')

# Build LaVIT for understanding and load its weights from Hugging Face
model = build_model(model_path=model_path, model_dtype=model_dtype,
        device_id=device_id, use_xformers=False, understanding=True)
model = model.to(device)

# Image Captioning
image_path = 'demo/caption_image.jpg'
caption = model.generate({"image": image_path})[0]
print(caption)
# an old photo of a horse and buggy in front of a building

# Visual Question Answering
image_path = 'demo/qa_image.jpg'
question = "What's that drink in the glass?"
answer = model.predict_answers({"image": image_path, "text_input": question}, max_len=10)[0]
print("The answer is: ", answer)
# The answer is: orange juice
```

### Text-to-Image Synthesis

For image generation, the classifier-free guidance scale is important. A larger scale encourages the model to generate samples that are more closely related to the input prompt, at the cost of image quality. We set `guidance_scale_for_llm=4.0` by default; you can increase this scale (e.g., to 5.0 or 6.0) to encourage the generated image to follow the semantics of the given prompt more closely. Besides, you can modify the `ratio` to generate images with different aspect ratios.

```python
import os
import torch
import random
import torch.nn as nn
from models import build_model
from PIL import Image

seed = 1234
random.seed(seed)
torch.manual_seed(seed)

# The local directory where the LaVIT pre-trained weights are saved;
# the checkpoint is downloaded from Hugging Face automatically if it is not found there
model_path = '/path/LaVIT_weight'

# Use bfloat16 during inference
model_dtype = 'bf16'    # Or set to 'fp16' to enable float16 inference

# Run inference on GPU 0
device_id = 0
torch.cuda.set_device(device_id)
device = torch.device('cuda')
torch_dtype = torch.bfloat16 if model_dtype == "bf16" else torch.float16

# Build LaVIT for generation and load its weights from Hugging Face.
# You can set `use_xformers=True` if you have installed xformers, to save GPU memory and speed up inference.
model = build_model(model_path=model_path, model_dtype=model_dtype, device_id=device_id,
        use_xformers=False, understanding=False, load_tokenizer=False)
model = model.to(device)

# Text-to-Image Generation
prompt = "a sculpture of a duck made of wool"

# LaVIT supports 6 different image aspect ratios
ratio_dict = {
    '1:1' : (1024, 1024),
    '4:3' : (896, 1152),
    '3:2' : (832, 1216),
    '16:9' : (768, 1344),
    '2:3' : (1216, 832),
    '3:4' : (1152, 896),
}

# The image aspect ratio you want to generate
ratio = '1:1'
height, width = ratio_dict[ratio]

with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    images = model.generate_image(prompt, width=width, height=height,
        num_return_images=1, guidance_scale_for_llm=4.0, num_inference_steps=25)

# Make sure the output directory exists before saving
os.makedirs("output", exist_ok=True)
images[0].save("output/i2t_output.jpg")
```
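
To see the trade-off between prompt alignment and image quality described above, you can sweep `guidance_scale_for_llm` with the same model and prompt. This is a small sketch that reuses the variables defined in the previous snippet; the scale values are illustrative:

```python
# Sketch: compare several classifier-free guidance scales for the same prompt.
# Reuses `model`, `prompt`, `width`, `height`, and `torch_dtype` from the snippet above.
os.makedirs("output", exist_ok=True)

for cfg_scale in (4.0, 5.0, 6.0):
    with torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
        images = model.generate_image(prompt, width=width, height=height,
            num_return_images=1, guidance_scale_for_llm=cfg_scale, num_inference_steps=25)
    images[0].save(f"output/t2i_cfg_{cfg_scale}.jpg")
```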

## Evaluation
The batch evaluation code for multiple GPUs on the adopted multi-modal benchmarks will be released in the coming days.

## Acknowledgement
We are grateful to the following awesome projects, which we drew on when implementing LaVIT:
* [LLaMA](https://github.com/facebookresearch/llama): Open and Efficient Foundation Language Models
* [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2): Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
* [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP): Improved Training Techniques for CLIP at Scale
* [BEIT](https://github.com/microsoft/unilm/tree/master/beit2): Masked Image Modeling with Vector-Quantized Visual Tokenizers
* [Diffusers](https://github.com/huggingface/diffusers): State-of-the-art diffusion models for image and audio generation in PyTorch

## <a name="Citing"></a>Citation
Consider giving this repository a star and citing LaVIT in your publications if it helps your research.

```
@article{jin2023unified,
  title={Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization},
  author={Jin, Yang and Xu, Kun and Xu, Kun and Chen, Liwei and Liao, Chao and Tan, Jianchao and Mu, Yadong and others},
  journal={arXiv preprint arXiv:2309.04669},
  year={2023}
}
```