Vision-CAIR committed on
Commit
22734d5
·
1 Parent(s): f8621f1

Update README.md

Files changed (1)
  1. README.md +13 -139
README.md CHANGED
@@ -1,139 +1,13 @@
- # MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
- [Deyao Zhu](https://tsutikgiau.github.io/)* (On Job Market!), [Jun Chen](https://junchen14.github.io/)* (On Job Market!), Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. *Equal Contribution
-
- **King Abdullah University of Science and Technology**
-
- [[Project Website]](https://minigpt-4.github.io/) [[Paper]](MiniGPT_4.pdf) [Online Demo]
-
-
- ## Online Demo
-
- Chat with MiniGPT-4 around your images
-
-
- ## Examples
- | | |
- :-------------------------:|:-------------------------:
- ![find wild](examples/wop_2.png) | ![write story](examples/ad_2.png)
- ![solve problem](examples/fix_1.png) | ![write Poem](examples/rhyme_1.png)
-
-
-
-
-
- ## Abstract
- The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. We believe the primary reason for GPT-4's advanced multi-modal generation capabilities lies in the utilization of a more advanced large language model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer.
- Our findings reveal that MiniGPT-4 processes many capabilities similar to those exhibited by GPT-4 like detailed image description generation and website creation from hand-written drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, teaching users how to cook based on food photos, etc.
- These advanced capabilities can be attributed to the use of a more advanced large language model.
- Furthermore, our method is computationally efficient, as we only train a projection layer using roughly 5 million aligned image-text pairs and an additional 3,500 carefully curated high-quality pairs.
-
-
-
-
-
-
-
-
- ## Getting Started
- ### Installation
-
- 1. Prepare the code and the environment
-
- Git clone our repository, creating a python environment and ativate it via the following command
-
- ```bash
- git clone https://github.com/Vision-CAIR/MiniGPT-4.git
- cd MiniGPT-4
- conda env create -f environment.yml
- conda activate minigpt4
- ```
-
-
- 2. Prepare the pretrained Vicuna weights
-
- The current version of MiniGPT-4 is built on the v0 versoin of Vicuna-13B.
- Please refer to their instructions [here](https://huggingface.co/lmsys/vicuna-13b-delta-v0) to obtaining the weights.
- The final weights would be in a single folder with the following structure:
-
- ```
- vicuna_weights
- ├── config.json
- ├── generation_config.json
- ├── pytorch_model.bin.index.json
- ├── pytorch_model-00001-of-00003.bin
- ...
- ```
-
- Then, set the path to the vicuna weight in the model config file
- [here](minigpt4/configs/models/minigpt4.yaml#L21) at Line 21.
-
- 3. Prepare the pretrained MiniGPT-4 checkpoint
-
- To play with our pretrained model, download the pretrained checkpoint
- [here](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link).
- Then, set the path to the pretrained checkpoint in the evaluation config file
- in [eval_configs/minigpt4.yaml](eval_configs/minigpt4.yaml#L15) at Line 15.
-
-
-
-
-
- ### Launching Demo Locally
-
- Try out our demo [demo.py](app.py) with your images for on your local machine by running
-
- ```
- python demo.py --cfg-path eval_configs/minigpt4.yaml
- ```
-
-
-
-
-
- ### Training
- The training of MiniGPT-4 contains two-stage alignments.
- In the first stage, the model is trained using image-text pairs from Laion and CC datasets
- to align the vision and language model. To download and prepare the datasets, please check
- [here](dataset/readme.md).
- After the first stage, the visual features are mapped and can be understood by the language
- model.
- To launch the first stage training, run
-
- ```bash
- torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_config/minigpt4_stage1_laion.yaml
- ```
-
- In the second stage, we use a small high quality image-text pair dataset created by ourselves
- and convert it to a conversation format to further align MiniGPT-4.
- Our second stage dataset can be download from
- [here](https://drive.google.com/file/d/1RnS0mQJj8YU0E--sfH08scu5-ALxzLNj/view?usp=share_link).
- After the second stage alignment, MiniGPT-4 is able to talk about the image in
- a smooth way.
- To launch the second stage alignment, run
-
- ```bash
- torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_config/minigpt4_stage2_align.yaml
- ```
-
-
-
-
-
- ## Acknowledgement
-
- + [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)
- + [Vicuna](https://github.com/lm-sys/FastChat)
-
-
- If you're using MiniGPT-4 in your research or applications, please cite using this BibTeX:
- ```bibtex
- @misc{zhu2022minigpt4,
- title={MiniGPT-4: Enhancing the Vision-language Understanding with Advanced Large Language Models},
- author={Deyao Zhu and Jun Chen and Xiaoqian Shen and xiang Li and Mohamed Elhoseiny},
- year={2023},
- }
- ```
-
- ## License
- This repository is built on [Lavis](https://github.com/salesforce/LAVIS) with BSD 3-Clause License
- [BSD 3-Clause License](LICENSE.txt)
 
+ ---
+ title: MiniGPT
+ emoji: 🚀
+ colorFrom: purple
+ colorTo: gray
+ sdk: gradio
+ sdk_version: 3.17.0
+ app_file: app.py
+ pinned: false
+ license: other
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference