File size: 5,223 Bytes
181722d |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 |
# MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
[Deyao Zhu](https://tsutikgiau.github.io/)* (On Job Market!), [Jun Chen](https://junchen14.github.io/)* (On Job Market!), Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. *Equal Contribution
**King Abdullah University of Science and Technology**
[[Project Website]](https://minigpt-4.github.io/) [[Paper]](MiniGPT_4.pdf) [Online Demo]
## Online Demo
Chat with MiniGPT-4 around your images
## Examples
| | |
:-------------------------:|:-------------------------:
![find wild](examples/wop_2.png) | ![write story](examples/ad_2.png)
![solve problem](examples/fix_1.png) | ![write Poem](examples/rhyme_1.png)
## Abstract
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. We believe the primary reason for GPT-4's advanced multi-modal generation capabilities lies in the utilization of a more advanced large language model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer.
Our findings reveal that MiniGPT-4 processes many capabilities similar to those exhibited by GPT-4 like detailed image description generation and website creation from hand-written drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, teaching users how to cook based on food photos, etc.
These advanced capabilities can be attributed to the use of a more advanced large language model.
Furthermore, our method is computationally efficient, as we only train a projection layer using roughly 5 million aligned image-text pairs and an additional 3,500 carefully curated high-quality pairs.
## Getting Started
### Installation
1. Prepare the code and the environment
Git clone our repository, creating a python environment and ativate it via the following command
```bash
git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigpt4
```
2. Prepare the pretrained Vicuna weights
The current version of MiniGPT-4 is built on the v0 versoin of Vicuna-13B.
Please refer to their instructions [here](https://huggingface.co/lmsys/vicuna-13b-delta-v0) to obtaining the weights.
The final weights would be in a single folder with the following structure:
```
vicuna_weights
βββ config.json
βββ generation_config.json
βββ pytorch_model.bin.index.json
βββ pytorch_model-00001-of-00003.bin
...
```
Then, set the path to the vicuna weight in the model config file
[here](minigpt4/configs/models/minigpt4.yaml#L21) at Line 21.
3. Prepare the pretrained MiniGPT-4 checkpoint
To play with our pretrained model, download the pretrained checkpoint
[here](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link).
Then, set the path to the pretrained checkpoint in the evaluation config file
in [eval_configs/minigpt4.yaml](eval_configs/minigpt4.yaml#L15) at Line 15.
### Launching Demo Locally
Try out our demo [demo.py](app.py) with your images for on your local machine by running
```
python demo.py --cfg-path eval_configs/minigpt4.yaml
```
### Training
The training of MiniGPT-4 contains two-stage alignments.
In the first stage, the model is trained using image-text pairs from Laion and CC datasets
to align the vision and language model. To download and prepare the datasets, please check
[here](dataset/readme.md).
After the first stage, the visual features are mapped and can be understood by the language
model.
To launch the first stage training, run
```bash
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_config/minigpt4_stage1_laion.yaml
```
In the second stage, we use a small high quality image-text pair dataset created by ourselves
and convert it to a conversation format to further align MiniGPT-4.
Our second stage dataset can be download from
[here](https://drive.google.com/file/d/1RnS0mQJj8YU0E--sfH08scu5-ALxzLNj/view?usp=share_link).
After the second stage alignment, MiniGPT-4 is able to talk about the image in
a smooth way.
To launch the second stage alignment, run
```bash
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_config/minigpt4_stage2_align.yaml
```
## Acknowledgement
+ [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)
+ [Vicuna](https://github.com/lm-sys/FastChat)
If you're using MiniGPT-4 in your research or applications, please cite using this BibTeX:
```bibtex
@misc{zhu2022minigpt4,
title={MiniGPT-4: Enhancing the Vision-language Understanding with Advanced Large Language Models},
author={Deyao Zhu and Jun Chen and Xiaoqian Shen and xiang Li and Mohamed Elhoseiny},
year={2023},
}
```
## License
This repository is built on [Lavis](https://github.com/salesforce/LAVIS) with BSD 3-Clause License
[BSD 3-Clause License](LICENSE.txt)
|