MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models

Deyao Zhu* (On Job Market!), Jun Chen* (On Job Market!), Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. *Equal Contribution

King Abdullah University of Science and Technology

[Project Website] [Paper] [Online Demo]

Online Demo

Chat with MiniGPT-4 around your images

Examples

Abstract

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. We believe the primary reason for GPT-4's advanced multi-modal generation capabilities lies in the utilization of a more advanced large language model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer. Our findings reveal that MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4, such as detailed image description generation and website creation from handwritten drafts. We also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, and teaching users how to cook based on food photos. These advanced capabilities can be attributed to the use of a more advanced large language model. In addition, our method is computationally efficient, as we only train a projection layer using roughly 5 million aligned image-text pairs and an additional 3,500 carefully curated high-quality pairs.
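The idea of aligning two frozen models through a single trainable projection layer can be illustrated with a toy sketch. The version below is pure Python and is not the repository's implementation: the dimensions, the random initialization, the helper names `make_projection` and `project`, and the way the projected tokens are placed in front of the text embeddings are all assumptions made for illustration.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Create the weight matrix and bias of one linear layer (the only trainable part)."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def project(visual_tokens, weight, bias):
    """Map each visual feature vector into the language model's embedding space."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]
        for token in visual_tokens
    ]

# Toy sizes, chosen only for illustration (assumption, not the real model sizes).
VISION_DIM, LLM_DIM = 3, 4

# Stand-in for the frozen visual encoder's output: two visual tokens.
visual_tokens = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]
# Stand-in for the frozen LLM's embeddings of a three-token text prompt.
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]

weight, bias = make_projection(VISION_DIM, LLM_DIM)
projected = project(visual_tokens, weight, bias)

# Assumption: the LLM receives the projected visual tokens followed by the text.
llm_input = projected + text_embeddings
print(len(llm_input), len(llm_input[0]))
```

Only `weight` and `bias` would be updated during training in this picture; the encoder output and the text embeddings come from models whose parameters stay fixed.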

Getting Started

Installation

Prepare the code and the environment

Clone our repository, create a Python environment, and activate it with the following commands:

git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigpt4

Prepare the pretrained Vicuna weights

The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B. Please refer to their instructions here to obtain the weights. The final weights should be in a single folder with the following structure:

vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...

Then, set the path to the Vicuna weights in the model config file here at Line 21.
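Before pointing the config at the folder, it can help to confirm that the files from the listing are actually present. The following is a minimal sketch of such a check, not part of the repository. The helper name `missing_weight_files` is hypothetical, and the code assumes the index file follows the Hugging Face sharded-checkpoint layout, in which a `weight_map` entry maps each parameter name to its shard file.

```python
import json
from pathlib import Path

# Files named in the folder listing; shard files are discovered from the index.
REQUIRED_FILES = ["config.json", "generation_config.json", "pytorch_model.bin.index.json"]

def missing_weight_files(weights_dir):
    """Return the names of expected files that are absent from weights_dir."""
    root = Path(weights_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    index_path = root / "pytorch_model.bin.index.json"
    if index_path.is_file():
        # Assumption: "weight_map" maps each parameter name to its shard file.
        weight_map = json.loads(index_path.read_text()).get("weight_map", {})
        for shard in sorted(set(weight_map.values())):
            if not (root / shard).is_file():
                missing.append(shard)
    return missing
```

Calling `missing_weight_files("vicuna_weights")` returns an empty list when every expected file is found, and otherwise lists the file names that still need to be downloaded or converted.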

Prepare the pretrained MiniGPT-4 checkpoint

To try our pretrained model, download the pretrained checkpoint here. Then, set the path to the pretrained checkpoint in the evaluation config file eval_configs/minigpt4.yaml at Line 15.
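Because the instructions refer to the config entry by line number, a quick way to confirm the edit landed in the right place is to print that line back. The sketch below does only that. The helper names `config_line` and `print_checkpoint_line` are hypothetical, and the code reads the file as plain text rather than parsing YAML, so it makes no assumption about the key names inside the config.

```python
from pathlib import Path

def config_line(config_path, line_number):
    """Return the 1-indexed line of a text file, without its trailing newline."""
    lines = Path(config_path).read_text().splitlines()
    if not 1 <= line_number <= len(lines):
        raise IndexError(f"{config_path} has {len(lines)} lines, no line {line_number}")
    return lines[line_number - 1]

def print_checkpoint_line(config_path="eval_configs/minigpt4.yaml", line_number=15):
    """Print the line the instructions point at, so the edited path can be checked by eye."""
    print(f"{config_path}:{line_number}: {config_line(config_path, line_number)}")
```

Running `print_checkpoint_line()` from the repository root should show the checkpoint path you just set. If it shows something else, the file may have changed since these instructions were written.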

Launching Demo Locally

Try out our demo, demo.py, with your own images on your local machine by running

python demo.py --cfg-path eval_configs/minigpt4.yaml

Training

The training of MiniGPT-4 consists of two alignment stages. In the first stage, the model is trained on image-text pairs from the LAION and CC datasets to align the vision and language models. To download and prepare the datasets, please check here. After the first stage, the visual features are mapped into a form that the language model can understand. To launch the first stage training, run

torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_config/minigpt4_stage1_laion.yaml

In the second stage, we use a small, high-quality image-text pair dataset that we created ourselves and convert it to a conversation format to further align MiniGPT-4. Our second stage dataset can be downloaded from here. After the second stage alignment, MiniGPT-4 is able to talk about the image fluently. To launch the second stage alignment, run

torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_config/minigpt4_stage2_align.yaml
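The two launch commands differ only in the config file, and NUM_GPU has to be replaced with an actual number. As a convenience, the sketch below assembles either command from a stage number and a GPU count. The helper name `build_train_command` is hypothetical and not part of the repository. The config paths are copied verbatim from the commands above.

```python
import shlex

# Config paths exactly as they appear in the two training commands.
STAGE_CONFIGS = {
    1: "train_config/minigpt4_stage1_laion.yaml",
    2: "train_config/minigpt4_stage2_align.yaml",
}

def build_train_command(stage, num_gpus):
    """Return the torchrun argument list for the given training stage."""
    if stage not in STAGE_CONFIGS:
        raise ValueError(f"stage must be 1 or 2, got {stage!r}")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "torchrun", "--nproc-per-node", str(num_gpus),
        "train.py", "--cfg-path", STAGE_CONFIGS[stage],
    ]

# Show the first-stage command for a machine with 4 GPUs.
print(shlex.join(build_train_command(stage=1, num_gpus=4)))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues.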

Acknowledgement

BLIP2
Vicuna

If you're using MiniGPT-4 in your research or applications, please cite using this BibTeX:

@misc{zhu2022minigpt4,
      title={MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models},
      author={Deyao Zhu and Jun Chen and Xiaoqian Shen and Xiang Li and Mohamed Elhoseiny},
      year={2023},
}

License

This repository is built on Lavis, which is released under the BSD 3-Clause License.