MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
Deyao Zhu* (On Job Market!), Jun Chen* (On Job Market!), Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. *Equal Contribution
King Abdullah University of Science and Technology
[Project Website] [Paper] [Online Demo]
Online Demo
Chat with MiniGPT-4 around your images
Examples
Abstract
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. We believe the primary reason for GPT-4's advanced multi-modal generation capabilities lies in the utilization of a more advanced large language model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer. Our findings reveal that MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4, such as detailed image description generation and website creation from handwritten drafts. We also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, and teaching users how to cook based on food photos. These advanced capabilities can be attributed to the use of a more advanced large language model. Furthermore, our method is computationally efficient, as we only train a projection layer using roughly 5 million aligned image-text pairs and an additional 3,500 carefully curated high-quality pairs.
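The alignment idea described in the abstract can be illustrated with a toy, pure-Python sketch: a frozen component produces visual features, and a single trainable linear projection maps them to the input size expected by the language model. The names `frozen_visual_encoder` and `LinearProjection` and the tiny dimensions are hypothetical stand-ins chosen for illustration; this is not the repository's implementation, which relies on real pretrained models.

```python
import random


def frozen_visual_encoder(pixels, feature_dim):
    """Hypothetical stand-in for a frozen visual encoder.

    It has no trainable state: each feature is the mean of a strided
    slice of the input, so its output never changes during training.
    """
    features = []
    for i in range(feature_dim):
        chunk = pixels[i::feature_dim]
        features.append(sum(chunk) / len(chunk))
    return features


class LinearProjection:
    """The only trainable part in this sketch: y = W x + b."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def __call__(self, features):
        return [
            sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(self.weight, self.bias)
        ]

    def num_parameters(self):
        return len(self.weight) * len(self.weight[0]) + len(self.bias)


# Toy sizes (assumptions for illustration, not the real model dimensions).
VISUAL_DIM, LLM_DIM = 4, 6

projection = LinearProjection(VISUAL_DIM, LLM_DIM)
pixels = [i / 16 for i in range(16)]
visual_features = frozen_visual_encoder(pixels, VISUAL_DIM)
llm_inputs = projection(visual_features)

print(len(llm_inputs), projection.num_parameters())
```

In this sketch the projection is the only object with parameters, which mirrors why training only that layer is cheap compared with updating the encoder or the LLM.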
Getting Started
Installation
- Prepare the code and the environment
Clone our repository, then create and activate a Python environment with the following commands:
git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigpt4
- Prepare the pretrained Vicuna weights
The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B. Please refer to their instructions here to obtain the weights. The final weights should be in a single folder with the following structure:
vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...
Then, set the path to the Vicuna weights in the model config file here at Line 21.
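Before editing the config, it can help to confirm that the weights folder matches the structure listed above. The following is a minimal sketch, not part of the repository: the helper name `missing_weight_files` is hypothetical, and it assumes the index file follows the Hugging Face sharded-checkpoint layout, where a `weight_map` entry maps each parameter name to a shard file name.

```python
import json
from pathlib import Path

# Files named explicitly in the folder structure above.
REQUIRED_FILES = (
    "config.json",
    "generation_config.json",
    "pytorch_model.bin.index.json",
)


def missing_weight_files(weights_dir):
    """Return a sorted list of expected files that are absent from weights_dir."""
    root = Path(weights_dir)
    expected = set(REQUIRED_FILES)

    index_path = root / "pytorch_model.bin.index.json"
    if index_path.is_file():
        # Assumption: "weight_map" maps parameter names to shard file names.
        index = json.loads(index_path.read_text())
        expected.update(index.get("weight_map", {}).values())

    return sorted(name for name in expected if not (root / name).is_file())
```

Calling `missing_weight_files("vicuna_weights")` returns an empty list when every expected file is present, and otherwise lists the file names that still need to be downloaded or converted.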
- Prepare the pretrained MiniGPT-4 checkpoint
To play with our pretrained model, download the pretrained checkpoint here. Then, set the path to the pretrained checkpoint in the evaluation config file in eval_configs/minigpt4.yaml at Line 15.
Launching Demo Locally
Try out our demo demo.py with your images on your local machine by running
python demo.py --cfg-path eval_configs/minigpt4.yaml
Training
The training of MiniGPT-4 consists of two alignment stages. In the first stage, the model is trained on image-text pairs from the Laion and CC datasets to align the vision and language models. To download and prepare the datasets, please check here. After the first stage, the visual features are mapped into a form that the language model can understand. To launch the first stage training, run
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_config/minigpt4_stage1_laion.yaml
In the second stage, we use a small, high-quality image-text pair dataset that we created ourselves and convert it to a conversation format to further align MiniGPT-4. Our second stage dataset can be downloaded from here. After the second stage alignment, MiniGPT-4 is able to talk about the image fluently. To launch the second stage alignment, run
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_config/minigpt4_stage2_align.yaml
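The two launch commands differ only in the config file, and `NUM_GPU` is a placeholder to be replaced with the number of GPUs on your machine. As a small sketch of that substitution, the hypothetical helper `build_train_command` below assembles the argument list for either stage using the config paths quoted above; it only builds the command and does not start training.

```python
import shlex

# Config paths exactly as given in the training commands above.
STAGE_CONFIGS = {
    1: "train_config/minigpt4_stage1_laion.yaml",
    2: "train_config/minigpt4_stage2_align.yaml",
}


def build_train_command(stage, num_gpu):
    """Return the torchrun argument list for the given stage and GPU count."""
    if stage not in STAGE_CONFIGS:
        raise ValueError(f"stage must be 1 or 2, got {stage!r}")
    if num_gpu < 1:
        raise ValueError(f"num_gpu must be at least 1, got {num_gpu!r}")
    return [
        "torchrun",
        "--nproc-per-node", str(num_gpu),
        "train.py",
        "--cfg-path", STAGE_CONFIGS[stage],
    ]


# Print the first stage command for a machine with 4 GPUs.
print(shlex.join(build_train_command(stage=1, num_gpu=4)))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root if you prefer to launch training from a Python script.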
Acknowledgement
If you're using MiniGPT-4 in your research or applications, please cite using this BibTeX:
@misc{zhu2022minigpt4,
title={MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models},
author={Deyao Zhu and Jun Chen and Xiaoqian Shen and Xiang Li and Mohamed Elhoseiny},
year={2023},
}
License
This repository is built on Lavis, which is released under the BSD 3-Clause License.