MiniGPTv2 Project
Overview
MiniGPTv2 is a multimodal large language model that combines vision and language capabilities. This repository contains the implementation of MiniGPTv2 fine-tuned on facial emotion recognition and detailed image understanding tasks.
Model Architecture

- Base Architecture: MiniGPTv2
- LLM Backbone: Llama-2-7b-chat
- Image Size: 448×448
- Max Text Length: 3072 tokens
- LoRA Configuration: r=64, alpha=16
- Gradient Checkpointing: Enabled for the vision encoder, disabled for the LLM

Training Configuration

- Training Checkpoint: Epoch 88 (56,320 steps)
- Steps per Epoch: 640
- Batch Size: 1 (with gradient accumulation)
- Gradient Accumulation Steps: 16
- Learning Rate:
  - Initial: 3e-5
  - Minimum: 1e-6
  - Warmup: 1e-6
- LR Schedule: Linear warmup with cosine decay
- Warmup Steps: 1000
- Weight Decay: 0.05
- Mixed Precision Training: Enabled

Dataset Composition

The model was trained on a mixture of datasets with the following sampling ratios:

- ShareGPT Detail: 30% (general visual conversation data)
- GPT4Vision Face Detail: 10% (facial analysis and description data)
- Realistic Emotions Detail: 20% (emotion recognition and interpretation data)

Usage

Requirements

```text
torch>=2.0.0
transformers>=4.28.0
timm
fairscale
accelerate
```

Loading the Model
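
Before loading the model, it may help to confirm that the environment meets the minimum versions listed under Requirements. The following is a minimal sketch using only the standard library. The helper names (`version_tuple`, `meets_minimum`, `check_environment`) are hypothetical and not part of this repository, and the version parsing is deliberately simple: it ignores local suffixes such as `+cu118` and treats pre-releases like the corresponding release.

```python
from importlib import metadata

# Minimum versions copied from the Requirements list.
MINIMUM_VERSIONS = {"torch": "2.0.0", "transformers": "4.28.0"}
# Packages listed without a version constraint.
UNPINNED = ["timm", "fairscale", "accelerate"]


def version_tuple(version, width=3):
    """Turn a string like '2.1.0+cu118' into (2, 1, 0), zero-padded to `width`."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev2023'
        parts.append(int(digits))
    while len(parts) < width:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum` (numeric parts only)."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment():
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for name, minimum in MINIMUM_VERSIONS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (need >= {minimum})")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    for name in UNPINNED:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running the script prints one line per missing or outdated package and nothing when every requirement is satisfied.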

```python
from minigptv2.model import MiniGPTv2

# Initialize the model
model = MiniGPTv2.from_pretrained(
    llama_model_path="/path/to/Llama-2-7b-chat-hf",
    checkpoint_path="/path/to/minigptv2_checkpoint.pth",
    image_size=448,
    max_txt_len=3072,
)

# Set to evaluation mode
model.eval()
```

Inference

```python
from PIL import Image
import torch

# Load image
image = Image.open("example.jpg").convert("RGB")

# Process input
response = model.generate(
    image=image,
    prompt="What emotions is this person expressing?",
    max_new_tokens=512,
)

print(response)
```

Training

To continue training from the epoch 88 checkpoint:

```bash
python train.py --config /path/to/config.yaml --resume_ckpt_path /path/to/epoch88_checkpoint.pth
```

Evaluation

```bash
python evaluate.py --config /path/to/eval_config.yaml --checkpoint /path/to/epoch88_checkpoint.pth
```

License

[Specify license information]

Citation

```text
[Citation information for MiniGPTv2 and any relevant papers]
```

Acknowledgements

This project builds upon the MiniGPTv2 architecture and uses the Llama-2-7b-chat model. We thank the original authors for their contributions to the field.
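
As a cross-check of the numbers in the Training Configuration and Dataset Composition sections, the sketch below recomputes the quantities they imply. It is an illustration under stated assumptions, not the training code from this repository: the "Initial" learning rate is assumed to be the peak reached at the end of warmup, the schedule is a generic linear warmup followed by cosine decay, the LoRA scale uses the standard alpha / r convention, and `HYPOTHETICAL_TOTAL_STEPS` is a made-up run length, since the total number of training steps is not stated here. The listed sampling ratios add up to 60%, so the sketch treats them as relative weights and normalizes them.

```python
import math

# Values copied from the Training Configuration section.
PEAK_LR = 3e-5           # "Initial" LR, assumed to be the post-warmup peak
MIN_LR = 1e-6
WARMUP_START_LR = 1e-6
WARMUP_STEPS = 1000
STEPS_PER_EPOCH = 640
PER_STEP_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 16
LORA_R, LORA_ALPHA = 64, 16

# Sampling ratios from Dataset Composition, in percent.
SAMPLING_PERCENT = {
    "ShareGPT Detail": 30,
    "GPT4Vision Face Detail": 10,
    "Realistic Emotions Detail": 20,
}

# Assumption: the real run length is not given, so 100 epochs is a placeholder.
HYPOTHETICAL_TOTAL_STEPS = 100 * STEPS_PER_EPOCH


def lr_at(step, total_steps):
    """Generic linear warmup followed by cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        frac = step / WARMUP_STEPS
        return WARMUP_START_LR + (PEAK_LR - WARMUP_START_LR) * frac
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


def normalized_weights(percentages):
    """Rescale relative sampling ratios so that they sum to 1."""
    total = sum(percentages.values())
    return {name: value / total for name, value in percentages.items()}


checkpoint_step = 88 * STEPS_PER_EPOCH                    # 56,320, as stated
effective_batch = PER_STEP_BATCH_SIZE * GRAD_ACCUM_STEPS  # 16
lora_scale = LORA_ALPHA / LORA_R                          # 0.25

if __name__ == "__main__":
    print("checkpoint step:", checkpoint_step)
    print("effective batch size:", effective_batch)
    print("LoRA scale (alpha / r):", lora_scale)
    print("LR at checkpoint (hypothetical run length):",
          lr_at(checkpoint_step, HYPOTHETICAL_TOTAL_STEPS))
    print("normalized sampling weights:", normalized_weights(SAMPLING_PERCENT))
```

The step count confirms that epoch 88 at 640 steps per epoch matches the stated 56,320 steps. The learning rate printed for the checkpoint depends entirely on the assumed run length, so it should be read as an example of how to evaluate the schedule rather than the value actually used.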