MiniGPTv2 Project

Overview

MiniGPTv2 is a multimodal large language model that combines vision and language capabilities. This repository contains the implementation of MiniGPTv2 fine-tuned on facial emotion recognition and detailed image understanding tasks.

Model Architecture

- Base Architecture: MiniGPTv2
- LLM Backbone: Llama-2-7b-chat
- Image Size: 448×448
- Max Text Length: 3072 tokens
- LoRA Configuration: r=64, alpha=16
- Gradient Checkpointing: enabled for the vision encoder, disabled for the LLM

Training Configuration

- Training Checkpoint: epoch 88 (56,320 steps)
- Steps per Epoch: 640
- Batch Size: 1 (with gradient accumulation)
- Gradient Accumulation Steps: 16 (effective batch size: 1 × 16 = 16)
- Learning Rate:
  - Initial: 3e-5
  - Minimum: 1e-6
  - Warmup start: 1e-6
- LR Schedule: linear warmup with cosine decay
- Warmup Steps: 1000
- Weight Decay: 0.05
- Mixed Precision Training: enabled

Dataset Composition

The model was trained on a mixture of datasets with the following sampling ratios:

- ShareGPT Detail: 30% (general visual conversation data)
- GPT4Vision Face Detail: 10% (facial analysis and description data)
- Realistic Emotions Detail: 20% (emotion recognition and interpretation data)

Usage

Requirements

```
torch>=2.0.0
transformers>=4.28.0
timm
fairscale
accelerate
```

Loading the Model

```python
from minigptv2.model import MiniGPTv2

# Initialize the model
model = MiniGPTv2.from_pretrained(
    llama_model_path="/path/to/Llama-2-7b-chat-hf",
    checkpoint_path="/path/to/minigptv2_checkpoint.pth",
    image_size=448,
    max_txt_len=3072,
)

# Set to evaluation mode
model.eval()
```

Inference

```python
from PIL import Image
import torch

# Load image
image = Image.open("example.jpg").convert("RGB")

# Process input and generate a response (`model` comes from the loading step;
# gradients are not needed at inference time)
with torch.no_grad():
    response = model.generate(
        image=image,
        prompt="What emotions is this person expressing?",
        max_new_tokens=512,
    )

print(response)
```

Training

To continue training from the epoch 88 checkpoint:

```bash
python train.py --config /path/to/config.yaml --resume_ckpt_path /path/to/epoch88_checkpoint.pth
```

Evaluation

```bash
python evaluate.py --config /path/to/eval_config.yaml --checkpoint /path/to/epoch88_checkpoint.pth
```

License

[Specify license information]

Citation

[Citation information for MiniGPTv2 and any relevant papers]

Acknowledgements

This project builds upon the MiniGPTv2 architecture and utilizes the Llama-2-7b-chat model. We thank the original authors for their contributions to the field.
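The learning-rate entries under Training Configuration describe a linear warmup followed by cosine decay. The following is a minimal Python sketch of that shape. It assumes the warmup starts at the 1e-6 warmup value and rises linearly to the 3e-5 initial rate over 1000 steps, then decays per step toward the 1e-6 minimum. The README does not state the total schedule length, so `ILLUSTRATIVE_TOTAL_STEPS` is a placeholder. `lr_at` is a hypothetical helper, not a function from this repository, and the actual training code may apply the cosine decay per epoch rather than per step.

```python
import math

# Hyperparameters as listed under Training Configuration.
INIT_LR = 3e-5      # peak learning rate, reached at the end of warmup
MIN_LR = 1e-6       # floor of the cosine decay
WARMUP_LR = 1e-6    # learning rate at step 0
WARMUP_STEPS = 1000

# Assumption: the total schedule length is not documented. For illustration,
# the decay is stretched over 88 epochs x 640 steps, the step count of the
# epoch-88 checkpoint, which is not necessarily the length of the real run.
ILLUSTRATIVE_TOTAL_STEPS = 88 * 640  # 56,320


def lr_at(step: int, total_steps: int = ILLUSTRATIVE_TOTAL_STEPS) -> float:
    """Hypothetical helper: learning rate at a given training step.

    Linear warmup from WARMUP_LR to INIT_LR over WARMUP_STEPS steps, then a
    per-step cosine decay from INIT_LR down to MIN_LR at `total_steps`.
    """
    if step < WARMUP_STEPS:
        return WARMUP_LR + (INIT_LR - WARMUP_LR) * step / WARMUP_STEPS
    # Fraction of the post-warmup phase completed, clamped to at most 1.
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (INIT_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 500, 1000, 20_000, 40_000, ILLUSTRATIVE_TOTAL_STEPS):
        print(f"step {s:>6}: lr = {lr_at(s):.2e}")
```

Clamping `progress` keeps the rate at the minimum if training runs past the assumed schedule length instead of letting the cosine climb back up.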

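The three ratios listed under Dataset Composition add up to 60%. Whether other datasets make up the remainder is not stated. If these three are the whole mixture, one reading is that the values act as relative weights, giving normalized shares of 1/2, 1/6 and 1/3. The following minimal sketch shows weighted per-iteration dataset selection under that relative-weight assumption, using only the Python standard library. `normalized_weights` and `sample_sources` are hypothetical helpers, not part of this repository.

```python
import random
from collections import Counter

# Sampling ratios as listed under Dataset Composition.
SAMPLE_RATIOS = {
    "ShareGPT Detail": 30,
    "GPT4Vision Face Detail": 10,
    "Realistic Emotions Detail": 20,
}


def normalized_weights(ratios: dict[str, float]) -> dict[str, float]:
    """Rescale ratios so they sum to 1 (assumes they are relative weights)."""
    total = sum(ratios.values())
    return {name: value / total for name, value in ratios.items()}


def sample_sources(ratios: dict[str, float], n: int, seed: int = 0) -> list[str]:
    """Hypothetical helper: choose a source dataset for each of n iterations."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(ratios)
    # random.choices accepts unnormalized weights, so the raw ratios work directly.
    return rng.choices(names, weights=[ratios[name] for name in names], k=n)


if __name__ == "__main__":
    print(normalized_weights(SAMPLE_RATIOS))
    n = 6000
    counts = Counter(sample_sources(SAMPLE_RATIOS, n))
    for name, count in counts.most_common():
        print(f"{name}: {count / n:.1%}")
```

Picking the source dataset per iteration, rather than concatenating the datasets, keeps the mixture proportions tied to the weights instead of to each dataset's size.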
Repository: ValerianFourel/FaceVLM