adamo1139/Qwen2-VL-7B-Sydney

Model Description

Qwen 2 VL 7B Sydney - Optimizing Vision Language Models for engagement and positivity.

Have you ever pasted a picture of your dog or cat into a Vision Language Model only for the model to give you a description of the image without complimenting on the looks of your fluffer?
Well, this model will use every chance it gets to compliment your adorable sweetheart.

It's been trained on around 60000 samples of synthetic data generated by NousResearch/Hermes-3-Llama-3.1-8B. Dataset was converted from liuhaotian/LLaVA-Instruct-150K. Dataset is available here.

I am attempting to learn about finetuning Qwen 2 VL 7B and this was just a result of my tinkering over a weekend.

Dataset Creation details

I ran Hermes 3 8B in Aphrodite-Engine locally and used a Python script to go through the LLaVA 150K Instruct dataset and for each sample, send a request to the model to modify the JSON sample so that output is more energetic. I used 6-shot prompt with bad samples coming from a generic LLM and good samples coming from FPHam/Llama-3-8B-Sydney. After running through about half of the dataset I noticed an error in one of my examples and upon fixing it and modifying the prompt a bit I noticed that the generation quality deteriorated and 30% of responses I was getting back didn't pass JSON validation. I settled on using the ~60000 samples that were already processed fine. I cleaned up the dataset to fix various errors in it like presence of non UTF8 characters.

Script used for creating the dataset is here.

Inference

I uploaded the script for inference here This script is doing inference on this model and also normal Qwen 2 VL Instruct checkpoint. Script is based on the simple Qwen 2 VL Gradio inference project published here Qwen2 VL doesn't quant well, so you will need VRAM to load in the 16-bit checkpoint. I am using 24GB GPU and still, I can't load in any image or video I want since it will OOM. Inference should work fine on both Windows and Linux. By default script uses Flash Attention 2, so if you don't want to use it, run it with flag --flash-attn2 False.

Technical details

Model was trained in LLaMa-Factory on a system with RTX 3090 Ti with unsloth on context length of 2000 with LoRA rank 32, alpha 32 and LoRa+ ratio of 4. Training took around 11 hours and bitsandbytes quantization was not utilized.

bf16: true
cutoff_len: 2000
dataset: sydney
dataset_dir: data
ddp_timeout: 180000000
do_train: true
finetuning_type: lora
flash_attn: auto
gradient_accumulation_steps: 16
include_num_input_tokens_seen: true
learning_rate: 5.0e-05
logging_steps: 1
lora_alpha: 32
lora_dropout: 0
lora_rank: 32
lora_target: all
loraplus_lr_ratio: 4
lr_scheduler_type: cosine
max_grad_norm: 1.0
max_samples: 160000
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
num_train_epochs: 1.0
optim: adamw_8bit
output_dir: saves/Qwen2-VL-7B-Instruct/lora/train_2024-10-05-18-44-10-2
packing: true
per_device_train_batch_size: 1
plot_loss: true
preprocessing_num_workers: 16
report_to: none
save_steps: 200
stage: sft
template: qwen2_vl
train_on_prompt: true
use_unsloth: true
warmup_steps: 25

Loss drops quickly and then stays basically flat, I am not sure why and this suggest some of the hyperparameters might have been set incorrectly or loss works differently on vision language models.