# Best Practices for Rapidly Training Vision-Language (VL) Models
This document provides best practices for quickly training vision-language (VL) models from scratch.
Model Links
- [Qwen2.5-VL-7B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)
- [Qwen3-8B](https://www.modelscope.cn/models/Qwen/Qwen3-8B)
Trained Model Link
- [Simple-VL-8B](https://www.modelscope.cn/models/swift/Simple-VL-8B/summary)
The training workflow builds upon the Qwen2.5-VL-7B-Instruct model architecture by replacing its internal large language model (LLM) component with the weights from Qwen3-8B, thereby enhancing the model's visual understanding capabilities. The process involves the following steps:
1. Modify the original model's configuration file `config.json` to align with Qwen3-8B.
2. Initialize and load the new model weights, saving them as a new model.
3. Fine-tune the new model in two stages:
   1. Stage 1: Train only the vision-to-language alignment module (aligner), freezing the ViT and LLM components.
   2. Stage 2: Unfreeze all modules and perform joint fine-tuning to improve overall performance.
## Model Modification
### Config File (config.json) Update
Due to structural differences between Qwen2.5-7B-Instruct (the LLM inside Qwen2.5-VL-7B-Instruct) and Qwen3-8B (e.g., number of layers, hidden dimensions), create a new `config.json` based on the Qwen2.5-VL-7B-Instruct config and update the following parameters to match Qwen3-8B (a scripted version of this step is sketched after the list):
```
Modified parameters:
1. hidden_size: 3584 -> 4096
2. intermediate_size: 18944 -> 12288
3. num_attention_heads: 28 -> 32
4. num_key_value_heads: 4 -> 8
5. num_hidden_layers: 28 -> 36
6. vocab_size: 152064 -> 151936
7. max_window_layers: 28 -> 36
8. out_hidden_size: 3584 -> 4096
Newly added parameter:
1. head_dim: 128
```
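If you prefer to script this step, here is a minimal sketch that applies the same changes with plain `json`. The paths are placeholders for your local copy of the original config and the new config directory referenced in the next section.

```python
import json

# Start from the original Qwen2.5-VL-7B-Instruct config (path is a placeholder).
with open("/path/to/Qwen2.5-VL-7B-Instruct/config.json") as f:
    config = json.load(f)

# Text-model parameters updated to match Qwen3-8B.
config.update({
    "hidden_size": 4096,
    "intermediate_size": 12288,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "num_hidden_layers": 36,
    "vocab_size": 151936,
    "max_window_layers": 36,
    "head_dim": 128,  # newly added parameter
})
# The aligner's output width lives under vision_config and must match the new LLM.
config["vision_config"]["out_hidden_size"] = 4096

with open("/path/to/new_config_dir/config.json", "w") as f:
    json.dump(config, f, indent=2)
```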
### Model Weight Initialization and Replacement
Use the following Python script to initialize, replace, and save the model weights:
```python
import torch
from modelscope import Qwen2_5_VLForConditionalGeneration, AutoModelForCausalLM, AutoConfig
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import Qwen2_5_VLPatchMerger, Qwen2_5_VLModel
from accelerate import Accelerator

# Load the original VL model and the Qwen3-8B model
qwen2_5_vl_7b_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
device = qwen2_5_vl_7b_model.device
qwen3_8b_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    device_map=device,
    torch_dtype=torch.bfloat16,
)

# Load configurations
old_config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
new_config = AutoConfig.from_pretrained("/path/to/new_config_dir")  # path to the new config directory
new_visual_config = new_config.vision_config

# Replace the merger (aligner) layer so its output width matches the new LLM
new_merger = Qwen2_5_VLPatchMerger(
    dim=new_visual_config.out_hidden_size,
    context_dim=new_visual_config.hidden_size,
    spatial_merge_size=new_visual_config.spatial_merge_size,
).to(device).to(torch.bfloat16)
qwen2_5_vl_7b_model.visual.merger = new_merger

# Replace the LLM part of the VL model: build a fresh text model from the new
# config, then copy over every matching weight from Qwen3-8B
new_llm_model = Qwen2_5_VLModel(new_config).to(device).to(torch.bfloat16)
for name, param in qwen3_8b_model.model.named_parameters():
    if name in new_llm_model.state_dict():
        new_llm_model.state_dict()[name].copy_(param)
qwen2_5_vl_7b_model.model = new_llm_model
qwen2_5_vl_7b_model.lm_head = qwen3_8b_model.lm_head

# Save the modified model as sharded safetensors
accelerator = Accelerator()
accelerator.save_model(
    model=qwen2_5_vl_7b_model,
    save_directory="/path/to/save/Qwen3-VL-Model",
    max_shard_size="4GB",
    safe_serialization=True,
)
```
After saving the weights, copy all files from the original Qwen2.5-VL-7B-Instruct model folder to the new model folder, except for the weight files and `model.safetensors.index.json`, then replace `config.json` with the newly modified version.
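That copy step can also be scripted. The sketch below assumes the standard Qwen2.5-VL snapshot layout (weight shards named `*.safetensors` plus tokenizer and processor files); the paths are placeholders.

```python
import shutil
from pathlib import Path

src = Path("/path/to/Qwen2.5-VL-7B-Instruct")
dst = Path("/path/to/save/Qwen3-VL-Model")

# Copy every regular file except the weight shards and their index.
for f in src.iterdir():
    if f.is_file() and f.suffix != ".safetensors" and f.name != "model.safetensors.index.json":
        shutil.copy2(f, dst / f.name)

# Overwrite config.json with the modified version created earlier.
shutil.copy2("/path/to/new_config_dir/config.json", dst / "config.json")
```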
## Training
To simplify the process, we skip pre-training and proceed directly to supervised fine-tuning (SFT). The training is divided into two stages:
### Stage 1: Train Aligner Layer
Train only the vision-to-language alignment module while freezing the ViT and LLM parts:
```bash
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
NPROC_PER_NODE=8 \
MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --model /path/to/new_vl_model \
    --model_type qwen2_5_vl \
    --train_type full \
    --dataset xxx \
    --split_dataset_ratio 0.01 \
    --torch_dtype bfloat16 \
    --attn_impl flash_attn \
    --freeze_vit true \
    --freeze_llm true \
    --freeze_aligner false \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --learning_rate 5e-6 \
    --gradient_accumulation_steps 8 \
    --eval_steps -1 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --logging_steps 5 \
    --max_length 8192 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 8 \
    --deepspeed zero2
```
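With this flag combination only the merger (aligner) receives gradient updates. If you want to eyeball how much of the model that is, here is a quick sketch that prints per-module parameter counts (module names follow the Qwen2.5-VL implementation in transformers; the checkpoint path is a placeholder):

```python
import torch
from modelscope import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "/path/to/new_vl_model", torch_dtype=torch.bfloat16
)
# visual.merger is the aligner trained in stage 1; everything else stays frozen.
for name in ("visual", "visual.merger", "model", "lm_head"):
    n = sum(p.numel() for p in model.get_submodule(name).parameters())
    print(f"{name}: {n / 1e6:.1f}M parameters")
```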
### Stage 2: Full Model Training
Unfreeze all modules and jointly train to enhance the model's visual understanding:
```bash
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
NPROC_PER_NODE=8 \
MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --model /path/to/stage1_checkpoint \
    --model_type qwen2_5_vl \
    --train_type full \
    --dataset xxx \
    --split_dataset_ratio 0.01 \
    --torch_dtype bfloat16 \
    --attn_impl flash_attn \
    --freeze_vit false \
    --freeze_llm false \
    --freeze_aligner false \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --learning_rate 5e-6 \
    --gradient_accumulation_steps 8 \
    --eval_steps -1 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --logging_steps 5 \
    --max_length 8192 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 8 \
    --deepspeed zero2
```
## Inference / Deployment / Evaluation
### Inference
Perform inference using `swift infer`:
```bash
swift infer \
    --model /path/to/stage2_checkpoint
```
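`swift infer` starts an interactive session. If you would rather call the checkpoint from transformers directly, here is a minimal sketch (the image path and prompt are placeholders; this assumes the processor files were copied into the checkpoint folder as described above):

```python
import torch
from PIL import Image
from modelscope import Qwen2_5_VLForConditionalGeneration, AutoProcessor

ckpt = "/path/to/stage2_checkpoint"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="cuda"
)
processor = AutoProcessor.from_pretrained(ckpt)

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image = Image.open("/path/to/example.jpg")
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding.
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```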
### Deployment
Accelerate model serving with vLLM:
```bash
CUDA_VISIBLE_DEVICES=0 \
MAX_PIXELS=1003520 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
swift deploy \
    --model /path/to/stage2_checkpoint \
    --infer_backend vllm \
    --vllm_gpu_memory_utilization 0.9 \
    --vllm_max_model_len 8192 \
    --max_new_tokens 2048 \
    --vllm_limit_mm_per_prompt '{"image": 5, "video": 2}' \
    --served_model_name Qwen3-VL
```
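The deployed service exposes an OpenAI-compatible API, so any OpenAI client can query it. A sketch (the image URL is a placeholder; the model name matches `--served_model_name` above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen3-VL",  # must match --served_model_name
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
        {"type": "text", "text": "What is in this image?"},
    ]}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
```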
### Evaluation
Evaluate the trained VL model using [EvalScope](https://github.com/modelscope/evalscope/).
The following example runs the MMMU benchmark against the deployed service:
```python
from evalscope import TaskConfig, run_task

task_cfg_dict = TaskConfig(
    work_dir='outputs',
    eval_backend='VLMEvalKit',
    eval_config={
        'data': ['MMMU_DEV_VAL'],
        'mode': 'all',
        'model': [
            {
                'api_base': 'http://localhost:8000/v1/chat/completions',
                'key': 'EMPTY',
                'name': 'CustomAPIModel',
                'temperature': 0.6,
                'type': 'Qwen3-VL',
                'img_size': -1,
                'video_llm': False,
                'max_tokens': 512,
            }
        ],
        'reuse': False,
        'nproc': 64,
        'judge': 'exact_matching',
    },
)
run_task(task_cfg=task_cfg_dict)
```
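Note that `api_base` points at the `swift deploy` service from the previous step, so keep that server running while the evaluation executes, and make sure `type` matches the `--served_model_name` you deployed with.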