---
license: apache-2.0
pipeline_tag: visual-question-answering
---
# Building Your Own Multimodal Large Model from Scratch
For the Chinese version of the README, please refer to 中文文档.
## Code Explanation
- Data Preprocessing: The relevant code is located in the `dataprocess` folder, with dataset-related code in the `dataset` folder. Data preprocessing mainly includes path merging, QA data concatenation, and token processing for feature insertion.
- LLM Model: We use Qwen-7B as the main model, with the relevant code in the `qwen` folder. By overriding the `forward` method of `QWenModel`, we inject multimodal features (see the sketch after this list).
- Vision Model: We use `CLIP_VIT` and `SIGLIP_VIT`, with the relevant code in the `visual` folder, which also includes other backbone networks.
- VLM Model: The relevant code is in the `model.py` file within the `model` folder.
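The feature-injection step mentioned in the LLM bullet can be pictured with the minimal sketch below. This is an illustration under assumptions, not the repo's actual `QWenModel.forward` override: the placeholder token id, tensor names, and shapes are made up for the example.

```python
import torch

# Hypothetical placeholder id for the image token; the real id comes from the
# tokenizer/config used in this repo.
IMAGE_TOKEN_ID = 151646


def inject_visual_features(input_ids: torch.LongTensor,
                           text_embeds: torch.Tensor,
                           visual_embeds: torch.Tensor) -> torch.Tensor:
    """Replace the embeddings at image-token positions with (already projected)
    visual features, leaving all text embeddings unchanged."""
    merged = text_embeds.clone()
    for b in range(input_ids.size(0)):
        positions = (input_ids[b] == IMAGE_TOKEN_ID).nonzero(as_tuple=True)[0]
        # One visual feature vector per image-token position.
        merged[b, positions] = visual_embeds[b, : positions.numel()]
    return merged


if __name__ == "__main__":
    # Toy check: batch of 1, sequence of 8 tokens, 4 of them image placeholders.
    ids = torch.tensor([[1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID,
                         IMAGE_TOKEN_ID, 5, 6, 7]])
    text_embeds = torch.randn(1, 8, 32)
    visual_embeds = torch.randn(1, 4, 32)
    print(inject_visual_features(ids, text_embeds, visual_embeds).shape)  # (1, 8, 32)
```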
## Datasets
We use multilingual datasets, primarily the COCO 2017 dataset and the AI Challenger Chinese image caption dataset:
- The COCO annotations use LLaVA's `detail_23k` and `complex_reasoning_77k`, which effectively enrich the model's descriptions.
- The AI Challenger dataset uses its original annotations together with a fixed prompt (see the example after this list).
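To make the two annotation styles concrete, here is a small illustrative example of how a LLaVA-style record and an AI Challenger caption can both be expressed as QA pairs. The field names, file names, and the fixed prompt wording are assumptions for illustration, not the repo's exact data schema.

```python
# Illustrative only: field names, file names, and the fixed prompt wording are
# assumptions, not the repo's exact data schema.
llava_record = {
    "image": "000000033471.jpg",  # hypothetical COCO image name
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe this image in detail."},
        {"from": "gpt", "value": "A city bus is stopped at the curb while ..."},
    ],
}

# Assumed fixed prompt ("please describe this image in detail") for AI Challenger.
FIXED_PROMPT_ZH = "<image>\n请详细描述这张图片。"


def ai_challenger_to_qa(image_name: str, caption: str) -> dict:
    """Wrap an original AI Challenger caption into the same QA format."""
    return {
        "image": image_name,
        "conversations": [
            {"from": "human", "value": FIXED_PROMPT_ZH},
            {"from": "gpt", "value": caption},
        ],
    }


sample = ai_challenger_to_qa("0a3f0.jpg", "一个穿红色上衣的人在街边骑自行车。")
```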
## Model Architecture
In the VLM, the vision part uses the `CLIP` or `SIGLIP` models, which already provide preliminary semantic alignment, and a two-layer MLP is used for feature mapping. By overriding the `forward` method of `QWenModel`, the corresponding image tokens are replaced with visual features.
If you wish to modify the model architecture, please change this part.
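As a reference point for such modifications, the two-layer MLP mapping can look like the sketch below. The hidden sizes (1024 for a CLIP ViT-L/14 style encoder, 4096 for Qwen-7B) are illustrative assumptions and the class name is made up; consult `model/model.py` for the actual implementation.

```python
import torch
import torch.nn as nn


class VisualProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder features into the LLM embedding
    space. The default sizes are illustrative; the real values depend on the
    chosen vision backbone and LLM."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim)
        return self.proj(image_features)


if __name__ == "__main__":
    feats = torch.randn(2, 256, 1024)
    print(VisualProjector()(feats).shape)  # torch.Size([2, 256, 4096])
```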
## How to Start Deployment
### Download the Relevant Data
| AI Challenger | COCO | complex_reasoning_77k.json | detail_23k.json |
| --- | --- | --- | --- |
| AI Challenger | COCO 2017 | complex_reasoning_77k.json | detail_23k.json |
Please store the datasets at the paths specified in the configuration file; the paths can be customized. Note that the paths must stay consistent with the `data/` directory so the model can read the data correctly.
After downloading the data, use `process_image.py` for preprocessing.
### Install the Runtime Environment
Use `pip install` to install the dependencies listed in `requirements.txt`:

`pip install -r requirements.txt`
### Start Training
Model training freezes the image model and trains the LLM with LoRA to reduce the training burden; the trainable parameters are the visual feature mapping layer and the LoRA parameters inside the LLM. Since the mapping layer is newly initialized, it is given a larger learning rate than the LoRA part so that the two sets of parameters are optimized at a comparable pace.
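A minimal sketch of this setup, assuming the Hugging Face `peft` library and made-up attribute names (`vision_encoder`, `llm`, `projector`) on the VLM object; the actual module names and hyperparameters live in the training script:

```python
import torch
from peft import LoraConfig, get_peft_model

# Typical LoRA setup; the target module name "c_attn" matches Qwen-7B's fused
# attention projection, but verify it against the actual model definition.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],
    task_type="CAUSAL_LM",
)


def build_optimizer(vlm, projector_lr: float = 1e-4, lora_lr: float = 1e-5):
    """Freeze the vision tower, wrap the LLM with LoRA, and give the freshly
    initialized projector a larger learning rate than the LoRA weights.
    `vision_encoder`, `llm`, and `projector` are assumed attribute names."""
    for p in vlm.vision_encoder.parameters():
        p.requires_grad = False                      # image model stays frozen
    vlm.llm = get_peft_model(vlm.llm, lora_cfg)      # only LoRA params are trainable

    param_groups = [
        {"params": list(vlm.projector.parameters()), "lr": projector_lr},
        {"params": [p for p in vlm.llm.parameters() if p.requires_grad], "lr": lora_lr},
    ]
    return torch.optim.AdamW(param_groups, weight_decay=0.01)
```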
Run `train.sh` in the root directory; the relevant parameters can be configured for experimentation.

`sh train.sh`
By following the above steps, you can start the training process and train the multimodal model.
The model weights will be saved in the `--output_dir` directory. This path can also be customized.
### Testing the Model
Run `test.sh` in the root directory; the relevant parameters can be configured for experimentation.

`sh test.sh`
The code will read images from the folder and perform Q&A.
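For orientation, the test loop can be pictured as the sketch below. The folder path, question text, and `model.answer(...)` helper are placeholders, not this repo's actual API; see `test.sh` and the test script for the real entry point.

```python
from pathlib import Path
from PIL import Image

# Assumed folder and question; the real values are set via test.sh / the test script.
IMAGE_DIR = Path("data/test_images")
QUESTION = "Describe this image."


def run_folder_qa(model, image_dir: Path = IMAGE_DIR, question: str = QUESTION) -> None:
    """Iterate over every image in the folder and ask the model one question."""
    for image_path in sorted(image_dir.glob("*.jpg")):
        image = Image.open(image_path).convert("RGB")
        answer = model.answer(image, question)  # hypothetical helper, not the repo's API
        print(f"{image_path.name}: {answer}")
```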
## References
Thanks to the following projects for their great work:
- https://github.com/WatchTower-Liu/VLM-learning/tree/main
- https://github.com/QwenLM/Qwen
- https://github.com/haotian-liu/LLaVA
## Contact
If you have any questions or ideas, feel free to contact me:
I will respond as soon as I see the email!