xinyanghuang committed
Commit 5d3ed66 • 1 Parent(s): 5417b9a

Update README.md

Files changed (1): README.md (+86 -0)
---
license: apache-2.0
pipeline_tag: visual-question-answering
---

# Building Your Own Multimodal Large Model from Scratch

For the Chinese version of the README, please refer to [中文文档](README_zh.md).

## Code Explanation 💻

- **Data Preprocessing**: The relevant code is located in the `dataprocess` folder, with dataset-related code in the `dataset` folder. Data preprocessing mainly covers path merging, QA data concatenation, and the token handling needed to insert image features (a sketch of this step follows this list).
- **LLM Model**: Qwen-7B is used as the main language model, with the relevant code in the `qwen` folder. Multimodal features are injected by overriding the `forward` method of `QWenModel`.
- **Vision Model**: `CLIP_VIT` and `SIGLIP_VIT` are used, with the relevant code in the `visual` folder, which also includes other backbone networks.
- **VLM Model**: The relevant code is in the `model.py` file within the `model` folder.

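To make the preprocessing step concrete, here is a minimal sketch of how QA turns could be concatenated and how placeholder tokens for the image features might be reserved in the prompt. The `<img>` marker, the number of reserved tokens, and the `from`/`value`/`<image>` conventions are illustrative assumptions, not the repository's exact code:

```python
# Minimal sketch of QA concatenation plus image-token reservation (assumptions noted above).

IMAGE_PLACEHOLDER = "<img>"   # assumed marker later swapped for visual features in forward()
NUM_IMAGE_TOKENS = 256        # assumed number of positions reserved for projected image features

def build_prompt(conversations: list[dict]) -> str:
    """Concatenate alternating human/assistant turns into a single training string."""
    parts = []
    for turn in conversations:
        role = "user" if turn["from"] == "human" else "assistant"
        # Expand the single <image> tag into a fixed-length slot for the visual features.
        text = turn["value"].replace("<image>", IMAGE_PLACEHOLDER * NUM_IMAGE_TOKENS)
        parts.append(f"{role}: {text}")
    return "\n".join(parts)
```
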
## Datasets 🌏

We use multilingual datasets, primarily the COCO2017 dataset and the AI Challenger image Chinese-caption dataset:

- The COCO annotations use LLaVA's `detail_23k` and `complex_reasoning_77k`, which effectively enrich the model's descriptions.
- The AI Challenger dataset uses its original annotations together with fixed prompts.

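For reference, below is a minimal sketch of loading both annotation sources into a common (image, question, answer) form. The LLaVA files use the `conversations` layout with `human`/`gpt` turns; the AI Challenger field names and the fixed Chinese prompt shown here are assumptions rather than the repository's exact preprocessing:

```python
import json

def load_llava(path: str) -> list[dict]:
    """Load LLaVA-style annotations such as detail_23k.json or complex_reasoning_77k.json."""
    samples = []
    for item in json.load(open(path, encoding="utf-8")):
        conv = item["conversations"]          # alternating human/gpt turns
        samples.append({
            "image": item["image"],           # COCO image file name
            "question": conv[0]["value"],
            "answer": conv[1]["value"],
        })
    return samples

def load_ai_challenger(path: str) -> list[dict]:
    """Pair each AI Challenger caption with a fixed prompt (field names are assumed)."""
    fixed_prompt = "请描述这张图片。"  # assumed fixed prompt: "Please describe this image."
    return [
        {"image": item["image_id"], "question": fixed_prompt, "answer": item["caption"]}
        for item in json.load(open(path, encoding="utf-8"))
    ]
```
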
## Model Architecture 🤖

In the VLM, the vision part uses the `CLIP` or `SIGLIP` models, which already provide a preliminary image-text semantic alignment, and a two-layer MLP maps the visual features into the LLM's embedding space. By overriding the `forward` method of `QWenModel`, the corresponding `image` placeholder tokens are replaced with these visual features.

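Below is a minimal PyTorch sketch of the two ideas in this paragraph: a two-layer MLP projector, and the replacement of image-placeholder positions with the projected features before the transformer blocks run. The dimensions, names, and the assumption that the prompt reserves exactly one placeholder per visual token are illustrative, not the repository's exact code:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder features to the LLM hidden size (dims are assumed)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.mlp(image_features)

def merge_image_features(inputs_embeds: torch.Tensor,
                         input_ids: torch.Tensor,
                         image_embeds: torch.Tensor,
                         image_token_id: int) -> torch.Tensor:
    """Swap the text embeddings at image-placeholder positions for projected visual features,
    mirroring what an overridden QWenModel.forward would do before its transformer blocks."""
    merged = inputs_embeds.clone()
    for b in range(input_ids.size(0)):
        positions = (input_ids[b] == image_token_id).nonzero(as_tuple=True)[0]
        merged[b, positions] = image_embeds[b, : positions.numel()].to(merged.dtype)
    return merged
```
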
If you wish to modify the model architecture, please change [this part](https://github.com/xinyanghuang7/Basic-Vision-Language-Model/blob/main/train.py#L41).

## How to Start Deployment 🔧

### Download the Relevant Data

| AI Challenger | COCO | complex_reasoning_77k.json | detail_23k.json |
| --- | --- | --- | --- |
| [AI Challenger](https://tianchi.aliyun.com/dataset/145781) | [COCO 2017](http://images.cocodataset.org/zips/train2017.zip) | [complex_reasoning_77k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/resolve/main/complex_reasoning_77k.json) | [detail_23k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/resolve/main/detail_23k.json) |

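If you prefer to script the small annotation downloads, a standard-library sketch is shown below; the target directory is an arbitrary choice, and the large COCO and AI Challenger archives are better fetched with a browser or download manager:

```python
import urllib.request
from pathlib import Path

# URLs copied from the table above; the local directory name is an arbitrary assumption.
ANNOTATIONS = {
    "complex_reasoning_77k.json": "https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/resolve/main/complex_reasoning_77k.json",
    "detail_23k.json": "https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/resolve/main/detail_23k.json",
}

target = Path("data/annotations")
target.mkdir(parents=True, exist_ok=True)
for name, url in ANNOTATIONS.items():
    urllib.request.urlretrieve(url, target / name)  # simple blocking download
    print(f"downloaded {name}")
```
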
Please store the datasets according to the paths specified in the [configuration file](https://github.com/xinyanghuang7/Basic-Vision-Language-Model/blob/main/dataprocess/config.yaml); the paths can, of course, be customized.

Note that these paths need to be consistent with the [data/](https://github.com/xinyanghuang7/Basic-Vision-Language-Model/blob/main/train.py#L29) directory referenced in `train.py` so that the model can read the data correctly.

After downloading the data, run `process_image.py` to preprocess it.

### Install the Runtime Environment

Use `pip install` to install the dependencies listed in `requirements.txt`:

```shell
pip install -r requirements.txt
```

### Start Training

Training freezes the image model and fine-tunes the LLM with LoRA to reduce the training load. The trainable parameters are the visual feature mapping layer and the LoRA parameters inside the LLM. Because the mapping layer consists of newly initialized, untrained parameters, it is given a larger learning rate than the LoRA part to balance how quickly the two sets of parameters are optimized.

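Here is a minimal sketch of this setup using the PEFT library and two optimizer parameter groups; the LoRA hyperparameters, the `c_attn` target-module name, and the learning rates are illustrative assumptions rather than the values configured in `train.sh`:

```python
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model

def build_trainables(llm: nn.Module, vision_encoder: nn.Module, projector: nn.Module):
    """Freeze the vision tower, wrap the LLM with LoRA, and give the fresh
    projector a larger learning rate than the LoRA parameters."""
    for p in vision_encoder.parameters():          # image model stays frozen
        p.requires_grad = False

    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["c_attn"],                 # assumed attention projection name in Qwen-7B
        task_type="CAUSAL_LM",
    )
    llm = get_peft_model(llm, lora_config)         # only the LoRA adapters remain trainable

    optimizer = torch.optim.AdamW([
        {"params": projector.parameters(), "lr": 1e-4},                            # untrained mapping layer
        {"params": [p for p in llm.parameters() if p.requires_grad], "lr": 1e-5},  # LoRA parameters
    ])
    return llm, optimizer
```

In practice these modules come from the repository's model classes; the point of the sketch is only that the projector's parameter group gets the higher learning rate.
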
Run `train.sh` in the root directory; the relevant parameters can be configured for your own experiments.

```shell
sh train.sh
```

Following the steps above starts the training process for the multimodal model.

The model weights will be saved in the `--output_dir` directory, which can also be customized.

### Testing the Model

Run `test.sh` in the root directory; the relevant parameters can be configured for your own experiments.

```shell
sh test.sh
```

The code reads images from the specified folder and performs question answering on them.

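Conceptually, the test loop does something like the following sketch; the folder path, the fixed question, and the `answer` entry point are hypothetical stand-ins for what `test.sh` actually wires up:

```python
from pathlib import Path
from PIL import Image

def answer(image: Image.Image, question: str) -> str:
    """Hypothetical stand-in for the trained VLM's generation call."""
    raise NotImplementedError("replace with the model call driven by test.sh")

image_dir = Path("data/test_images")  # assumed location of the test images
for path in sorted(image_dir.glob("*.jpg")):
    img = Image.open(path).convert("RGB")
    print(path.name, "->", answer(img, "Describe this image."))
```
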
## References 📚

Thanks to the following projects for their great work 🙌:

- https://github.com/WatchTower-Liu/VLM-learning/tree/main
- https://github.com/QwenLM/Qwen
- https://github.com/haotian-liu/LLaVA

## Contact ✉

If you have any questions or ideas, feel free to contact me 😊:

hsinyanghuang7@gmail.com

I will respond as soon as I see the email!