---
license: apache-2.0
pipeline_tag: visual-question-answering
---

# Building Your Own Multimodal Large Model from Scratch

For the Chinese version of the README, please refer to [δΈ­ζ–‡ζ–‡ζ‘£](README_zh.md).

## Model Architecture πŸ€–

In this VLM (Visual Language Model), the visual component uses the `CLIP` or `SIGLIP` models, which already provide preliminary vision-language semantic alignment. A two-layer MLP maps the visual features into the language model's embedding space. By overriding the `forward` method of `QWenModel`, the placeholder `image` tokens in the input sequence are replaced with the corresponding visual features.
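The sketch below illustrates the two pieces described above: a two-layer MLP projector and the embedding-splice step that a `forward` override would perform. It is a minimal illustration, not the repository's actual code; names such as `VisionProjector`, `splice_image_features`, and `image_token_id` are assumptions made for this example.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Hypothetical two-layer MLP mapping vision-encoder features
    (e.g. CLIP/SIGLIP patch embeddings) to the LLM hidden size."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim)
        return self.mlp(image_features)


def splice_image_features(
    input_embeds: torch.Tensor,  # (batch, seq_len, llm_dim) token embeddings
    input_ids: torch.Tensor,     # (batch, seq_len)
    image_embeds: torch.Tensor,  # (batch, num_patches, llm_dim) projected features
    image_token_id: int,         # assumed ID of the placeholder image token
) -> torch.Tensor:
    """Replace placeholder image-token embeddings with projected visual
    features, as an overridden forward method might do before the LLM layers."""
    embeds = input_embeds.clone()
    for b in range(input_ids.size(0)):
        positions = (input_ids[b] == image_token_id).nonzero(as_tuple=True)[0]
        # Assumes the prompt reserves exactly num_patches placeholder tokens.
        embeds[b, positions] = image_embeds[b, : positions.numel()].to(embeds.dtype)
    return embeds
```

In a forward pass, the vision encoder's patch features would first go through `VisionProjector`, and the result would be spliced into the token embeddings via `splice_image_features` before the sequence is fed to the language model's transformer layers.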

## GitHub Repository 🏠

The code for running the model can be found at [Basic-Visual-Language-Model](https://github.com/xinyanghuang7/Basic-Visual-Language-Model/tree/main).

## References πŸ“š

Special thanks to the following projects for their great work πŸ™Œ:

- https://github.com/WatchTower-Liu/VLM-learning/tree/main
- https://github.com/QwenLM/Qwen
- https://github.com/haotian-liu/LLaVA

## Contact βœ‰

If you have any questions or ideas, feel free to reach out to me 😊:

hsinyanghuang7@gmail.com

I will respond as soon as I see your email!