|
--- |
|
license: apache-2.0 |
|
pipeline_tag: visual-question-answering |
|
--- |
|
|
|
# Building Your Own Multimodal Large Model from Scratch |
|
|
|
For the Chinese version of the README, please refer to [δΈζζζ‘£](README_zh.md). |
|
|
|
## Model Architecture
|
|
|
In this VLM (Visual Language Model), the visual component uses the `CLIP` or `SigLIP` encoder, both of which already provide preliminary vision-language semantic alignment. A two-layer MLP maps the visual features into the language model's embedding space, and the `forward` method of `QWenModel` is overridden so that the corresponding `image` placeholder tokens are replaced with the projected visual features.
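The idea above can be sketched in PyTorch. This is a minimal illustration, not the repository's actual code: the class and function names (`VisionProjector`, `merge_image_features`) and the tensor shapes are hypothetical, chosen only to show the two-layer MLP projection and the placeholder-token replacement step.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder features to the LLM embedding space.

    Hypothetical sketch; the real projector lives inside the repo's model code.
    """

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)


def merge_image_features(input_embeds, image_mask, image_features):
    """Replace embeddings at image-placeholder positions with visual features.

    This mirrors what an overridden ``forward`` would do before calling the
    language model's transformer layers.
    """
    out = input_embeds.clone()
    out[image_mask] = image_features.to(out.dtype)
    return out


# Toy example: a batch of 1 sequence with 10 tokens, 4 of them image placeholders.
proj = VisionProjector(vision_dim=768, llm_dim=1024)
vis = proj(torch.randn(4, 768))          # (4, 1024) projected patch features
embeds = torch.randn(1, 10, 1024)        # token embeddings from the LLM
mask = torch.zeros(1, 10, dtype=torch.bool)
mask[0, 2:6] = True                      # positions of the image tokens
merged = merge_image_features(embeds, mask, vis)
```

After this merge, `merged` is passed through the transformer layers unchanged, so the language model attends over text and image features in a single sequence.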
|
|
|
## GitHub Repository
|
|
|
The code for running the model can be found at [Basic-Visual-Language-Model](https://github.com/xinyanghuang7/Basic-Visual-Language-Model/tree/main). |
|
|
|
## References
|
|
|
Special thanks to the following projects for their great work:
|
|
|
- https://github.com/WatchTower-Liu/VLM-learning/tree/main |
|
- https://github.com/QwenLM/Qwen |
|
- https://github.com/haotian-liu/LLaVA |
|
|
|
## Contact
|
|
|
If you have any questions or ideas, feel free to reach out:
|
|
|
hsinyanghuang7@gmail.com |
|
|
|
I will respond as soon as I see your email! |