---
license: apache-2.0
pipeline_tag: visual-question-answering
---

# Building Your Own Multimodal Large Model from Scratch

For the Chinese version of the README, please refer to [δΈ­ζ–‡ζ–‡ζ‘£](README_zh.md).

## Model Architecture πŸ€–

In this VLM (Visual Language Model), the visual component uses the `CLIP` or `SIGLIP` models, which already provide preliminary vision-language semantic alignment. A two-layer MLP maps the visual features into the language model's embedding space. By overriding the `forward` method of `QWenModel`, the placeholder `image` tokens in the input are replaced with the projected visual features. A minimal sketch of this token-replacement idea is given at the end of this README.

## GitHub Repository 🏠

The code for running the model can be found at [Basic-Visual-Language-Model](https://github.com/xinyanghuang7/Basic-Visual-Language-Model/tree/main).

## References πŸ“š

Special thanks to the following projects for their great work πŸ™Œ:

- https://github.com/WatchTower-Liu/VLM-learning/tree/main
- https://github.com/QwenLM/Qwen
- https://github.com/haotian-liu/LLaVA

## Contact βœ‰

If you have any questions or ideas, feel free to reach out to me 😊:

hsinyanghuang7@gmail.com

I will respond as soon as I see your email!
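
## Architecture Sketch πŸ“

The PyTorch sketch below illustrates the idea described in the Model Architecture section: a two-layer MLP projector plus a helper that swaps the placeholder image-token embeddings for the projected visual features. It is a minimal, illustrative example only; the names (`VisionProjector`, `merge_image_features`), the hidden dimension, and the placeholder token id are assumptions for this sketch and are not taken from the repository, where the replacement happens inside the overridden `forward` of `QWenModel`.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Two-layer MLP that maps visual encoder features into the LLM embedding space.

    Hypothetical module for illustration; dimensions are placeholders.
    """

    def __init__(self, vision_dim: int, llm_dim: int, hidden_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from a CLIP/SigLIP encoder
        return self.mlp(image_features)


def merge_image_features(
    input_embeds: torch.Tensor,      # (batch, seq_len, llm_dim) token embeddings from the LLM
    input_ids: torch.Tensor,         # (batch, seq_len) token ids
    projected_images: torch.Tensor,  # (batch, num_patches, llm_dim) projected visual features
    image_token_id: int,
) -> torch.Tensor:
    """Replace placeholder image-token embeddings with projected visual features.

    Assumes each sequence contains one run of placeholder tokens, at most
    `num_patches` long. This mirrors what an overridden `forward` would do
    before calling the language model's transformer layers.
    """
    output = input_embeds.clone()
    for b in range(input_ids.size(0)):
        positions = (input_ids[b] == image_token_id).nonzero(as_tuple=True)[0]
        output[b, positions] = projected_images[b, : positions.numel()].to(output.dtype)
    return output


if __name__ == "__main__":
    # Toy shapes: batch of 2, sequence length 10, 4 image patches,
    # vision feature dim 768, LLM embedding dim 1024 (all illustrative).
    projector = VisionProjector(vision_dim=768, llm_dim=1024)
    image_feats = torch.randn(2, 4, 768)
    input_ids = torch.randint(5, 100, (2, 10))
    input_ids[:, 2:6] = 0                    # pretend token id 0 is the image placeholder
    embeds = torch.randn(2, 10, 1024)
    merged = merge_image_features(embeds, input_ids, projector(image_feats), image_token_id=0)
    print(merged.shape)                      # torch.Size([2, 10, 1024])
```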