|
--- |
|
license: apache-2.0 |
|
pipeline_tag: visual-question-answering |
|
--- |
|
|
|
# Building Your Own Multimodal Large Model from Scratch |
|
|
|
For the Chinese version of the README, please refer to [δΈζζζ‘£](README_zh.md). |
|
|
|
## Model Architecture
|
|
|
In this VLM (Visual Language Model), the visual component uses the `CLIP` or `SigLIP` encoder, both of which already provide preliminary vision-language semantic alignment. A two-layer MLP maps the visual features into the language model's embedding space, and the `forward` method of `QWenModel` is overridden so that the corresponding `image` placeholder tokens are replaced with the projected visual features.
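The idea above can be sketched in PyTorch. This is a minimal illustration, not the repository's actual code: the class and function names (`VisionProjector`, `merge_image_features`) and the tensor shapes are hypothetical, chosen only to show the two-layer MLP projection and the placeholder-token replacement step.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder features to the LLM embedding space.

    Hypothetical sketch; the real projector lives inside the repo's model code.
    """

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)


def merge_image_features(input_embeds, image_mask, image_features):
    """Replace embeddings at image-placeholder positions with visual features.

    This mirrors what an overridden ``forward`` would do before calling the
    language model's transformer layers.
    """
    out = input_embeds.clone()
    out[image_mask] = image_features.to(out.dtype)
    return out


# Toy example: a batch of 1 sequence with 10 tokens, 4 of them image placeholders.
proj = VisionProjector(vision_dim=768, llm_dim=1024)
vis = proj(torch.randn(4, 768))          # (4, 1024) projected patch features
embeds = torch.randn(1, 10, 1024)        # token embeddings from the LLM
mask = torch.zeros(1, 10, dtype=torch.bool)
mask[0, 2:6] = True                      # positions of the image tokens
merged = merge_image_features(embeds, mask, vis)
```

After this merge, `merged` is passed through the transformer layers unchanged, so the language model attends over text and image features in a single sequence.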
|
|
|
## GitHub Repository
|
|
|
The code for running the model can be found at [Basic-Visual-Language-Model](https://github.com/xinyanghuang7/Basic-Visual-Language-Model/tree/main). |
|
|
|
## References
|
|
|
Special thanks to the following projects for their great work:
|
|
|
- https://github.com/WatchTower-Liu/VLM-learning/tree/main |
|
- https://github.com/QwenLM/Qwen |
|
- https://github.com/haotian-liu/LLaVA |
|
|
|
## Contact
|
|
|
If you have any questions or ideas, feel free to reach out:
|
|
|
hsinyanghuang7@gmail.com |
|
|
|
I will respond as soon as I see your email! |