---
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-CC3M-Pretrain-595K
language:
- en
metrics:
- accuracy
pipeline_tag: visual-question-answering
---

# DinoV2-SigLIP-Phi3(LoRA) VLM

* **Vision Encoder** - DinoV2 + SigLIP @ 384px resolution. [Why 2 vision encoders?](https://arxiv.org/abs/2401.06209)
* **Connector** - MLP; the DinoV2 and SigLIP patch features are concatenated and then projected into the Phi3 representation space (see the sketch after this list).
* **Language Model** - Phi3 + LoRA (see the LoRA sketch below).
* **Pre-train (Align) Dataset** - LLaVA-CC3M-Pretrain-595K
* **Fine-tune (Instruction) Dataset** - LLaVA-v1.5-Instruct + LRV-Instruct
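
A minimal PyTorch sketch of the connector described above: patch features from the two vision encoders are concatenated along the channel dimension and projected into the Phi3 embedding space by an MLP. The feature dimensions (1024 for DinoV2-L, 1152 for SigLIP-SO400M, 3072 for Phi3-mini), the patch count, and the two-layer GELU MLP are illustrative assumptions, not necessarily the exact configuration used in this repo.

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Concatenates DinoV2 and SigLIP patch features and projects them
    into the Phi3 embedding space. All dims are illustrative assumptions."""

    def __init__(self, dino_dim=1024, siglip_dim=1152, phi3_dim=3072):
        super().__init__()
        # Two-layer MLP; the depth/activation are assumptions, not the
        # repo's confirmed architecture.
        self.mlp = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, phi3_dim),
            nn.GELU(),
            nn.Linear(phi3_dim, phi3_dim),
        )

    def forward(self, dino_feats, siglip_feats):
        # dino_feats:   (B, N, dino_dim)   patch tokens from DinoV2
        # siglip_feats: (B, N, siglip_dim) patch tokens from SigLIP
        # Both encoders see the same 384px image, so the patch counts
        # are assumed to match.
        fused = torch.cat([dino_feats, siglip_feats], dim=-1)
        return self.mlp(fused)  # (B, N, phi3_dim) visual tokens for Phi3

# Dummy usage with made-up batch/patch shapes:
dino_feats = torch.randn(1, 729, 1024)
siglip_feats = torch.randn(1, 729, 1152)
visual_tokens = Connector()(dino_feats, siglip_feats)  # (1, 729, 3072)
```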
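
And a hedged sketch of attaching LoRA adapters to Phi3 with Hugging Face PEFT. The checkpoint name, rank, alpha, dropout, and target modules below are plausible defaults for illustration, not the exact values used for this model.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Checkpoint is an assumption; the card only says "Phi3".
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

lora_cfg = LoraConfig(
    r=16,                                   # rank: illustrative value
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],  # Phi3 attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters train; the base LM stays frozen
```
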
Scripts to build and train the models are available at [NMS05/DinoV2-SigLIP-Phi3-LoRA-VLM](https://github.com/NMS05/DinoV2-SigLIP-Phi3-LoRA-VLM).