---
license: mit
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
pipeline_tag: visual-question-answering
---

# Model Card for Llava-Phi2

This is a multimodal implementation of the [Phi2](https://huggingface.co/microsoft/phi-2) model, inspired by [LLaVA-Phi](https://github.com/zhuyiche/llava-phi).

## Model Details

1. LLM Backbone: [Phi2](https://huggingface.co/microsoft/phi-2)
2. Vision Tower: [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)
3. Pretraining Dataset: [LAION-CC-SBU dataset with BLIP captions (200K samples)](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
4. Finetuning Dataset: [Instruct 150K dataset based on COCO](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)
5. Finetuned Model: [RaviNaik/Llava-Phi2](https://huggingface.co/RaviNaik/Llava-Phi2)

A conceptual sketch of how these components fit together is included at the end of this card.

### Model Sources

- **Original Repository:** [LLaVA-Phi](https://github.com/zhuyiche/llava-phi)
- **Paper:** [LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model](https://arxiv.org/pdf/2401.02330)
- **Demo:** [Demo Link](https://huggingface.co/spaces/RaviNaik/MultiModal-Phi2)

## How to Get Started with the Model

Use the steps below to get started with the model.

1. Clone the repository and navigate to the `llava-phi` folder:
```bash
git clone https://github.com/zhuyiche/llava-phi.git
cd llava-phi
```
2. Install the package:
```bash
conda create -n llava_phi python=3.10 -y
conda activate llava_phi
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
3. Run the model:
```bash
python llava_phi/eval/run_llava_phi.py --model-path="RaviNaik/Llava-Phi2" \
    --image-file="https://huggingface.co/RaviNaik/Llava-Phi2/resolve/main/people.jpg?download=true" \
    --query="How many people are there in the image?"
```

## Acknowledgement

This implementation builds on the excellent work of:

- [LLaVA-Phi](https://github.com/zhuyiche/llava-phi)
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [Phi2](https://huggingface.co/microsoft/phi-2)
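
For readers who want a picture of how the components listed under *Model Details* connect, the sketch below shows the standard LLaVA-style wiring: CLIP ViT-L/14-336 patch features are passed through a projector into the Phi-2 embedding space, where they stand in for an image placeholder in the prompt. This is a conceptual illustration, not code from the Llava-Phi repository: the class name `LlavaPhiSketch`, the single linear projector, and the feature-selection details are assumptions, and the released checkpoint's projector and weights may differ.

```python
# Conceptual sketch only (requires transformers >= 4.37), NOT the repository's
# implementation: CLIP patch features are mapped by an assumed linear projector
# into the Phi-2 embedding space, as in the LLaVA family of models.
import torch
import torch.nn as nn
from transformers import AutoConfig, CLIPImageProcessor, CLIPVisionModel


class LlavaPhiSketch(nn.Module):  # hypothetical class name, not from the repo
    def __init__(self):
        super().__init__()
        # Vision tower named in the model card (hidden size 1024, 336px input).
        self.vision_tower = CLIPVisionModel.from_pretrained(
            "openai/clip-vit-large-patch14-336"
        )
        # Only the Phi-2 config is loaded here to keep the sketch light; the
        # full model would also load the Phi-2 weights as the LLM backbone.
        phi_config = AutoConfig.from_pretrained("microsoft/phi-2")
        # Assumed projector: one linear layer from CLIP's hidden size (1024)
        # to Phi-2's hidden size (2560); the real projector may differ.
        self.projector = nn.Linear(
            self.vision_tower.config.hidden_size, phi_config.hidden_size
        )

    @torch.no_grad()
    def encode_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # CLIP ViT-L/14 at 336px yields 1 CLS token + 576 patch tokens;
        # LLaVA-style models keep the patch tokens and drop the CLS token.
        patch_tokens = self.vision_tower(pixel_values).last_hidden_state[:, 1:, :]
        # Projected tokens live in the LLM embedding space; the full model
        # splices them into the prompt before running Phi-2.
        return self.projector(patch_tokens)


if __name__ == "__main__":
    from PIL import Image

    processor = CLIPImageProcessor.from_pretrained(
        "openai/clip-vit-large-patch14-336"
    )
    # Blank stand-in image, just to show the tensor shapes end to end.
    pixels = processor(
        images=Image.new("RGB", (336, 336)), return_tensors="pt"
    ).pixel_values
    print(LlavaPhiSketch().encode_image(pixels).shape)  # torch.Size([1, 576, 2560])
```

Running the `__main__` block prints `torch.Size([1, 576, 2560])`: 576 projected patch tokens per image, each with Phi-2's hidden width, which is what a LLaVA-style model feeds alongside the text tokens. For actual inference with the finetuned weights, use the `run_llava_phi.py` command shown above.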