---
language:
- en
tags:
- llava
- multimodal
- qwen
license: apache-2.0
---

# nanoLLaVA - Sub 1B Vision-Language Model

**IMPORTANT**: **nanoLLaVA**-1.5 is out with much better performance. Please find it [here](https://huggingface.co/qnguyen3/nanoLLaVA-1.5).
## Description

nanoLLaVA is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices.

- **Base LLM**: [Quyen-SE-v0.1](https://huggingface.co/vilm/Quyen-SE-v0.1) (Qwen1.5-0.5B)
- **Vision Encoder**: [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)

| Model | **VQA v2** | **TextVQA** | **ScienceQA** | **POPE** | **MMMU (Test)** | **MMMU (Eval)** | **GQA** | **MM-VET** |
|-------|------------|-------------|---------------|----------|-----------------|-----------------|---------|------------|
| Score | 70.84      | 46.71       | 58.97         | 84.1     | 28.6            | 30.4            | 54.79   | 23.9       |

## Training Data

The training data will be released later, as I am still writing a paper on it. Expect the final model to be much more powerful than the current one.

## Finetuning Code

Coming Soon!!!

## Usage

You can use this model with `transformers` using the following script:

```bash
pip install -U transformers accelerate flash_attn
```

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
torch.set_default_device('cuda')  # or 'cpu'

# create model
model = AutoModelForCausalLM.from_pretrained(
    'qnguyen3/nanoLLaVA',
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'qnguyen3/nanoLLaVA',
    trust_remote_code=True)

# text prompt
prompt = 'Describe this image in detail'

messages = [
    {"role": "user", "content": f'