---
language:
- en
tags:
- llava
- multimodal
- qwen
license: apache-2.0
---

# nanoLLaVA - Sub 1B Vision-Language Model

**IMPORTANT**: **nanoLLaVA**-1.5 is out with much better performance. Please find it [here](https://huggingface.co/qnguyen3/nanoLLaVA-1.5).


## Description
nanoLLaVA is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices.

- **Base LLM**: [Quyen-SE-v0.1](https://huggingface.co/vilm/Quyen-SE-v0.1) (Qwen1.5-0.5B)
- **Vision Encoder**: [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)

| Model | **VQA v2** | **TextVQA** | **ScienceQA** | **POPE** | **MMMU (Test)** | **MMMU (Eval)** | **GQA** | **MM-VET** |
|-------|------------|-------------|---------------|----------|-----------------|-----------------|---------|------------|
| Score | 70.84      | 46.71       | 58.97         | 84.1     | 28.6            | 30.4            | 54.79   | 23.9       |

## Training Data
The training data will be released later, as I am still writing a paper on it. Expect the final model to be much more powerful than the current one.

## Finetuning Code
Coming Soon!!!

## Usage
You can use the model with `transformers` via the following script:

```bash
pip install -U transformers accelerate flash_attn
```

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
torch.set_default_device('cuda')  # or 'cpu'

# create model
model = AutoModelForCausalLM.from_pretrained(
    'qnguyen3/nanoLLaVA',
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'qnguyen3/nanoLLaVA',
    trust_remote_code=True)

# text prompt; the <image> marker is replaced below with the image placeholder token
prompt = 'Describe this image in detail'
messages = [
    {"role": "user", "content": f'<image>\n{prompt}'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True)
print(text)

# split the prompt on the <image> marker and splice in the image token id (-200)
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)

# image; sample images can be found in the `images` folder
image = Image.open('/path/to/image.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```

## Prompt Format
The model follows the ChatML standard, but without a `\n` at the end of `<|im_end|>`:

```
<|im_start|>system
Answer the question<|im_end|><|im_start|>user
What is the picture about?<|im_end|><|im_start|>assistant
```
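For illustration, the prompt can also be assembled by hand instead of through `apply_chat_template`. This is a minimal sketch, assuming the template is exactly the one shown above (with the standard ChatML newline after `assistant`) and that the user turn carries the `<image>` marker, as in the Usage script; the `-200` placeholder id mirrors the one used there.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'qnguyen3/nanoLLaVA', trust_remote_code=True)

IMAGE_TOKEN_INDEX = -200  # image placeholder id used by the Usage script

system = 'Answer the question'
question = 'What is the picture about?'

# ChatML-style prompt with no newline after <|im_end|>,
# matching the format shown above
prompt = (
    f'<|im_start|>system\n{system}<|im_end|>'
    f'<|im_start|>user\n<image>\n{question}<|im_end|>'
    f'<|im_start|>assistant\n'
)

# split on <image> and splice the placeholder id between the two halves
chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('<image>')]
input_ids = torch.tensor(chunks[0] + [IMAGE_TOKEN_INDEX] + chunks[1],
                         dtype=torch.long).unsqueeze(0)
```

The resulting `input_ids` can then be passed to `model.generate` together with `image_tensor`, exactly as in the Usage script.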
"Small but mighty".
**How does the text correlate to the context of the image?**
The text seems to be a playful or humorous representation of a small but mighty figure, possibly a mouse or a mouse toy, holding a weightlifting bar. | --- Model is trained using a modified version from [Bunny](https://github.com/BAAI-DCAI/Bunny/tree/main/bunny)
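For interactive use, the output can also be printed token by token with `transformers`' `TextStreamer` rather than decoded after generation finishes. A minimal sketch, assuming `model`, `tokenizer`, `input_ids`, and `image_tensor` from the Usage script above, and that the remote-code `generate` forwards standard generation kwargs such as `streamer`:

```python
from transformers import TextStreamer

# stream decoded text to stdout as tokens are generated,
# skipping the prompt and special tokens
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True,
    streamer=streamer)
```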