---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- shi-labs/pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b
pipeline_tag: image-text-to-text
---
# OLA-VLM-CLIP-ViT-Llama3-8b Model Card
OLA-VLM distills visual information from a set of target encoders into the intermediate representations of the LLM. During training, it applies predictive embedding optimization at selected LLM layers, minimizing embedding losses alongside the next-token-prediction (NTP) objective, which yields a vision-centric approach to training the Multimodal Large Language Model.
- **GitHub Repo:** [https://github.com/SHI-Labs/OLA-VLM](https://github.com/SHI-Labs/OLA-VLM)
- **Project Page:** [https://praeclarumjj3.github.io/ola_vlm/](https://praeclarumjj3.github.io/ola_vlm/)
<p align="center">
<img src="https://praeclarumjj3.github.io/ola_vlm/teaser.png" width="90%" class="center"/>
</p>
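To make the training objective above more concrete, here is a rough, hypothetical sketch of how an NTP loss can be combined with embedding losses computed at selected LLM layers against frozen target-encoder features. All names (`combined_loss`, `probes`, `layer_ids`, the weight `alpha`) and the choice of `smooth_l1_loss` are illustrative assumptions, not the actual OLA-VLM implementation; please refer to the [GitHub repo](https://github.com/SHI-Labs/OLA-VLM) for the real training code.

```python
import torch
import torch.nn.functional as F

def combined_loss(ntp_loss, hidden_states, target_feats, probes, layer_ids, alpha=0.5):
    """Hypothetical sketch: add embedding-distillation losses at selected LLM
    layers to the next-token-prediction (NTP) objective.

    hidden_states: per-layer hidden states from the LLM forward pass
    target_feats:  dict mapping layer id -> frozen target-encoder features
    probes:        dict mapping layer id -> small projection head (nn.Module)
    """
    embed_loss = 0.0
    for lid in layer_ids:
        # project the LLM states and compare them to the target-encoder features
        pred = probes[lid](hidden_states[lid])
        embed_loss = embed_loss + F.smooth_l1_loss(pred, target_feats[lid])
    # total objective: NTP + weighted embedding losses (weighting scheme is an assumption)
    return ntp_loss + alpha * embed_loss
```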
## Get Started with the Model
Clone the repository and follow the [setup instructions](https://github.com/SHI-Labs/OLA-VLM#installation-instructions):
```bash
git lfs install
git clone https://github.com/SHI-Labs/OLA-VLM
cd OLA-VLM
```
After setup, you can use OLA-VLM with the following code:
```python
import torch
from PIL import Image

from ola_vlm.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from ola_vlm.conversation import conv_templates
from ola_vlm.model.builder import load_pretrained_model
from ola_vlm.mm_utils import tokenizer_image_token, get_model_name_from_path, process_images
model_path = "shi-labs/OLA-VLM-CLIP-ViT-Llama3-8b"
conv_mode = "llava_llama_3"
image_path = "/path/to/OLA-VLM/assets/pb.jpg"
input_prompt = "Describe this image."
# load model
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name)
# prepare prompt
input_prompt = DEFAULT_IMAGE_TOKEN + '\n' + input_prompt
conv = conv_templates[conv_mode].copy()
conv.append_message(conv.roles[0], input_prompt)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
# load and preprocess image
image = Image.open(image_path).convert('RGB')
image_tensor = process_images([image], image_processor, model.config)[0]
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')
input_ids = input_ids.to(device='cuda', non_blocking=True)
image_tensor = image_tensor.to(dtype=torch.float16, device='cuda', non_blocking=True)
# run inference
with torch.inference_mode():
    output_ids = model.generate(
        input_ids.unsqueeze(0),
        images=image_tensor.unsqueeze(0),
        image_sizes=[image.size],
        do_sample=True,
        temperature=0.2,
        top_p=0.5,
        num_beams=1,
        max_new_tokens=256,
        use_cache=True,
    )
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"Image:{image_path} \nPrompt:{input_prompt} \nOutput:{outputs}")
```
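If you plan to query the model repeatedly, you may want to wrap the steps above in a small helper so the model is loaded only once. The sketch below simply reuses the calls from the snippet above; the function name, defaults, and greedy decoding are our own choices, not part of the official API.

```python
def describe_image(image_path, prompt, model, tokenizer, image_processor,
                   conv_mode="llava_llama_3", max_new_tokens=256):
    """Illustrative wrapper around the inference steps shown above."""
    # build the conversation prompt with the image token
    conv = conv_templates[conv_mode].copy()
    conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + '\n' + prompt)
    conv.append_message(conv.roles[1], None)

    # load and preprocess the image, tokenize the prompt
    image = Image.open(image_path).convert('RGB')
    image_tensor = process_images([image], image_processor, model.config)[0]
    input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX,
                                      return_tensors='pt').to('cuda')

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids.unsqueeze(0),
            images=image_tensor.to(dtype=torch.float16, device='cuda').unsqueeze(0),
            image_sizes=[image.size],
            do_sample=False,
            max_new_tokens=max_new_tokens,
            use_cache=True,
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

# Example usage (reusing the objects loaded earlier):
# print(describe_image("/path/to/OLA-VLM/assets/pb.jpg", "Describe this image.",
#                      model, tokenizer, image_processor))
```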
For more information, please refer to [https://github.com/SHI-Labs/OLA-VLM](https://github.com/SHI-Labs/OLA-VLM).
## Citation
If you found our work useful, please consider starring ⭐ us on [GitHub](https://github.com/SHI-Labs/OLA-VLM) and citing 📚 us in your research!
```bibtex
@article{jain2024ola_vlm,
title={{OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation}},
author={Jitesh Jain and Zhengyuan Yang and Humphrey Shi and Jianfeng Gao and Jianwei Yang},
journal={arXiv},
year={2024}
}
```