---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- shi-labs/pretrain_dsg_OLA-VLM-CLIP-ViT-Llama3-8b
pipeline_tag: image-text-to-text
---

# OLA-VLM-CLIP-ViT-Llama3-8b Model Card

OLA-VLM distills target visual information from a set of target encoders into the intermediate representations of the LLM. During training, it adopts a predictive embedding optimization approach at selected LLM layers, minimizing the embedding losses alongside the next-token prediction (NTP) objective, resulting in a vision-centric approach to training the Multimodal Large Language Model.
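
As a rough, hypothetical sketch of that objective (the function name, the per-layer weights, and the smooth-L1 distance below are illustrative assumptions, not the released training code), the total loss combines the NTP cross-entropy with weighted embedding losses at the selected layers:

```python
# Illustrative sketch only, NOT the released OLA-VLM training code.
# Assumes per-layer hidden states and target-encoder embeddings are already computed.
import torch
import torch.nn.functional as F

def ola_vlm_style_loss(ntp_logits, target_tokens, layer_hidden_states,
                       target_embeddings, embed_loss_weights):
    """Combine next-token prediction with per-layer embedding losses.

    layer_hidden_states / target_embeddings: dicts keyed by selected LLM layer index.
    embed_loss_weights: per-layer weights (hyperparameters assumed for illustration).
    """
    # standard next-token prediction loss (cross-entropy over the vocabulary)
    loss = F.cross_entropy(ntp_logits.view(-1, ntp_logits.size(-1)),
                           target_tokens.view(-1))
    # add a weighted embedding loss at each selected layer; the distance used
    # here is a stand-in for the paper's embedding losses
    for layer, weight in embed_loss_weights.items():
        loss = loss + weight * F.smooth_l1_loss(layer_hidden_states[layer],
                                                target_embeddings[layer])
    return loss
```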

- **GitHub Repo:** [https://github.com/SHI-Labs/OLA-VLM](https://github.com/SHI-Labs/OLA-VLM)
- **Project Page:** [https://praeclarumjj3.github.io/ola_vlm/](https://praeclarumjj3.github.io/ola_vlm/)

<p align="center">
    <img src="https://praeclarumjj3.github.io/ola_vlm/teaser.png" width="90%" class="center"/>
</p>

## Get Started with the Model

Clone the repository and follow the [setup instructions](https://github.com/SHI-Labs/OLA-VLM#installation-instructions):

```bash
git lfs install
git clone https://github.com/SHI-Labs/OLA-VLM
cd OLA-VLM
```
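
Optionally, the checkpoint can be pre-downloaded from the Hugging Face Hub with the standard `huggingface_hub` utility (this is not OLA-VLM-specific; the loading code below can also fetch the weights by repo id on first use):

```python
# Optional: pre-download the checkpoint from the Hugging Face Hub.
# snapshot_download returns the local cache path of the downloaded repo.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="shi-labs/OLA-VLM-CLIP-ViT-Llama3-8b")
print(local_path)
```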

After setup, you can use OLA-VLM with the following code:

```python
import torch
from PIL import Image

from ola_vlm.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from ola_vlm.conversation import conv_templates, SeparatorStyle
from ola_vlm.model.builder import load_pretrained_model
from ola_vlm.mm_utils import tokenizer_image_token, get_model_name_from_path, process_images

model_path = "shi-labs/OLA-VLM-CLIP-ViT-Llama3-8b"
conv_mode = "llava_llama_3"
image_path = "/path/to/OLA-VLM/assets/pb.jpg"
input_prompt = "Describe this image."

# load model
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name)

# prepare prompt
input_prompt = DEFAULT_IMAGE_TOKEN + '\n' + input_prompt

conv = conv_templates[conv_mode].copy()
conv.append_message(conv.roles[0], input_prompt)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# load and preprocess image
image = Image.open(image_path).convert('RGB')
image_tensor = process_images([image], image_processor, model.config)[0]
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')

input_ids = input_ids.to(device='cuda', non_blocking=True)
image_tensor = image_tensor.to(dtype=torch.float16, device='cuda', non_blocking=True)

# run inference
with torch.inference_mode():
    output_ids = model.generate(
        input_ids.unsqueeze(0),
        images=image_tensor.unsqueeze(0),
        image_sizes=[image.size],
        do_sample=True,
        temperature=0.2,
        top_p=0.5,
        num_beams=1,
        max_new_tokens=256,
        use_cache=True)

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"Image:{image_path} \nPrompt:{input_prompt} \nOutput:{outputs}")
```
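
Note that `do_sample=True` with a low `temperature`/`top_p` yields mildly stochastic outputs; for reproducible generations you can pass `do_sample=False` (greedy decoding) to `generate` instead.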

For more information, please refer to [https://github.com/SHI-Labs/OLA-VLM](https://github.com/SHI-Labs/OLA-VLM).

## Citation

If you found our work useful, please consider starring ⭐ us on [GitHub](https://github.com/SHI-Labs/OLA-VLM) and citing 📚 us in your research!

```bibtex
@article{jain2024ola_vlm,
    title={{OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation}},
    author={Jitesh Jain and Zhengyuan Yang and Humphrey Shi and Jianfeng Gao and Jianwei Yang},
    journal={arXiv},
    year={2024}
}
```