---
inference: false
language:
- th
- en
library_name: transformers
tags:
- instruct
- chat
license: llama3
---
|
|
|
# **Typhoon-Vision Preview**
|
|
|
**llama-3-typhoon-v1.5-8b-vision-preview** is a 🇹🇭 Thai *vision-language* model. It natively supports text and image inputs and produces text output. This version (August 2024) is our first vision-language model, released as a research *preview* as part of our multimodal effort. The base language model is our [llama-3-typhoon-v1.5-8b-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct).
|
|
|
More details can be found in our [release blog](https://medium.com/opentyphoon/typhoon-vision-preview-release-0bdef028ca55) and technical report (coming soon). *To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.*
|
|
|
# **Model Description**

**Llama3 Typhoon Instruct Vision Preview** is built upon [Llama-3-Typhoon-1.5-8B-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct) and [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384).

Our training recipe is based on [Bunny by BAAI](https://github.com/BAAI-DCAI/Bunny).

- **Model type**: An 8B instruct decoder-only model with a vision encoder, based on the Llama architecture.
- **Requirement**: transformers 4.38.0 or newer (a quick version check is sketched below).
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **Demo**: [https://vision.opentyphoon.ai/](https://vision.opentyphoon.ai/)
- **License**: [Llama 3 Community License](https://llama.meta.com/llama3/license/)
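
You can verify the transformers requirement before loading the model. A minimal sketch, assuming `packaging` is available (it ships as a dependency of recent transformers releases):

```python
# Minimal sketch: confirm the installed transformers version meets the 4.38.0 requirement.
from packaging import version

import transformers

if version.parse(transformers.__version__) < version.parse("4.38.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} found; 4.38.0 or newer is required "
        "(e.g. `pip install -U transformers`)."
    )
```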
|
|
|
# **Quickstart**

The snippet below shows how to use the model with transformers.

Before running it, install the following dependencies:

```shell
pip install torch transformers accelerate pillow
```
|
|
|
```python
import io
import warnings

import requests
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

# Disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# Set device
device = 'cuda'  # or 'cpu'
torch.set_default_device(device)

# Create model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    'scb10x/llama-3-typhoon-v1.5-8b-instruct-vision-preview',
    torch_dtype=torch.float16,  # use torch.float32 for CPU
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'scb10x/llama-3-typhoon-v1.5-8b-instruct-vision-preview',
    trust_remote_code=True)

def prepare_inputs(text, has_image=False, device='cuda'):
    messages = [
        {"role": "system", "content": "You are a helpful vision-capable assistant who eagerly converses with the user in their language."},
    ]

    if has_image:
        messages.append({"role": "user", "content": "<|image|>\n" + text})
    else:
        messages.append({"role": "user", "content": text})

    inputs_formatted = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False
    )

    if has_image:
        # Split around the image placeholder and splice in the image token id (-200)
        text_chunks = [tokenizer(chunk).input_ids for chunk in inputs_formatted.split('<|image|>')]
        input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][1:], dtype=torch.long).unsqueeze(0).to(device)
        attention_mask = torch.ones_like(input_ids).to(device)
    else:
        input_ids = torch.tensor(tokenizer(inputs_formatted).input_ids, dtype=torch.long).unsqueeze(0).to(device)
        attention_mask = torch.ones_like(input_ids).to(device)

    return input_ids, attention_mask

# Example inputs (try replacing the URL with your own image)
prompt = 'บอกทุกอย่างที่เห็นในรูป'  # "Describe everything you see in the picture."
img_url = "https://img.traveltriangle.com/blog/wp-content/uploads/2020/01/cover-for-Thailand-In-May_27th-Jan.jpg"
image = Image.open(io.BytesIO(requests.get(img_url).content))
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)
input_ids, attention_mask = prepare_inputs(prompt, has_image=True, device=device)

# Generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=1000,
    use_cache=True,
    temperature=0.2,
    top_p=0.2,
    repetition_penalty=1.0,  # increase this to avoid repetitive chattering
)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```
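
The same `prepare_inputs` helper also covers text-only chat. A minimal sketch reusing the `model`, `tokenizer`, and `prepare_inputs` defined above, assuming the model's remote-code `generate` also accepts calls without an `images` tensor; the Thai prompt is only an illustrative example:

```python
# Text-only usage sketch, reusing objects from the snippet above.
# Assumption: the remote-code `generate` also works when no `images` tensor is passed.
text_prompt = 'อธิบายประเทศไทยสั้น ๆ'  # "Briefly describe Thailand."
input_ids, attention_mask = prepare_inputs(text_prompt, has_image=False, device=device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=500,
    use_cache=True,
    temperature=0.2,
    top_p=0.2,
)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```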
|
|
|
# Evaluation Results

| Model | MMBench (Dev) | POPE | GQA | GQA (Thai) |
|:--|:--|:--|:--|:--|
| Typhoon-Vision 8B Preview | 70.9 | 84.8 | 62.0 | 43.6 |
| SeaLMMM 7B v0.1 | 64.8 | 86.3 | 61.4 | 25.3 |
| Bunny Llama3 8B Vision | 76.0 | 86.9 | 64.8 | 24.0 |
| GPT-4o Mini | 69.8 | 45.4 | 42.6 | 18.1 |
|
|
|
# Intended Uses & Limitations

This model is experimental and might not be fully evaluated for all use cases. Developers should assess risks in the context of their specific applications.
|
|
|
# Follow Us & Support

- https://twitter.com/opentyphoon
- https://discord.gg/CqyBscMFpg
|
|
|
# Acknowledgements

We would like to thank the Bunny team for open-sourcing their code and data, and the Google team for releasing the fine-tuned SigLIP that we adopt as our vision encoder. We also thank the many other open-source projects for their useful knowledge sharing, data, code, and model weights.
|
|
|
## Typhoon Team
|
Parinthapat Pengpun, Potsawee Manakul, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Pathomporn Chokchainant, Kasima Tharnpipitchai, Kunat Pipatanakul |