metadata

inference: false
language:
  - th
  - en
library_name: transformers
tags:
  - instruct
  - chat
license: llama3

Typhoon-Vision Research Preview

llama-3-typhoon-v1.5-8b-vision-preview is a 🇹🇭 Thai vision-language model. It natively accepts both text and image inputs and produces text output. This version (August 2024) is our first vision-language model, released as part of our multimodal effort, and it is a research preview. The base language model is our llama-3-typhoon-v1.5-8b-instruct.

More details can be found in our release blog. *To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.

Model Description

Here we provide Llama3 Typhoon Instruct Vision Preview, which is built upon Llama-3-Typhoon-1.5-8B-instruct and SigLIP.

Our architecture is based on Bunny by BAAI.

Model type: An 8B instruct decoder-only model with a vision encoder, based on the Llama architecture.
Requirement: transformers 4.38.0 or newer.
Primary Language(s): Thai 🇹🇭 and English 🇬🇧
License: Llama 3 Community License
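
As a quick sanity check for the transformers requirement listed above, the following is a minimal sketch that compares the installed version against 4.38.0. The helper names (`parse_version`, `meets_minimum`, `check_transformers`) are hypothetical and not part of transformers; the parsing is deliberately simple and only looks at the leading numeric components of the version string.

```python
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Turn '4.38.0' (or '4.41.0.dev0') into a comparable tuple of ints."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # pad so that '4.38' compares equal to '4.38.0'
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.38.0") -> bool:
    """Return True if `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


def check_transformers(minimum: str = "4.38.0") -> str:
    """Report whether the installed transformers satisfies the minimum."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    if meets_minimum(installed, minimum):
        return f"transformers {installed} OK"
    return f"transformers {installed} is older than {minimum}"


print(check_transformers())
```

If the check reports an older version, upgrading with pip before running the Quickstart should be enough.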

Quickstart

The snippet below shows how to use the model with transformers.

Before running the snippet, you need to install the following dependencies:

pip install torch transformers accelerate pillow

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings
import io
import requests

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# Set Device
device = 'cuda'  # or cpu
torch.set_default_device(device)

# Create Model
model = AutoModelForCausalLM.from_pretrained(
    'scb10x/llama-3-typhoon-v1.5-8b-instruct-vision-preview',
    torch_dtype=torch.float16, # float32 for cpu
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'scb10x/llama-3-typhoon-v1.5-8b-instruct-vision-preview',
    trust_remote_code=True)

def prepare_inputs(text, has_image=False, device='cuda'):
    messages = [
        {"role": "system", "content": "You are a helpful vision-capable assistant who eagerly converses with the user in their language."},
    ]
    
    if has_image:
        messages.append({"role": "user", "content": "<|image|>\n" + text})
    else:
        messages.append({"role": "user", "content": text})
    
    inputs_formatted = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False
    )

    if has_image:
        text_chunks = [tokenizer(chunk).input_ids for chunk in inputs_formatted.split('<|image|>')]
        input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][1:], dtype=torch.long).unsqueeze(0).to(device)
        attention_mask = torch.ones_like(input_ids).to(device)
    else:
        input_ids = torch.tensor(tokenizer(inputs_formatted).input_ids, dtype=torch.long).unsqueeze(0).to(device)
        attention_mask = torch.ones_like(input_ids).to(device)

    return input_ids, attention_mask

# Example Inputs (try replacing with your own url)
prompt = 'บอกทุกอย่างที่เห็นในรูป'
img_url = "https://img.traveltriangle.com/blog/wp-content/uploads/2020/01/cover-for-Thailand-In-May_27th-Jan.jpg"
image = Image.open(io.BytesIO(requests.get(img_url).content))
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)
input_ids, attention_mask = prepare_inputs(prompt, has_image=True, device=device)

# Generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=1000,
    use_cache=True,
    temperature=0.2,
    top_p=0.2,
    repetition_penalty=1.0,  # increase this to avoid chattering
)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
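
The image branch of `prepare_inputs` above splits the formatted prompt at `<|image|>`, tokenizes each half separately, and joins them around the id `-200`, which appears to act as a placeholder marking where the image features go. The sketch below illustrates only that splicing step with a toy tokenizer. `toy_tokenize`, `BOS_ID`, and the fake token ids are hypothetical stand-ins, and the assumption that the real tokenizer prepends a BOS token to each chunk is inferred from the `[1:]` slice in the snippet rather than confirmed.

```python
IMAGE_TOKEN_INDEX = -200  # placeholder id used in the Quickstart snippet
BOS_ID = 1                # hypothetical BOS id for this toy example


def toy_tokenize(text: str) -> list:
    """Hypothetical tokenizer: BOS followed by one fake id per word."""
    return [BOS_ID] + [100 + len(word) for word in text.split()]


def splice_image_token(prompt: str, marker: str = "<|image|>") -> list:
    """Tokenize the text around `marker` and put the placeholder between."""
    before, after = prompt.split(marker, 1)
    left = toy_tokenize(before)
    right = toy_tokenize(after)
    # Drop the BOS that the tokenizer prepended to the second chunk,
    # so the final sequence contains exactly one BOS at the start.
    return left + [IMAGE_TOKEN_INDEX] + right[1:]


ids = splice_image_token("user: <|image|>\ndescribe the picture")
print(ids)  # [1, 105, -200, 108, 103, 107]
```

The result is a single id sequence with one BOS and one placeholder, which mirrors the shape of `input_ids` that the Quickstart passes to `model.generate` together with `images=image_tensor`.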

Intended Uses & Limitations

This model is experimental and might not be fully evaluated for all use cases. Developers should assess risks in the context of their specific applications.

Follow us

https://twitter.com/opentyphoon

Support

https://discord.gg/CqyBscMFpg
