Quantizations of https://huggingface.co/SeaLLMs/SeaLLM3-7B-Chat

Inference Clients/UIs
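
These GGUF files can be used with llama.cpp and llama.cpp-based clients and UIs. As a minimal sketch (not part of the original readme), the snippet below loads one of the quantized files with llama-cpp-python and runs a chat completion; the filename, context size, and GPU settings are placeholders to adjust for the file you actually download.

from llama_cpp import Llama

# load a quantized GGUF file (the filename here is a placeholder; pick the
# quantization level you downloaded from this repo)
llm = Llama(
    model_path="SeaLLM3-7B-Chat.Q4_K_M.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if available
)

# chat-style inference; recent llama-cpp-python versions pick up the chat
# template from the GGUF metadata
out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hiii How are you?"},
    ],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])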


From original readme

Uses

SeaLLMs is tailored for handling a wide range of languages spoken in the SEA region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.

This page introduces the SeaLLM3-7B-Chat model, specifically fine-tuned to follow human instructions effectively for task completion, making it directly applicable to your applications.

Get started with Transformers

To quickly try the model, the snippet below shows how to run inference with transformers. Make sure you have a recent transformers version installed (newer than 4.40).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "SeaLLMs/SeaLLM3-7B-Chat",
    torch_dtype=torch.bfloat16,
    device_map=device
)
tokenizer = AutoTokenizer.from_pretrained("SeaLLMs/SeaLLM3-7B-Chat")

# prepare the chat messages for the model
prompt = "Hiii How are you?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
print(f"Formatted text:\n {text}")
print(f"Model input:\n {model_inputs}")

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

print(f"Response:\n {response[0]}")

You can also use the following snippet, which uses TextStreamer to stream the model's output as it is generated and lets you keep conversing with the model in a loop:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "SeaLLMs/SeaLLM3-7B-Chat",
    torch_dtype=torch.bfloat16,
    device_map=device
)
tokenizer = AutoTokenizer.from_pretrained("SeaLLMs/SeaLLM3-7B-Chat")

# start the conversation with a system message
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
]

while True:
    prompt = input("User:")
    messages.append({"role": "user", "content": prompt})
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, streamer=streamer)
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    messages.append({"role": "assistant", "content": response})

Inference with vLLM

You can also run inference with vLLM, a fast and easy-to-use library for LLM inference and serving. To use it, first install the latest version via pip install vllm.

from vllm import LLM, SamplingParams

prompts = [
    "Who is the president of US?",
    "Can you speak Indonesian?"
]

ckpt_path = "SeaLLMs/SeaLLM3-7B-Chat"  # or a local path to the checkpoint
llm = LLM(ckpt_path, dtype="bfloat16")
sparams = SamplingParams(temperature=0.1, max_tokens=512)
outputs = llm.generate(prompts, sparams)

# print out the model response
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}\nResponse: {generated_text}\n\n")
Model size: 7.62B params
Architecture: qwen2
Available GGUF quantizations: 1-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit