Quantizations of https://huggingface.co/SeaLLMs/SeaLLM3-7B-Chat
Inference Clients/UIs
From the original readme
Uses
SeaLLMs are tailored to handle a wide range of languages spoken in the SEA region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
This page introduces SeaLLM3-7B-Chat, a version fine-tuned to follow human instructions for task completion, making it directly applicable to your applications.
Get started with Transformers
To quickly try the model, we show below how to run inference with transformers. Make sure you have installed a recent transformers version (>4.40).
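If you are unsure which version you have, a minimal check (this only assumes transformers is already installed) is:
import transformers
print(transformers.__version__)  # should print a version greater than 4.40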
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "SeaLLMs/SeaLLM3-7B-chat",
    torch_dtype=torch.bfloat16,
    device_map=device,
)
tokenizer = AutoTokenizer.from_pretrained("SeaLLMs/SeaLLM3-7B-chat")

# prepare the messages for the model
prompt = "Hiii How are you?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
print(f"Formatted text:\n {text}")
print(f"Model input:\n {model_inputs}")

# generate, then drop the prompt tokens so only the newly generated tokens are decoded
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(f"Response:\n {response[0]}")
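If you plan to query the model repeatedly, it can be convenient to wrap the steps above in a small helper. The following is a minimal sketch that reuses the model, tokenizer, and device loaded above; the chat function name and its defaults are illustrative, not part of the model's API:
def chat(prompt, system="You are a helpful assistant.", max_new_tokens=512):
    # build the chat-formatted input for a single user turn
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    # generate and keep only the newly generated tokens
    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=max_new_tokens, do_sample=True)
    generated_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(chat("Bisakah kamu berbicara bahasa Indonesia?"))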
You can also use the following snippet, which relies on TextStreamer to stream the model's replies as they are generated so you can keep a conversation going:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "SeaLLMs/SeaLLM3-7B-chat",
    torch_dtype=torch.bfloat16,
    device_map=device,
)
tokenizer = AutoTokenizer.from_pretrained("SeaLLMs/SeaLLM3-7B-chat")

# the running conversation history, starting with the system message
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
]

while True:
    prompt = input("User:")
    messages.append({"role": "user", "content": prompt})
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # stream tokens to stdout as they are generated
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, streamer=streamer)
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    # keep the assistant reply in the history for the next turn
    messages.append({"role": "assistant", "content": response})
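Note that the loop above grows messages indefinitely; for long sessions you may want to trim old turns so the prompt stays within the model's context window. A minimal sketch, where trim_history and the 20-message cap are illustrative choices rather than anything required by the model:
MAX_TURNS = 20  # arbitrary cap on stored messages, excluding the system prompt

def trim_history(messages, max_turns=MAX_TURNS):
    # keep the system message plus only the most recent turns
    system, rest = messages[:1], messages[1:]
    return system + rest[-max_turns:]

# inside the loop, call this before apply_chat_template:
# messages = trim_history(messages)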
Inference with vLLM
You can also run inference with vLLM, a fast and easy-to-use library for LLM inference and serving. To use it, first install the latest version via pip install vllm.
from vllm import LLM, SamplingParams

prompts = [
    "Who is the president of US?",
    "Can you speak Indonesian?",
]

ckpt_path = "SeaLLMs/SeaLLM3-7B-chat"  # Hugging Face repo id or a local checkpoint directory
llm = LLM(ckpt_path, dtype="bfloat16")
sparams = SamplingParams(temperature=0.1, max_tokens=512)
outputs = llm.generate(prompts, sparams)

# print out the model responses
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}\nResponse: {generated_text}\n\n")
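The prompts above are passed as raw strings; for a chat model you will generally get better results by applying the same chat template used in the transformers examples before calling llm.generate. A minimal sketch, reusing prompts, sparams, and llm from above; the to_chat_prompt helper is illustrative:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("SeaLLMs/SeaLLM3-7B-chat")

def to_chat_prompt(user_message):
    # wrap a raw user message in the model's chat template
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_message},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

chat_prompts = [to_chat_prompt(p) for p in prompts]
outputs = llm.generate(chat_prompts, sparams)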