How to use with vLLM:
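If vLLM is not installed yet, it is available from PyPI (for example, pip install vllm).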
from vllm import LLM, SamplingParams

inputs = [
    "Who is the president of US?",
    "Can you speak Indonesian?",
]

# Initialize the LLM with the AWQ-quantized checkpoint
llm = LLM(model="jester6136/SeaLLMs-v3-1.5B-Chat-AWQ",
          quantization="AWQ",
          gpu_memory_utilization=0.9,
          max_model_len=2000,
          max_num_seqs=32)

# Greedy decoding; top_k must be an integer in vLLM (-1 disables top-k filtering)
sparams = SamplingParams(temperature=0.0, max_tokens=2000, top_p=0.95, top_k=-1,
                         repetition_penalty=1.05)

# Wrap each question in the model's ChatML-style prompt format
chat_template = "<|im_start|>user\n{input}<|im_end|>\n<|im_start|>assistant\n"
prompts = [chat_template.format(input=prompt) for prompt in inputs]

outputs = llm.generate(prompts, sparams)

# Print out the model responses
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}\nResponse: {generated_text}\n\n")