Code example request with vllm

#1
by ShiningJazz - opened

Can anyone give me some example code for using this model with the vLLM library?
I'm a newbie to LLMs and the vLLM library.

In particular, I want to know what method or string should be passed as the quantization parameter:
model = LLM(model="neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w4a16", tensor_parallel_size=4, quantization=)

ShiningJazz changed discussion title from Example request with vllm to Code example request with vllm
Neural Magic org

You can just run with:

from vllm import LLM

# No quantization argument needed: the method is read from the checkpoint's config.
model = LLM(model="neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w4a16", tensor_parallel_size=4)
output = model.generate("Hello my name is")

You need not specify the quantization argument since it will be inferred from the checkpoint.
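If you do want to fill in that parameter explicitly, the string has to match the checkpoint's serialization format. A minimal sketch, assuming this checkpoint is GPTQ-serialized (which its w4a16 naming suggests; vLLM will raise an error if the string does not match the checkpoint):

from vllm import LLM

# Passing the method explicitly; "gptq" is an assumption based on this
# checkpoint's GPTQ-style w4a16 serialization. Omitting the argument and
# letting vLLM infer it is the safer default.
model = LLM(
    model="neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w4a16",
    tensor_parallel_size=4,
    quantization="gptq",
)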
For a full chat-style example, you could use the following snippet:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w4a16"

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=300)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat as a single prompt string. add_generation_prompt=True
# appends the assistant header so the model answers rather than
# continuing the user turn.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=4)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
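
Note that generate also accepts a list of prompts and returns one RequestOutput per prompt, so you can batch several conversations in a single call. A short sketch reusing the tokenizer, llm, and sampling_params defined above (the example chats are illustrative):

# Batch several chat prompts in one call; vLLM schedules them together.
chats = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "Tell me a short pirate joke."}],
]
batch_prompts = [
    tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
    for m in chats
]
for out in llm.generate(batch_prompts, sampling_params):
    print(out.outputs[0].text)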
abhinavnmagic changed discussion status to closed
