Error in using this model for inference in Google Colab

#1
by sudhir2016 - opened

Load model
model_id = 'mobiuslabsgmbh/Llama-2-7b-hf-4bit_g64-HQQ'

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)

Generate
prompt = "Capital of India"
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(inputs.input_ids, max_length=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

This is the error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
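The mismatch is easy to confirm by printing the devices involved; a minimal check, assuming from_quantized placed the weights on cuda:0 and that the wrapped model exposes the usual nn.Module parameters():

print(next(model.parameters()).device)  # cuda:0 -- where the loaded model lives (assumed)
print(inputs.input_ids.device)          # cpu -- the tokenizer output was never moved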

sudhir2016 changed discussion title from Error in using this model for inference on in Google Colab to Error in using this model for inference in Google Colab
Mobius Labs GmbH org

You forgot to put the tokenized input on the GPU:

model_id = 'mobiuslabsgmbh/Llama-2-7b-hf-4bit_g64-HQQ'

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)

prompt = "Capital of India"
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')  # move the tokenized inputs to the GPU, where the model lives
generate_ids = model.generate(inputs.input_ids, max_length=30)[0]
print(tokenizer.decode(generate_ids))

Output:

<s> Capital of India, Delhi is a city of contrasts. surely, the city is a blend of the old and the new. The
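As an aside, the leading <s> in the output is the BOS special token; it can be dropped with the standard skip_special_tokens flag on decode:

print(tokenizer.decode(generate_ids, skip_special_tokens=True))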

Thank you so much, it works now!!

sudhir2016 changed discussion status to closed
