Falcon-7B-Instruct using CPU for inference even on NVIDIA A40 cards with 48 GB VRAM

#70
by Akshadv

import torch
from transformers import AutoTokenizer, pipeline

model_path = "tiiuae/falcon-7b-instruct"  # local path or Hub model id

tokenizer = AutoTokenizer.from_pretrained(model_path)

falcon_pipeline = pipeline(
    "text-generation",
    model=model_path,
    tokenizer=tokenizer,
    max_new_tokens=256,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    do_sample=True,
    top_k=10,
    temperature=0.7,
    eos_token_id=tokenizer.eos_token_id,
)

I'm using this code together with an LLMChain for inference. Am I doing something wrong, or does anything need to change to run inference fully on the GPU?

CPU usage is always at 100%.
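As a first diagnostic, it can help to confirm that CUDA is actually visible to the Python process and to check where the pipeline placed the model weights; with `device_map="auto"`, Accelerate silently falls back to CPU when no GPU is detected. This is a minimal sketch, assuming `falcon_pipeline` was built as in the snippet above (the pipeline-specific lines are commented out so the check runs on its own):

```python
import torch

def cuda_visible() -> bool:
    """Return True if at least one CUDA device is visible to this process."""
    return torch.cuda.is_available() and torch.cuda.device_count() > 0

print("CUDA visible:", cuda_visible())

# With a loaded pipeline, you can inspect placement directly:
# print(next(falcon_pipeline.model.parameters()).device)        # expect cuda:0, not cpu
# print(getattr(falcon_pipeline.model, "hf_device_map", None))  # per-module placement
```

If `cuda_visible()` prints `False`, the issue is the environment (driver, CUDA build of PyTorch, or `CUDA_VISIBLE_DEVICES`) rather than the pipeline arguments.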
