running model hangs on macOS M1 Sonoma

#6
by engiai - opened

model generation never completes on M1 mac running Sonoma OS

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

print('Loading model...')
tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("databricks/dbrx-instruct", device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True)

print('Generating input prompts...')
input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt")

print('Running model...') # <- gets here
outputs = model.generate(**input_ids, max_new_tokens=200, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))
engiai changed discussion title from running model hands on macOS M1 Sonoma to running model hangs on macOS M1 Sonoma
Databricks org
•
edited Mar 27

You're running on CPU. It's going to take a very long time without a GPU, so it's likely just still running. This would be the case for any big LLM, and this is quite big.
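
If you want to confirm it is still making progress rather than hung, one option is to stream tokens to the terminal as they are generated. A minimal sketch using transformers' TextStreamer, reusing the tokenizer, model, and input_ids from your snippet above:

from transformers import TextStreamer

# Print tokens as they are produced, so slow CPU generation is visibly
# progressing instead of appearing to hang.
streamer = TextStreamer(tokenizer, skip_prompt=True)
outputs = model.generate(**input_ids, max_new_tokens=200,
                         pad_token_id=tokenizer.eos_token_id,
                         streamer=streamer)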

You can check out other 4-bit quantizations, which might work a lot better on Macs.
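
For example, on Apple Silicon a 4-bit quantization can be run with MLX via the mlx-lm package. A rough sketch; the repo id below is a placeholder, so check the Hub for an actual 4-bit dbrx-instruct conversion (e.g. under mlx-community):

# pip install mlx-lm
from mlx_lm import load, generate

# Placeholder repo id: substitute a real 4-bit dbrx-instruct conversion from the Hub.
model, tokenizer = load("mlx-community/dbrx-instruct-4bit")
response = generate(model, tokenizer,
                    prompt="What does it take to build a great LLM?",
                    max_tokens=200)
print(response)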

Affirmative, thank you!

engiai changed discussion status to closed

Can I use bitsandbytes to load an 8-bit version of dbrx?
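
For context, 8-bit loading with bitsandbytes through transformers is usually written like the sketch below, but bitsandbytes requires a CUDA GPU, so it would not help on an M1 Mac, and whether it works with dbrx's custom modeling code is something to verify:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Standard transformers + bitsandbytes 8-bit loading (needs a CUDA GPU).
quant_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dbrx-instruct",
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)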
