Model is not generating an answer, or it takes a really long time
Hello friends, I am pretty new here.
I just want to run Zephyr (HuggingFaceH4/zephyr-7b-beta) on my local machine.
I have 16 GB of RAM and a 6 GB RTX 3060.
This is the code from the documentation:
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
It worked once, after a really long time. Now it doesn't work at all.
I get this warning:
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu and disk.
I assume this code block tries to load the whole model, and it either gets stuck or fails because it needs more memory than my system has.
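For what it's worth, here is a quick check that seems to confirm what the warning says (assuming the pipeline at least finishes loading; hf_device_map is populated when device_map="auto" is used):

# Shows which layers ended up on the GPU versus the CPU/disk.
# Anything mapped to "cpu" or "disk" runs off the GPU and is very slow.
print(pipe.model.hf_device_map)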
But if I have this issue with my specs, I assume many other people are running into it on their machines too.
I have also seen deployments of this model served on Hugging Face, and they are really quick.
My question is: are there any improvements we can make to this code block? I believe this would help others as well.
Your GPU has too little memory, so some of the parameters are offloaded to the CPU (and disk) to use your system RAM. Inference then runs partly on the CPU, which is extremely slow, so the behaviour you are seeing is expected. The simplest fix is a GPU with more VRAM. You can check roughly which models fit your hardware with this space:
https://huggingface.co/spaces/Vokturz/can-it-run-llm
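If a bigger GPU isn't an option, one common workaround is to load the model quantized to 4-bit, so that a 7B model fits in roughly 4-5 GB of VRAM. This is only a sketch, not the official example: it assumes you have the bitsandbytes and accelerate packages installed, and output quality will differ slightly from the bfloat16 setup above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

model_id = "HuggingFaceH4/zephyr-7b-beta"

# 4-bit quantization config (needs the bitsandbytes package and a CUDA GPU)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

messages = [
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

If pipe.model.hf_device_map then reports only GPU indices (e.g. 0) and no "cpu" or "disk" entries, generation should be much faster.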