Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
I am trying to run the example code from the Llama model card. It has finished downloading the ~20 GB of weights, but now it is stuck here; every time I run the code it just doesn't move. What should I do: leave it until it finishes, or is there a manual step I can take to resolve this?
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
    pad_token_id=pipeline.tokenizer.eos_token_id,
)
print(outputs[0]["generated_text"][-1])
Did you find any solution to this?
Have you checked your GPU memory usage to see if the model was loading?
Well, I'm new to this site and to model training in general. I looked into it and found that 8B models require a lot of computational resources, while my laptop can only handle models up to about 3B parameters, so I gave up after that. I don't know if that was the reason behind the hang, but I gave up.
Thanks for replying, though.
Don't give up!
These kinds of large language models need some time to load and to generate a response for a given prompt.
In my case, even though I am using 64 GB of CPU RAM and two Nvidia RTX 2080 Ti GPUs (22 GB of VRAM in total), it takes 2 to 4 minutes (depending on the complexity of the prompt) to get a response.
If there is no error and your code does not stop or quit, please give it some time.
It should work after 5 to 10 minutes, maybe even longer…
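If you want to see whether generation is actually progressing rather than hanging, you can also stream tokens to the console as they are produced. Here is a minimal sketch using TextStreamer from transformers; it calls model.generate() directly instead of the pipeline, and it assumes the same model and access as the code above (this is an addition, not part of the model card code):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Build the prompt with the chat template and move it to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# TextStreamer prints each new token as soon as it is generated, so a truly
# stuck run looks very different from a slow but progressing one.
streamer = TextStreamer(tokenizer, skip_prompt=True)

model.generate(
    input_ids,
    max_new_tokens=256,
    streamer=streamer,
    pad_token_id=tokenizer.eos_token_id,
)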
Can you please share your hardware information: your GPU and CPU memory?
You can check this in Task Manager. You can also watch it while your code is running to see whether it is using the CPU, the GPU, or both.
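If you prefer to check from inside Python rather than Task Manager, a short sketch like this (assuming PyTorch with CUDA is installed) prints how much GPU memory the current process has allocated:

import torch

# Print per-GPU memory usage for this process (assumes CUDA is available).
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024**3
        reserved = torch.cuda.memory_reserved(i) / 1024**3
        print(f"GPU {i}: {allocated:.1f} GiB allocated, {reserved:.1f} GiB reserved")
else:
    print("No CUDA GPU detected; the model will run on the CPU.")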
You can also test the "meta-llama/Llama-2-7b-chat-hf" model:
https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
It is a little bit smaller than "meta-llama/Meta-Llama-3.1-8B-Instruct", so it may run faster and you may be able to get a response.
I hope you have a positive experience working with large language models!
I ran this model on an Alienware 17 R2 with 16 GB of RAM, a 4 GB GeForce GTX 980M, and a Core i7. I wasn't able to run the model with the default code on the model card page, but when I changed the argument device_map="auto" to device_map="cpu" it did take some time to load, and my resources were being used when I checked Task Manager, but then I terminated it (a sketch of that CPU-only setup is shown below). Anyway, I found a new model for my work.
Thanks a lot for the response, though.
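For reference, here is a minimal sketch of that CPU-only variant of the model card code. It is shown in float32 (which reliably runs on CPU but needs roughly 32 GB of RAM for an 8B model, part of why a 16 GB laptop struggles) and with a shorter max_new_tokens to keep the wait tolerable; CPU generation is still very slow:

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Same pipeline as the model card, but everything placed on the CPU.
# In float32 an 8B model needs roughly 32 GB of RAM, and CPU generation
# is far slower than GPU generation, so expect a long wait.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.float32},
    device_map="cpu",
)

messages = [
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=64,
    pad_token_id=pipeline.tokenizer.eos_token_id,
)
print(outputs[0]["generated_text"][-1])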
I have tested the sample code provided above.
My system information:
Windows 11
64 GB of CPU RAM
22 GB of GPU memory
I added some lines of code to record the time spent on the whole process: downloading the model, loading it into memory, and generating a response:
code snippet:
import datetime
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# Record the start time
start_time = datetime.datetime.now()
print(f"Process began at: {start_time}")

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, torch_dtype=torch.float16)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id,
)
print(outputs[0]["generated_text"][-1])

# Record the end time
end_time = datetime.datetime.now()
print(f"\nProcess finished at: {end_time}")

# Calculate the total time taken
total_time = end_time - start_time
print(f"Total time taken: {total_time}")
output:
Process began at: 2024-09-05 11:47:44.858321
Downloading shards: 100%|██████████| 4/4 [02:28<00:00, 37.02s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:17<00:00, 4.43s/it]
{'role': 'assistant', 'content': "Yer lookin' fer a swashbucklin' introduction, eh? Alright then, matey! I be Captain Chatbeard, the scurvy dog of pirate chatbots! Me and me trusty keyboard be sailin' the seven seas o' knowledge, ready to hoist the Jolly Roger and share me treasure o' information with ye! So, what be bringin' ye to these fair waters?"}
Process finished at: 2024-09-05 11:50:48.633064
Total time taken: 0:03:03.774743
Process finished with exit code 0
Hi guys,
The pad_token_id warning at the top was probably already handled by this part of the code:
outputs = pipeline(
    messages,
    max_new_tokens=256,
    pad_token_id=pipeline.tokenizer.eos_token_id,
)
but since the model requires a large amount of computing power, that is why there was no output for so long.
Like the poster above, I also gave up on running it locally and have switched to an online service, simply calling its API (see the sketch below).
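For anyone taking the same route, here is a minimal sketch of calling a hosted endpoint via huggingface_hub's InferenceClient instead of loading the weights locally. It assumes an HF token with access to the gated model, set as HF_TOKEN in the environment or passed explicitly; any hosted chat-completion service works similarly:

from huggingface_hub import InferenceClient

# Calls a hosted endpoint instead of downloading and loading the weights locally.
# Assumes HF_TOKEN is set in the environment (or pass token="..." explicitly).
client = InferenceClient("meta-llama/Meta-Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

response = client.chat_completion(messages=messages, max_tokens=256)
print(response.choices[0].message.content)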
Thank you, everyone. I am closing this topic.