Code doesn't run

#4
by batman9x - opened

[screenshot: image.png]

Why does the code run but not respond? How do I fix that?

I can't give you any advice without seeing your current code.

I used this code example:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = "vilm/vulture-40B"

tokenizer = AutoTokenizer.from_pretrained(model)
m = AutoModelForCausalLM.from_pretrained(
    model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "A chat between a curious user and an artificial intelligence assistant.\n\nUSER:Thành phố Hồ Chí Minh nằm ở đâu?<|endoftext|>ASSISTANT:"

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

output = m.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=50,
)
output = output[0].to("cpu")
print(tokenizer.decode(output))

Can you give us more details on your hardware configuration and the status of your CPU or GPU when you run the code?

[screenshot: image.png]

Hi, I believe you will need more than 24 GB of VRAM to run inference on a 40B model. Maybe try load_in_4bit=True; I don't know how much quantization will help, as I haven't tried it yet!
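As a rough sketch of why a 24 GB card is not enough, here is a back-of-the-envelope VRAM estimate for the weights alone, assuming 40 billion parameters (activations and the KV cache need extra headroom on top of this):

```python
# Rough VRAM estimate for the weights of a 40B-parameter model.
# bf16 stores 2 bytes per parameter; 4-bit quantization stores ~0.5 bytes.
params = 40e9
bf16_gb = params * 2 / 1024**3    # bfloat16 weights
int4_gb = params * 0.5 / 1024**3  # 4-bit quantized weights
print(f"bf16: ~{bf16_gb:.0f} GB, 4-bit: ~{int4_gb:.0f} GB")
```

Even at 4-bit, roughly 19 GB of weights plus the KV cache is tight on a single 24 GB GPU.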

You don't understand what I mean. The model loads normally, but when it runs, it freezes and nothing appears. Also, your code is missing the trust_remote_code=True parameter, and when I use your code, RWForCausalLM cannot use load_in_4bit. I don't know if something is wrong on your side :((
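For what it's worth, trust_remote_code=True and 4-bit loading are independent keyword arguments and should be combinable in a single from_pretrained call. A hypothetical sketch of the argument set follows (the from_pretrained call itself is left commented out, since it would download the full model; the exact quantization behavior depends on your transformers and bitsandbytes versions):

```python
# Hypothetical sketch: keyword arguments for loading a custom-code model
# (RWForCausalLM) in 4-bit. Note the parameter name is load_in_4bit,
# not load_in_4bits.
load_kwargs = {
    "trust_remote_code": True,  # vilm/vulture-40B ships custom modeling code
    "load_in_4bit": True,       # bitsandbytes 4-bit quantization
    "device_map": "auto",       # spread layers across available devices
}
# m = AutoModelForCausalLM.from_pretrained("vilm/vulture-40B", **load_kwargs)
print(sorted(load_kwargs))
```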
