Model inference problem after SFT [probability tensor contains either `inf`, `nan` or element < 0]

#50
by Saicy - opened

Traceback (most recent call last):
  File "/mnt/bn/intelligent-chatbot/FastChat_v2/LLaMA-Efficient-Tuning/src/inference.py", line 504, in <module>
    main(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/bn/intelligent-chatbot/FastChat_v2/LLaMA-Efficient-Tuning/src/inference.py", line 380, in main
    outputs_tokenized = model.generate(**prompts_tokenized, do_sample=True,max_new_tokens=512,pad_token_id=tokenizer.eos_token_id,temperature=0.3)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/generation/utils.py", line 1592, in generate
    return self.sample(
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/generation/utils.py", line 2734, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0

Google org

Hi @Saicy
How do you run inference? It seems you are using Llama-factory. Can you elaborate on how you're getting that error? Is it after fine-tuning?

Yes, after I finish fine-tuning I can't run inference. I run it with the accelerate framework:
outputs_tokenized = model.generate(**prompts_tokenized, do_sample=True, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id, temperature=0.3)
outputs_tokenized = [tok_out[len(tok_in):] for tok_in, tok_out in zip(prompts_tokenized["input_ids"], outputs_tokenized)]
outputs = tokenizer.batch_decode(outputs_tokenized, skip_special_tokens=True)
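
For reference, a minimal check along these lines (assuming the same `model` and `prompts_tokenized` objects as in the snippet above) would confirm whether the logits already contain `inf`/`nan` before sampling:

```python
import torch

# Plain forward pass over the same tokenized prompts, no sampling involved.
with torch.no_grad():
    logits = model(**prompts_tokenized).logits  # (batch, seq_len, vocab_size)

# If either of these prints True, the hidden states/logits are already corrupted
# before torch.multinomial is ever reached.
print("NaN in logits:", torch.isnan(logits).any().item())
print("Inf in logits:", torch.isinf(logits).any().item())
```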

Google org

Hmm @Saicy, this usually happens when you have NaNs in your hidden states. Have you trained your model in fp16 by any chance?
If that's the case, you should either switch to bf16, or to fp32 + mixed precision training.
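
As a rough sketch (assuming a standard `transformers` Trainer-based SFT setup; the `out-sft` path and the other arguments are placeholders), that would mean training with `bf16=True` and loading the fine-tuned checkpoint in bfloat16 for inference:

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# Training: use bf16 instead of fp16 so activations don't overflow to inf/NaN.
training_args = TrainingArguments(
    output_dir="out-sft",  # placeholder path
    bf16=True,             # instead of fp16=True
    # ... keep the rest of your SFT hyperparameters unchanged
)

# Inference: load the fine-tuned weights in bfloat16 as well.
model = AutoModelForCausalLM.from_pretrained(
    "out-sft",                   # placeholder path to the fine-tuned checkpoint
    torch_dtype=torch.bfloat16,
)
```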

Got it!

I trained with the bf16 parameter.
