Model inference problem after SFT [probability tensor contains either `inf`, `nan` or element < 0]

#50
by Saicy - opened

Traceback (most recent call last):
  File "/mnt/bn/intelligent-chatbot/FastChat_v2/LLaMA-Efficient-Tuning/src/inference.py", line 504, in <module>
    main(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/bn/intelligent-chatbot/FastChat_v2/LLaMA-Efficient-Tuning/src/inference.py", line 380, in main
    outputs_tokenized = model.generate(**prompts_tokenized, do_sample=True,max_new_tokens=512,pad_token_id=tokenizer.eos_token_id,temperature=0.3)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/generation/utils.py", line 1592, in generate
    return self.sample(
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/generation/utils.py", line 2734, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0

Google org

Hi @Saicy
How do you run inference? It seems you are using Llama-factory. Can you elaborate on how you're getting that error? Is it after fine-tuning?

Yes, after I finish fine-tuning I can't run inference. I run it with the accelerate framework:
outputs_tokenized = model.generate(**prompts_tokenized, do_sample=True, max_new_tokens=512, pad_token_id=tokenizer.eos_token_id, temperature=0.3)
outputs_tokenized = [tok_out[len(tok_in):] for tok_in, tok_out in zip(prompts_tokenized["input_ids"], outputs_tokenized)]
outputs = tokenizer.batch_decode(outputs_tokenized, skip_special_tokens=True)
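
For reference, a minimal check along these lines (assuming the same `model` and `prompts_tokenized` objects as in the snippet above) would confirm whether the logits already contain `inf`/`nan` before sampling:

```python
import torch

# Plain forward pass over the same tokenized prompts, no sampling involved.
with torch.no_grad():
    logits = model(**prompts_tokenized).logits  # (batch, seq_len, vocab_size)

# If either of these prints True, the hidden states/logits are already corrupted
# before torch.multinomial is ever reached.
print("NaN in logits:", torch.isnan(logits).any().item())
print("Inf in logits:", torch.isinf(logits).any().item())
```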

Google org

Hmm @Saicy, this usually happens when you have NaNs in your hidden states. Have you trained your model in fp16 by any chance?
If that's the case, you should either switch to bf16, or to fp32 + mixed precision training.
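
As a rough sketch (assuming a standard `transformers` Trainer-based SFT setup; the `out-sft` path and the other arguments are placeholders), that would mean training with `bf16=True` and loading the fine-tuned checkpoint in bfloat16 for inference:

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# Training: use bf16 instead of fp16 so activations don't overflow to inf/NaN.
training_args = TrainingArguments(
    output_dir="out-sft",  # placeholder path
    bf16=True,             # instead of fp16=True
    # ... keep the rest of your SFT hyperparameters unchanged
)

# Inference: load the fine-tuned weights in bfloat16 as well.
model = AutoModelForCausalLM.from_pretrained(
    "out-sft",                   # placeholder path to the fine-tuned checkpoint
    torch_dtype=torch.bfloat16,
)
```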

Got it!

I trained with the bf16 parameter.
