Inference is far too slow on 8x A100 GPUs
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import os

# Allocator and visibility settings for the 8 cards.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

device = torch.device("cuda")
gpus = [0, 1, 2, 3, 4, 5, 6, 7]  # currently unused

tokenizer = AutoTokenizer.from_pretrained("/data/ygmeng/xuanyuan", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/data/ygmeng/xuanyuan", trust_remote_code=True, device_map="auto")
```
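With device_map="auto", Accelerate decides where every layer lands, and layers can silently end up on the CPU (or disk) if the planner runs out of GPU budget. A minimal sketch for checking the actual placement right after loading, using the model object from above (the hf_device_map attribute is set whenever a device map is used):

```python
from collections import Counter

# Sketch: hf_device_map maps module names to a GPU index, "cpu", or "disk".
placement = Counter(str(dev) for dev in model.hf_device_map.values())
print(placement)

# Any "cpu" or "disk" entries mean those layers execute off-GPU,
# which would explain both the warning below and the slow generation.
```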
I ran into three problems in total; the first two are already solved.
1. Without device_map="auto", the model is not loaded onto the GPUs directly and loading takes a very long time. This is already solved (see also the loading sketch below).
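As a side note on load time and memory: from_pretrained also accepts torch_dtype and low_cpu_mem_usage (both standard arguments), which avoid first building a full fp32 copy of the weights in host RAM. A minimal sketch, assuming the checkpoint tolerates fp16:

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch: stream weights in (low_cpu_mem_usage) and load in half precision.
model = AutoModelForCausalLM.from_pretrained(
    "/data/ygmeng/xuanyuan",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,  # assumption: fp16 is acceptable for this checkpoint
    low_cpu_mem_usage=True,
)
```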
2. Loading with the code above fails on the first attempt with:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.40 GiB (GPU 0; 79.15 GiB total capacity; 68.53 GiB already allocated; 9.66 GiB free; 68.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

Retrying once, the model loads successfully.
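The traceback shows GPU 0 filling up during loading. One way to keep the planner from packing a single card is an explicit max_memory budget, a standard from_pretrained option forwarded to Accelerate; the "70GiB" headroom below is an assumed value, not something verified on this setup:

```python
from transformers import AutoModelForCausalLM

# Sketch: cap what device_map="auto" may place on each 80 GB card,
# so shards get spread out instead of packing GPU 0 to the brim.
max_memory = {i: "70GiB" for i in range(8)}  # assumed per-GPU budget

model = AutoModelForCausalLM.from_pretrained(
    "/data/ygmeng/xuanyuan",
    trust_remote_code=True,
    device_map="auto",
    max_memory=max_memory,
)
```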
3. Inference is very slow and the GPUs are not being used.
```python
model = model.eval()
inputs = tokenizer("Human:你是谁?\n\nAssistant: ", return_tensors="pt")
inputs = inputs.to(device)  # move input_ids to the GPU
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    early_stopping=True,
    repetition_penalty=1.1,
    min_new_tokens=1,
    max_new_tokens=256,
)
```
This produces the following warning:

```
You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cuda, whereas the model is on cpu. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cpu') before running `.generate()`.
```

So input_ids is on the GPU while the model reports being on the CPU.
If I then call model.to(device), it OOMs:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.30 GiB (GPU 0; 79.15 GiB total capacity; 77.72 GiB already allocated; 479.62 MiB free; 77.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
如果inputs.to("cpu"),推理速度很慢,完全用cpu进行推理。
How can this be resolved?
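One direction worth checking (a hedged sketch, not a confirmed fix): with a model sharded by device_map="auto", the inputs only need to sit on the device of the first shard, since Accelerate's hooks move activations between shards. model.device reports where the first parameters live, so aligning the inputs to it avoids both the mismatch warning and a full model.to(device):

```python
# Sketch: align inputs with the model's first shard instead of a fixed device.
inputs = tokenizer("Human:你是谁?\n\nAssistant: ", return_tensors="pt")
inputs = inputs.to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

If model.device itself reports cpu, the first layers were offloaded, and the real problem is the placement from problem 2 (see the hf_device_map and max_memory sketches above).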
Full script, for reference:

```python
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device("cuda")
gpus = [0, 1, 2, 3, 4, 5, 6, 7]

tokenizer = AutoTokenizer.from_pretrained("/data/ygmeng/xuanyuan", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/data/ygmeng/xuanyuan", trust_remote_code=True, device_map="auto")
```
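Putting the sketches above together, one end-to-end variant worth trying (all hedged assumptions: fp16 checkpoint, a 70 GiB per-GPU cap, inputs aligned to model.device; none of this is verified on the actual machine):

```python
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/data/ygmeng/xuanyuan", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "/data/ygmeng/xuanyuan",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,                  # assumption: fp16 is acceptable
    max_memory={i: "70GiB" for i in range(8)},  # assumed per-GPU cap
    low_cpu_mem_usage=True,
).eval()

# Fail fast if anything was still offloaded to the CPU.
assert "cpu" not in {str(d) for d in model.hf_device_map.values()}, "layers offloaded to CPU"

inputs = tokenizer("Human:你是谁?\n\nAssistant: ", return_tensors="pt").to(model.device)
output = model.generate(**inputs, do_sample=True, temperature=0.8, top_k=50,
                        top_p=0.9, repetition_penalty=1.1, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```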