Text Generation
Transformers
PyTorch
Chinese
bloom
Inference Endpoints
text-generation-inference

8x A100 GPUs, inference is too slow

#9
by yuguang - opened

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import os

# Limit allocator block size to reduce fragmentation, and expose all 8 GPUs
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'

device = torch.device('cuda')
gpus = [0, 1, 2, 3, 4, 5, 6, 7]

tokenizer = AutoTokenizer.from_pretrained("/data/ygmeng/xuanyuan", trust_remote_code=True)
# device_map="auto" lets accelerate shard the model across the visible GPUs
model = AutoModelForCausalLM.from_pretrained("/data/ygmeng/xuanyuan", trust_remote_code=True, device_map="auto")
I ran into 3 problems in total; the first two are already solved.
1. Without device_map="auto", the model is not loaded onto the GPU directly and loading takes a very long time. This is solved now.
2. Loading with the code above throws:
(torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.40 GiB (GPU 0; 79.15 GiB total capacity; 68.53 GiB already allocated; 9.66 GiB free; 68.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF)
The first load attempt failed.
I retried once and it loaded successfully.
3. Inference is very slow and the GPUs are not being used.
model = model.eval()
# Tokenize the prompt and move the input tensors onto the GPU
inputs = tokenizer("Human:你是谁?\n\nAssistant: ", return_tensors='pt')
inputs = inputs.to(device)
output = model.generate(**inputs, do_sample=True, temperature=0.8, top_k=50, top_p=0.9, early_stopping=True, repetition_penalty=1.1, min_new_tokens=1, max_new_tokens=256)
At this point the following warning is printed:
You are calling .generate() with the input_ids being on a device type different than your model's device. input_ids is on cuda, whereas the model is on cpu. You may experience unexpected behaviors or slower generation. Please make sure that you have put input_ids to the correct device by calling for example input_ids = input_ids.to('cpu') before running .generate().
The warning says input_ids are on the GPU while the model is on the CPU.
If I then call model.to(device), I get an OOM:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.30 GiB (GPU 0; 79.15 GiB total capacity; 77.72 GiB already allocated; 479.62 MiB free; 77.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
If I instead move the inputs with inputs.to("cpu"), inference is very slow because it runs entirely on the CPU.
How can this be solved?
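The "model is on cpu" warning together with the OOM from model.to(device) suggests that device_map="auto" offloaded part of the weights to the CPU, likely because the checkpoint was loaded in float32 and did not fit on the GPUs. A minimal sketch of one way to diagnose and work around this, assuming the checkpoint can run in float16 on 80 GB A100s; the 70GiB per-GPU budget, the use of model.device, and the decode step are my assumptions, not something confirmed in this thread:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/data/ygmeng/xuanyuan"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Load in half precision so the weights fit across the 8 GPUs without CPU offload.
# max_memory caps per-GPU usage; the "70GiB" budget is an assumption for 80GB A100s.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
    max_memory={i: "70GiB" for i in range(8)},
)

# Inspect where accelerate placed each module; any "cpu" or "disk" entry here
# means part of the model was offloaded, which explains the warning and the slowdown.
print(model.hf_device_map)

# With device_map="auto", do not call model.to(device); only move the inputs
# to the device of the first shard and call generate as usual.
inputs = tokenizer("Human:你是谁?\n\nAssistant: ", return_tensors="pt").to(model.device)
output = model.generate(**inputs, do_sample=True, temperature=0.8, top_k=50,
                        top_p=0.9, repetition_penalty=1.1, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))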

yuguang changed discussion title from 8x A100 GPUs, out of memory when loading, but actually only one card is fully occupied to 8x A100 GPUs, out of memory when loading
yuguang changed discussion title from 8x A100 GPUs, out of memory when loading to 8x A100 GPUs, inference is too slow

import os
# Set the allocator config and device visibility before importing torch so they take effect
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device('cuda')
gpus = [0, 1, 2, 3, 4, 5, 6, 7]

tokenizer = AutoTokenizer.from_pretrained("/data/ygmeng/xuanyuan", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/data/ygmeng/xuanyuan", trust_remote_code=True, device_map="auto")

The model produces no inference output.
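If generate() finishes but nothing appears, note that it only returns token ids; they have to be decoded and printed explicitly. A minimal sketch, assuming the loading code above has already run and model/tokenizer are in scope; the timing printout is only there to check whether generation is running at GPU rather than CPU speed:

import time

prompt = "Human:你是谁?\n\nAssistant: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
output = model.generate(**inputs, do_sample=True, temperature=0.8, top_k=50,
                        top_p=0.9, repetition_penalty=1.1, max_new_tokens=256)
elapsed = time.time() - start

# generate() returns token ids; decode only the tokens produced after the prompt
new_tokens = output[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
print(f"{len(new_tokens)} tokens in {elapsed:.1f}s "
      f"({len(new_tokens) / elapsed:.2f} tokens/s)")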
