Text Generation
Transformers
PyTorch
Chinese
bloom
Inference Endpoints
text-generation-inference

8x A100 GPUs, inference is too slow

#9
by yuguang - opened

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import os

# Limit allocator block size to reduce fragmentation, and expose all 8 GPUs
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'

device = torch.device('cuda')
gpus = [0, 1, 2, 3, 4, 5, 6, 7]

tokenizer = AutoTokenizer.from_pretrained("/data/ygmeng/xuanyuan", trust_remote_code=True)
# device_map="auto" lets accelerate shard the model across the visible GPUs
model = AutoModelForCausalLM.from_pretrained("/data/ygmeng/xuanyuan", trust_remote_code=True, device_map="auto")
I ran into 3 problems in total; the first two are already solved.
1. Without device_map="auto", the model is not loaded onto the GPU directly and loading takes a very long time. This is solved now.
2. Loading with the code above throws:
(torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.40 GiB (GPU 0; 79.15 GiB total capacity; 68.53 GiB already allocated; 9.66 GiB free; 68.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF)
The first load attempt failed.
I retried once and it loaded successfully.
3. Inference is very slow and the GPUs are not being used.
model = model.eval()
# Tokenize the prompt and move the input tensors onto the GPU
inputs = tokenizer("Human:你是谁?\n\nAssistant: ", return_tensors='pt')
inputs = inputs.to(device)
output = model.generate(**inputs, do_sample=True, temperature=0.8, top_k=50, top_p=0.9, early_stopping=True, repetition_penalty=1.1, min_new_tokens=1, max_new_tokens=256)
At this point the following warning is printed:
You are calling .generate() with the input_ids being on a device type different than your model's device. input_ids is on cuda, whereas the model is on cpu. You may experience unexpected behaviors or slower generation. Please make sure that you have put input_ids to the correct device by calling for example input_ids = input_ids.to('cpu') before running .generate().
The warning says input_ids are on the GPU while the model is on the CPU.
If I then call model.to(device), I get an OOM:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.30 GiB (GPU 0; 79.15 GiB total capacity; 77.72 GiB already allocated; 479.62 MiB free; 77.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
If I instead move the inputs with inputs.to("cpu"), inference is very slow because it runs entirely on the CPU.
How can this be solved?
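The "model is on cpu" warning together with the OOM from model.to(device) suggests that device_map="auto" offloaded part of the weights to the CPU, likely because the checkpoint was loaded in float32 and did not fit on the GPUs. A minimal sketch of one way to diagnose and work around this, assuming the checkpoint can run in float16 on 80 GB A100s; the 70GiB per-GPU budget, the use of model.device, and the decode step are my assumptions, not something confirmed in this thread:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/data/ygmeng/xuanyuan"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Load in half precision so the weights fit across the 8 GPUs without CPU offload.
# max_memory caps per-GPU usage; the "70GiB" budget is an assumption for 80GB A100s.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
    max_memory={i: "70GiB" for i in range(8)},
)

# Inspect where accelerate placed each module; any "cpu" or "disk" entry here
# means part of the model was offloaded, which explains the warning and the slowdown.
print(model.hf_device_map)

# With device_map="auto", do not call model.to(device); only move the inputs
# to the device of the first shard and call generate as usual.
inputs = tokenizer("Human:你是谁?\n\nAssistant: ", return_tensors="pt").to(model.device)
output = model.generate(**inputs, do_sample=True, temperature=0.8, top_k=50,
                        top_p=0.9, repetition_penalty=1.1, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))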

yuguang changed discussion title from 8x A100 GPUs, out of memory when loading, but actually only one card is fully occupied to 8x A100 GPUs, out of memory when loading
yuguang changed discussion title from 8x A100 GPUs, out of memory when loading to 8x A100 GPUs, inference is too slow

import os
# Set the allocator config and device visibility before importing torch so they take effect
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device('cuda')
gpus = [0, 1, 2, 3, 4, 5, 6, 7]

tokenizer = AutoTokenizer.from_pretrained("/data/ygmeng/xuanyuan", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/data/ygmeng/xuanyuan", trust_remote_code=True, device_map="auto")

The model produces no inference output.
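If generate() finishes but nothing appears, note that it only returns token ids; they have to be decoded and printed explicitly. A minimal sketch, assuming the loading code above has already run and model/tokenizer are in scope; the timing printout is only there to check whether generation is running at GPU rather than CPU speed:

import time

prompt = "Human:你是谁?\n\nAssistant: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
output = model.generate(**inputs, do_sample=True, temperature=0.8, top_k=50,
                        top_p=0.9, repetition_penalty=1.1, max_new_tokens=256)
elapsed = time.time() - start

# generate() returns token ids; decode only the tokens produced after the prompt
new_tokens = output[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
print(f"{len(new_tokens)} tokens in {elapsed:.1f}s "
      f"({len(new_tokens) / elapsed:.2f} tokens/s)")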
