very slow inference speed on 2x A100 80GB with 4-bit (main branch)

#6 opened by willowill5

Thank you so much for this work! I am currently trying to deploy a real-time inference service and am seeing quite slow generation speeds. Is there something I can do to speed this up? I have also noticed that GPU utilization is quite low.
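
For reference, one way to confirm the low utilization is to poll both GPUs from a second process while generation is running, either with nvidia-smi or with pynvml (the nvidia-ml-py package). A minimal sketch, assuming pynvml is installed:

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # instantaneous SM utilization
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: {util.gpu}% util, {mem.used / 1e9:.0f}/{mem.total / 1e9:.0f} GB used")
pynvml.nvmlShutdown()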

ENV:
Latest PyTorch
pip install transformers==4.33.0
python -m pip install git+https://github.com/huggingface/optimum.git
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
CUDA 11.8
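
A quick sanity check of this environment (a minimal sketch; the distribution names passed to importlib.metadata.version are assumptions about how the packages register themselves):

import importlib.metadata
import torch
import transformers

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("transformers:", transformers.__version__)
for pkg in ("optimum", "auto-gptq"):
    print(f"{pkg}:", importlib.metadata.version(pkg))
print("GPUs visible to torch:", torch.cuda.device_count())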

CODE:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "TheBloke/Falcon-180B-Chat-GPTQ"

# Load the GPTQ-quantized model; device_map="auto" lets accelerate
# shard the layers across both A100s
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    revision="main",
)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=50,
    temperature=0.7,
    do_sample=True,
    top_p=0.95,
    repetition_penalty=1.15,
    device_map="auto",
)

prompt = "you are a bot who..."
prompt_template = f'''{prompt}
Assistant: '''

print(pipe(prompt_template)[0]['generated_text'])
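
To put a number on "quite slow", here is a rough throughput measurement wrapped around the pipeline above (a minimal sketch; it reuses the pipe, tokenizer and prompt_template objects from the code and approximates the new-token count by re-tokenizing the completion):

import time

start = time.perf_counter()
generated = pipe(prompt_template)[0]["generated_text"]
elapsed = time.perf_counter() - start

# generated_text includes the prompt, so strip it before counting new tokens
completion = generated[len(prompt_template):]
new_tokens = len(tokenizer(completion)["input_ids"])
print(f"{new_tokens} new tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.2f} tokens/s")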
