
Why is inference very slow?

#17
by hanswang73 - opened

Nvidia A40, 48 GB GPU memory, 80 GB CPU memory
CUDA 11.8
transformers == 4.31.0
8-bit quantization
using TextIteratorStreamer for inference (roughly as in the sketch below)
The speed is about 1 token per second.
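For reference, a setup along the lines described above (8-bit quantization plus TextIteratorStreamer) would look roughly like this sketch. The model id and generation parameters are placeholders for illustration, not the poster's exact code:

```python
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Placeholder checkpoint; substitute the actual MPT model being used.
model_id = "mosaicml/mpt-7b-chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,       # 8-bit quantization via bitsandbytes
    device_map="auto",
    trust_remote_code=True,  # MPT ships custom modeling code
)

prompt = "Explain what a GPU does."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Stream tokens as they are generated instead of waiting for the full output.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=256)

# generate() runs in a background thread so the main thread can consume the stream.
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end="", flush=True)
thread.join()
```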
