
Why is inference very slow?

#17
by hanswang73 - opened

Nvidia A40, 48 GB GPU memory, 80 GB CPU memory
CUDA 11.8
transformers == 4.31.0
8-bit quantization
using TextIteratorStreamer for inference (roughly as in the sketch below)
The speed is about 1 token per second.
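For reference, a setup along the lines described above (8-bit quantization plus TextIteratorStreamer) would look roughly like this sketch. The model id and generation parameters are placeholders for illustration, not the poster's exact code:

```python
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Placeholder checkpoint; substitute the actual MPT model being used.
model_id = "mosaicml/mpt-7b-chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,       # 8-bit quantization via bitsandbytes
    device_map="auto",
    trust_remote_code=True,  # MPT ships custom modeling code
)

prompt = "Explain what a GPU does."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Stream tokens as they are generated instead of waiting for the full output.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=256)

# generate() runs in a background thread so the main thread can consume the stream.
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end="", flush=True)
thread.join()
```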
