Inference takes more than 10 min
#38
by shravanveldurthi - opened
I am following the same code provided here for TheBloke/Llama-2-70B-chat-GPTQ. If I run it on Google Colab (CUDA version 12.0, torch 2.0.1+cu118), I get the summarization in under 40 seconds. On a Linux VM (CUDA version 12.2, torch 2.0.1+cu117), however, I get the following warnings:
- CUDA extension not installed.
- skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
and the same summarization takes more than 500 seconds.
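For reference, this is roughly the loading and generation code I am running on both machines (a minimal sketch based on the AutoGPTQ example from the model card; the prompt and generation parameters here are illustrative, not my exact values):

```python
# Minimal sketch of the loading/summarization path, based on the model card's
# AutoGPTQ example; prompt and generation settings are illustrative only.
from transformers import AutoTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM

model_name = "TheBloke/Llama-2-70B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,   # same setting on both Colab and the Linux VM
    quantize_config=None,
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
)

print(pipe("Summarize the following text: ...")[0]["generated_text"])
```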
How can I fix this?
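For what it's worth, this is the quick environment check I ran on each machine to compare the torch/CUDA setups (just printing which CUDA toolkit the installed torch wheel was built against and whether the GPU is visible):

```python
# Quick environment check on each machine: which CUDA toolkit torch was built
# against, and whether torch actually sees the GPU.
import torch

print(torch.__version__)         # e.g. 2.0.1+cu118 on Colab, 2.0.1+cu117 on the Linux VM
print(torch.version.cuda)        # CUDA version the torch wheel was built with
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no GPU visible")
```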