Why is model inference so slow?

by wangchenkang2023 - opened Sep 14, 2023

Sep 14, 2023

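I'm deploying the model for inference on a V100 32GB card and loading it in half precision with .half(). Inference is very slow: after 30 minutes there is still no result. My input is a little over 1,700 tokens, and I'm trying to use the model for text2SQL.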
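For anyone hitting the same thing, a rough way to narrow it down is to confirm where the weights actually live and to time a short, capped generation. The sketch below is only an illustration under assumptions: the model id is a hypothetical placeholder, `tokens_per_second` and `main` are helper names made up for this example, and nothing in the thread confirms that device placement is the cause here. It is simply one thing worth ruling out, since calling .half() alone does not move a model to the GPU.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(generate: Callable[[], Sequence[int]], prompt_len: int) -> float:
    """Time one generate() call and return newly generated tokens per second."""
    start = time.perf_counter()
    output_ids = generate()
    elapsed = time.perf_counter() - start
    new_tokens = len(output_ids) - prompt_len
    return new_tokens / elapsed if elapsed > 0 else float("inf")


def main() -> None:
    # Heavy imports are kept local so the helper above works without a GPU.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "your-org/your-text2sql-model"  # hypothetical placeholder, not from the thread
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda")

    # Check that the weights are on the GPU and in fp16.
    first_param = next(model.parameters())
    print(first_param.device, first_param.dtype)

    inputs = tok("-- your schema and question here", return_tensors="pt").to("cuda")
    prompt_len = inputs["input_ids"].shape[1]

    def run() -> list:
        # Cap the output length so a missing stop condition cannot run for minutes.
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=64)
        return out[0].tolist()

    print(f"{tokens_per_second(run, prompt_len):.1f} new tokens/s")

# Call main() on a machine with a CUDA GPU and the model downloaded.
```

If the printed device is `cpu`, or if the run only finishes once `max_new_tokens` is capped, that points to placement or an unbounded generation length rather than the library itself.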

Oct 12, 2023

Was this ever resolved or mitigated? I'm seeing a similar problem on an A800: inference is extremely slow.

Yhyu13

Oct 12, 2023

You're using HF transformers, right? That is extremely slow. You need exllama + flash attention to fully saturate CUDA.
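As a partial illustration of the flash attention half of this suggestion, recent transformers releases accept an `attn_implementation` argument in `from_pretrained`. The sketch below is an assumption-laden example, not a tested recipe for the model in this thread: `pick_attn_implementation` and `load_model` are hypothetical helper names, the `flash-attn` package must be installed separately, and the model architecture has to support the chosen backend. It does not cover exllama. Note that FlashAttention-2 targets Ampere or newer GPUs (compute capability 8.0+), so it applies to the A800 but not to the V100 (7.0), which is why the helper falls back to PyTorch's built-in SDPA attention there.

```python
def pick_attn_implementation(cc_major: int) -> str:
    """Pick an attention backend from the GPU's major compute capability.

    FlashAttention-2 kernels need compute capability 8.0 or higher;
    older cards fall back to PyTorch scaled-dot-product attention.
    """
    return "flash_attention_2" if cc_major >= 8 else "sdpa"


def load_model(model_id: str):
    # Heavy imports are kept local so the helper above works without a GPU.
    import torch
    from transformers import AutoModelForCausalLM

    major, _minor = torch.cuda.get_device_capability()
    return AutoModelForCausalLM.from_pretrained(
        model_id,  # whatever checkpoint you are serving
        torch_dtype=torch.float16,
        attn_implementation=pick_attn_implementation(major),
    ).to("cuda")
```

Whether this closes the gap enough for a 1,700-token prompt would need to be measured on the actual model.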

Oct 17, 2023

> You're using HF transformers, right? That is extremely slow. You need exllama + flash attention to fully saturate CUDA.

Yes, it's HF transformers. GPU usage is only about half. The official example I looked at uses transformers.

Yhyu13

Oct 17, 2023

>> You're using HF transformers, right? That is extremely slow. You need exllama + flash attention to fully saturate CUDA.

> Yes, it's HF transformers. GPU usage is only about half. The official example I looked at uses transformers.

Heh. HF transformers just crawls along like a turtle.
