Speed comparison with FasterTransformer for ChatGLM?
Hi, great work and thanks for sharing.
According to the post (https://mp.weixin.qq.com/s/uV4Y_q4GnTUAsRVHxJGxGA), the inference code is based on FasterTransformer (FT), with custom modifications for speed.
Could you kindly share a speed comparison between FT and lyraChatGLM?
Thanks.
@xiangli Hi, the original FT doesn't natively support ChatGLM (several ops behave differently), so a direct comparison isn't possible yet. We're still working on fixing these issues and will report the speed of a pure FT version later.
@xiangli We have released a new accelerated version and removed the previous TensorRT-based one. The new version has been heavily optimized at the source-code level, giving better performance, easier use, and broader GPU compatibility. Please update and feel free to try it out.
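In the meantime, you can measure the speedup on your own hardware with a minimal timing harness like the sketch below. This is a sketch only: `lyra_model` and `ft_model` in the commented usage are hypothetical handles for however you load each backend, and the whitespace-split token count is just a rough throughput proxy; swap in the real tokenizer for exact numbers.

```python
import time

def benchmark_generate(generate_fn, prompt, n_runs=5, warmup=1):
    """Time a text-generation callable and report approximate tokens/sec.

    generate_fn is assumed to take a prompt string and return the
    generated text. The token count below is a rough whitespace-based
    proxy; use the model's tokenizer for accurate throughput.
    """
    for _ in range(warmup):
        generate_fn(prompt)  # warm up kernels / caches before timing

    start = time.perf_counter()
    outputs = [generate_fn(prompt) for _ in range(n_runs)]
    elapsed = time.perf_counter() - start

    n_tokens = sum(len(o.split()) for o in outputs)
    print(f"{n_runs} runs in {elapsed:.2f}s, ~{n_tokens / elapsed:.1f} tokens/s")

# Hypothetical usage -- lyra_model / ft_model stand in for however you
# load lyraChatGLM and a patched FasterTransformer ChatGLM:
# benchmark_generate(lambda p: lyra_model.generate(p), "你好,请介绍一下自己")
# benchmark_generate(lambda p: ft_model.generate(p), "你好,请介绍一下自己")
```

Running both backends through the same harness, with identical prompts, output lengths, and sampling settings, keeps the comparison apples-to-apples.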