Potential ways to reduce inference latency on CPU cluster?

#11 · opened by TheBacteria

What are some of the potential approaches or arguments that can help reduce the inference latency on a CPU cluster?

Hi @TheBacteria, Intel provides an effective LLM quantization tool (Intel Neural Compressor, https://github.com/intel/neural-compressor ) to generate low-bit models (e.g., INT4/FP4/NF4 and INT8), and an LLM runtime (Intel Extension for Transformers, https://github.com/intel/intel-extension-for-transformers/tree/main ) that demonstrates efficient inference on Intel platforms by extending the Hugging Face Transformers APIs.
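A minimal sketch of what that extended API looks like in practice (the model id, prompt, and generation settings below are illustrative assumptions, not part of the original answer); it swaps the stock `transformers` model class for the extension's drop-in replacement and requests weight-only 4-bit quantization:

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # assumption: any Hugging Face causal LM id
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# load_in_4bit applies weight-only INT4 quantization and runs inference
# through the CPU-optimized LLM runtime shipped with the extension.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

outputs = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Quantized weights plus the optimized runtime are what cut memory bandwidth pressure and per-token latency on CPU; see the links below for measured numbers.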

You can also refer to the paper: https://huggingface.co/papers/2311.16133 and the blog: https://medium.com/intel-analytics-software/efficient-streaming-llm-with-intel-extension-for-transformers-runtime-31ee24577d26

Fantastic! Thanks so much @lvkaokao for such a helpful response.
