The QuIP# ecosystem is growing :)
I saw a QuIP# 2-bit Qwen-72B-Chat model on the Hub today, which shows there is now support for vLLM inference.
This will speed up inference and make high-performing 2-bit models more practical. I'm considering quipping MoMo now, as otherwise I can only use a brief context window of Qwen-72B on my system, even with bnb double quantization (roughly the setup sketched below).
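For reference, this is roughly the bitsandbytes double-quant fallback I mean, using the standard transformers API. It's just a sketch; the exact model id and device mapping depend on your setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 with double quantization of the quantization constants
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B-Chat",          # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)
```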
keyfan/Qwen-72B-Chat-2bit
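Loading that checkpoint with vLLM could look something like the sketch below. Whether the stock vLLM build suffices or you need the QuIP-for-all integration, and the exact `quantization` flag value, are assumptions on my part; check the repo's docs:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="keyfan/Qwen-72B-Chat-2bit",
    quantization="quip",       # assumed flag name; see the QuIP-for-all docs
    trust_remote_code=True,    # Qwen checkpoints ship custom modeling code
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Give me a short introduction to QuIP# quantization."], params)
print(outputs[0].outputs[0].text)
```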
Also note the easier-to-use QuIP-for-all library :)
https://github.com/chu-tianxiang/QuIP-for-all