Missing the RMSNorm and activation_quant?

#1
by g1y5x3 - opened

According to the original paper, those two components are also used.
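For reference, here is a minimal sketch of the two components roughly as described in the paper (a PyTorch reconstruction on my part; the exact code in this repo or the official release may differ):

```python
import torch
import torch.nn as nn

def activation_quant(x: torch.Tensor) -> torch.Tensor:
    # Per-token absmax quantization of activations to 8 bits,
    # simulated in floating point ("fake quant").
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127) / scale

class RMSNorm(nn.Module):
    # Standard RMSNorm, applied to the BitLinear input before quantization.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight
```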

That's true; they are intentionally left out of the current version, since we want to test the performance module by module. I will update my README to make that clear.

Another reason is that, if I understand correctly, the activation quant on the inputs and the additional norm would actually make the model slower than the original one. You don't want to run quantization functions during inference. In my tests, the model runs 2-3x faster at inference after removing the weight quant, which shows that these quantization steps significantly affect inference efficiency. I would expect a similar slowdown from applying their activation quant, so the model very likely runs faster without it.
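One rough way to check that overhead, a hypothetical micro-benchmark comparing a plain linear layer against one that runs the fake-quant functions on every forward pass (`FakeQuantLinear` and the shapes below are my own choices; actual numbers will depend on hardware and model size):

```python
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

def weight_quant(w):
    # Ternary absmean quantization ({-1, 0, 1}), simulated in floating point.
    scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    return (w * scale).round().clamp(-1, 1) / scale

def activation_quant(x):
    # Per-token 8-bit absmax quantization, simulated in floating point.
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127) / scale

class FakeQuantLinear(nn.Linear):
    """Pays the quantization cost on every forward pass; no low-bit kernels here."""
    def __init__(self, *args, quant_weight=True, quant_act=True, **kwargs):
        super().__init__(*args, **kwargs)
        self.quant_weight = quant_weight
        self.quant_act = quant_act

    def forward(self, x):
        w = weight_quant(self.weight) if self.quant_weight else self.weight
        x = activation_quant(x) if self.quant_act else x
        return F.linear(x, w, self.bias)

@torch.no_grad()
def bench(layer, x, iters=200):
    for _ in range(20):  # warm-up
        layer(x)
    t0 = time.perf_counter()
    for _ in range(iters):
        layer(x)
    return (time.perf_counter() - t0) / iters

x = torch.randn(32, 4096)
print("no quant   :", bench(FakeQuantLinear(4096, 4096, quant_weight=False, quant_act=False), x))
print("weight only:", bench(FakeQuantLinear(4096, 4096, quant_act=False), x))
print("weight+act :", bench(FakeQuantLinear(4096, 4096), x))
```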

https://github.com/microsoft/BitBLAS
Based on the benchmarks reported there, there is a significant speedup when using INT8xINT8 (packing four 2-bit params into each INT8) for BitLinear. I'm running some tests to verify this. The biggest concern, imo, is the information loss from quantizing both the inputs and the weights in a small model.
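To make the "four 2-bit params per INT8" idea concrete, here is a toy pack/unpack sketch in NumPy (purely illustrative; the 2-bit codes and little-endian layout are my own choices and not BitBLAS's actual storage format):

```python
import numpy as np

def pack_2bit(vals):
    """Pack ternary values {-1, 0, 1} as 2-bit codes, four per byte.

    vals: integer array whose length is a multiple of 4.
    Illustrative codes: -1 -> 0b10, 0 -> 0b00, 1 -> 0b01.
    """
    codes = np.where(vals == -1, 0b10, np.where(vals == 1, 0b01, 0b00)).astype(np.uint8)
    codes = codes.reshape(-1, 4)
    packed = (codes[:, 0]
              | (codes[:, 1] << 2)
              | (codes[:, 2] << 4)
              | (codes[:, 3] << 6))
    return packed.astype(np.uint8)

def unpack_2bit(packed):
    """Inverse of pack_2bit: recover the ternary values from packed bytes."""
    codes = np.stack([(packed >> shift) & 0b11 for shift in (0, 2, 4, 6)], axis=1).reshape(-1)
    return np.where(codes == 0b10, -1, np.where(codes == 0b01, 1, 0)).astype(np.int8)

w = np.random.randint(-1, 2, size=16)           # ternary weights
assert np.array_equal(w, unpack_2bit(pack_2bit(w)))
```

The real speedup then comes from a kernel that consumes the packed bytes directly instead of unpacking them back to full precision, which is what the BitBLAS benchmarks are measuring.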
