How to quantize the Llama-v2-7B-Chat model to INT4

#5
by TaeYeon39 - opened

Hi,
According to the guide (https://github.com/quic/ai-hub-models/tree/main/qai_hub_models/models/llama_v2_7b_chat_quantized),
the sample demo model is an INT8-quantized model (--target_runtime qnn_context_binary --quantize_full_type w8a16 --quantize_io):
pip install "qai_hub_models[llama_v2_7b_chat_quantized]"
python -m qai_hub_models.models.llama_v2_7b_chat_quantized.demo

How do I quantize the Llama-v2-7B-Chat model to INT4 for an Android device?
I want to test it on my Android device, not on AI Hub.

To quantize to INT4, is it right to make the change below?
/qai_hub_models/models/_shared/llama/model.py
return " --target_runtime qnn_context_binary --quantize_full_type w4a16 --quantize_io"

Qualcomm org
edited Jul 9

This model on AI Hub is mostly INT4, with a few layers kept in INT8. This was done to preserve quality and maximize performance. Hence, we specify w8a16 (to cover those INT8 layers) when running it on AI Hub. So this model should work for you (on your device) out of the box, without further quantizing.
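If you just want to run it on your own phone, you can export the model through AI Hub once and then deploy the compiled QNN context binaries to the device. A rough example (the device name is only an illustration, and the exact export flags can vary between qai_hub_models versions):

python -m qai_hub_models.models.llama_v2_7b_chat_quantized.export --device "Samsung Galaxy S23"

The export job produces the context binaries, which you can download and run on-device with the Qualcomm AI Engine Direct (QNN) runtime.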

TaeYeon39 changed discussion status to closed
