
Block-AP (EfficientQAT w/o E2E-QP)

EfficientQAT involves two consecutive training phases: Block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).

In this repo, we provide the quantized checkpoints produced by Block-AP. Anyone can use them to reproduce our results or to build on them in follow-up research.

Performance

| Model | Quantization | WikiText2 PPL | Avg. Accuracy | Model Size (GB) | Hub link |
|---|---|---|---|---|---|
| Llama-2-7B | fp16 | 5.47 | 64.86 | 13.2 | - |
| Llama-2-7B | w4g128 | 5.56 | 64.07 | 3.7 | Link |
| Llama-2-7B | w3g128 | 5.89 | 63.96 | 3.1 | Link |
| Llama-2-7B | w2g64 | 7.65 | 59.54 | 2.3 | Link |
| Llama-2-7B | w2g128 | 7.94 | 58.72 | 2.2 | Link |
| Llama-2-13B | fp16 | 4.88 | 67.81 | 25.4 | - |
| Llama-2-13B | w4g128 | 4.96 | 67.27 | 6.8 | Link |
| Llama-2-13B | w3g128 | 5.20 | 67.30 | 5.6 | Link |
| Llama-2-13B | w2g64 | 6.55 | 63.10 | 4.0 | Link |
| Llama-2-13B | w2g128 | 6.68 | 63.49 | 3.8 | Link |
| Llama-2-70B | fp16 | 3.32 | 72.41 | 131.6 | - |
| Llama-2-70B | w4g128 | 3.41 | 72.54 | 35.8 | Link |
| Llama-2-70B | w3g128 | 3.65 | 71.88 | 29.1 | Link |
| Llama-2-70B | w2g64 | 4.96 | 69.44 | 20.1 | Link |
| Llama-2-70B | w2g128 | 5.26 | 68.73 | 18.9 | Link |
| Llama-3-8B | fp16 | 6.14 | 68.58 | 13.0 | - |
| Llama-3-8B | w4g128 | 6.50 | 68.43 | 5.4 | Link |
| Llama-3-8B | w3g128 | 7.34 | 66.72 | 4.7 | Link |
| Llama-3-8B | w2g64 | 12.47 | 58.65 | 3.9 | Link |
| Llama-3-8B | w2g128 | 13.25 | 58.23 | 3.8 | Link |
| Llama-3-70B | fp16 | 2.85 | 75.33 | 137.8 | - |
| Llama-3-70B | w4g128 | 3.18 | 74.50 | 38.9 | Link |
| Llama-3-70B | w3g128 | 4.88 | 71.90 | 32.2 | Link |
| Llama-3-70B | w2g64 | 13.75 | 66.70 | 23.2 | Link |
| Llama-3-70B | w2g128 | 16.79 | 65.06 | 22.0 | Link |
| Llama-3-8B-Instruct | fp16 | 8.29 | 68.43 | 13.0 | - |
| Llama-3-8B-Instruct | w4g128 | 8.76 | 67.80 | 5.4 | Link |
| Llama-3-8B-Instruct | w3g128 | 9.83 | 66.54 | 4.7 | Link |
| Llama-3-8B-Instruct | w2g64 | 16.77 | 58.62 | 3.9 | Link |
| Llama-3-8B-Instruct | w2g128 | 18.02 | 57.19 | 3.8 | Link |
| Llama-3-70B-Instruct | fp16 | 5.33 | 73.78 | 137.8 | - |
| Llama-3-70B-Instruct | w4g128 | 5.77 | 73.52 | 38.9 | Link |
| Llama-3-70B-Instruct | w3g128 | 7.25 | 69.80 | 32.2 | Link |
| Llama-3-70B-Instruct | w2g64 | 12.48 | 65.60 | 23.2 | Link |
| Llama-3-70B-Instruct | w2g128 | 13.48 | 61.75 | 22.0 | Link |
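The quantization tags follow a w{bits}g{group size} convention: for example, w2g64 denotes 2-bit weights with a quantization group size of 64. As a rough sanity check on the model-size column, the sketch below estimates weight storage for a given bit-width and group size, assuming one fp16 scale and one fp16 zero point per group; the parameter count and overhead model are illustrative assumptions, not the exact packing used by the released checkpoints.

```python
# Rough estimate of quantized weight storage (illustrative assumptions only):
# each weight costs `bits` bits, plus one fp16 scale and one fp16 zero point
# per group of `group_size` weights; fp16 embeddings/heads are ignored.
def estimate_size_gb(num_params: float, bits: int, group_size: int) -> float:
    weight_bytes = num_params * bits / 8
    overhead_bytes = (num_params / group_size) * (2 + 2)  # fp16 scale + fp16 zero point
    return (weight_bytes + overhead_bytes) / 1e9

# Example: ~7e9 parameters (assumed) at w2g64
print(f"{estimate_size_gb(7e9, 2, 64):.1f} GB")  # ~2.2 GB, close to the 2.3 GB in the table
```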

Usage

Please refer to https://github.com/OpenGVLab/EfficientQAT for details. These checkpoints can be used as the starting point for the subsequent E2E-QP, or loaded directly for inference.
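As a minimal sketch of fetching one of these checkpoints for use with the EfficientQAT scripts, the snippet below downloads a repo with huggingface_hub; the repo id shown is a placeholder (substitute the actual Hub link from the table above), and the loading, E2E-QP, and inference commands themselves should follow the documentation in the EfficientQAT repository.

```python
from huggingface_hub import snapshot_download

# Hypothetical repo id -- replace with the actual Hub link from the table above.
local_path = snapshot_download(
    repo_id="ChenMnZ/Llama-2-7b-EfficientQAT-w2g64-BlockAP",
    local_dir="./Llama-2-7b-w2g64-block-ap",
)
print("Checkpoint downloaded to", local_path)
# Pass this path to the EfficientQAT scripts (e.g. for E2E-QP training or evaluation),
# following the commands documented in https://github.com/OpenGVLab/EfficientQAT.
```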