
Block-AP (EfficientQAT w/o E2E-QP)

EfficientQAT involves two consecutive training phases: Block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).

In this repo, we provide the quantized checkpoints of Block-AP. Anyone can use them to reproduce our results or carry out follow-up research.
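
To make the term "quantization parameters" concrete, here is a minimal sketch of uniform quantization applied to one group of weights. It is a generic illustration, not EfficientQAT's implementation: the function name `fake_quantize_group` and the min/max initialization of the step size and zero point are assumptions made for this example.

```python
def fake_quantize_group(weights, bits):
    """Quantize one group of weights to `bits` bits, then map back to floats.

    Hypothetical helper for illustration only (not part of EfficientQAT).
    """
    qmax = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # The step size (scale) and the zero point are the quantization parameters.
    scale = (w_max - w_min) / qmax
    if scale == 0:
        scale = 1.0  # all weights equal; avoid division by zero
    zero_point = round(-w_min / scale)
    dequantized = []
    for w in weights:
        # Round to the nearest integer level and clamp to [0, qmax].
        q = min(max(round(w / scale) + zero_point, 0), qmax)
        dequantized.append((q - zero_point) * scale)
    return dequantized, scale, zero_point


deq, scale, zero_point = fake_quantize_group([-1.0, 0.0, 1.0, 2.0], bits=2)
print(deq, scale, zero_point)
```

In this picture, Block-AP updates the weights together with the per-group step size and zero point, while E2E-QP keeps the quantized weights fixed and trains only the quantization parameters.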

Performance

| Model | Quantization | WikiText2 PPL | Avg. Accuracy | Model Size (GB) | Hub link |
|---|---|---|---|---|---|
| Llama-2-7B | fp16 | 5.47 | 64.86 | 13.2 | - |
| Llama-2-7B | w4g128 | 5.56 | 64.07 | 3.7 | Link |
| Llama-2-7B | w3g128 | 5.89 | 63.96 | 3.1 | Link |
| Llama-2-7B | w2g64 | 7.65 | 59.54 | 2.3 | Link |
| Llama-2-7B | w2g128 | 7.94 | 58.72 | 2.2 | Link |
| Llama-2-13B | fp16 | 4.88 | 67.81 | 25.4 | - |
| Llama-2-13B | w4g128 | 4.96 | 67.27 | 6.8 | Link |
| Llama-2-13B | w3g128 | 5.20 | 67.30 | 5.6 | Link |
| Llama-2-13B | w2g64 | 6.55 | 63.10 | 4.0 | Link |
| Llama-2-13B | w2g128 | 6.68 | 63.49 | 3.8 | Link |
| Llama-2-70B | fp16 | 3.32 | 72.41 | 131.6 | - |
| Llama-2-70B | w4g128 | 3.41 | 72.54 | 35.8 | Link |
| Llama-2-70B | w3g128 | 3.65 | 71.88 | 29.1 | Link |
| Llama-2-70B | w2g64 | 4.96 | 69.44 | 20.1 | Link |
| Llama-2-70B | w2g128 | 5.26 | 68.73 | 18.9 | Link |
| Llama-3-8B | fp16 | 6.14 | 68.58 | 13.0 | - |
| Llama-3-8B | w4g128 | 6.50 | 68.43 | 5.4 | Link |
| Llama-3-8B | w3g128 | 7.34 | 66.72 | 4.7 | Link |
| Llama-3-8B | w2g64 | 12.47 | 58.65 | 3.9 | Link |
| Llama-3-8B | w2g128 | 13.25 | 58.23 | 3.8 | Link |
| Llama-3-70B | fp16 | 2.85 | 75.33 | 137.8 | - |
| Llama-3-70B | w4g128 | 3.18 | 74.50 | 38.9 | Link |
| Llama-3-70B | w3g128 | 4.88 | 71.90 | 32.2 | Link |
| Llama-3-70B | w2g64 | 13.75 | 66.70 | 23.2 | Link |
| Llama-3-70B | w2g128 | 16.79 | 65.06 | 22.0 | Link |
| Llama-3-8B-Instruct | fp16 | 8.29 | 68.43 | 13.0 | - |
| Llama-3-8B-Instruct | w4g128 | 8.76 | 67.80 | 5.4 | Link |
| Llama-3-8B-Instruct | w3g128 | 9.83 | 66.54 | 4.7 | Link |
| Llama-3-8B-Instruct | w2g64 | 16.77 | 58.62 | 3.9 | Link |
| Llama-3-8B-Instruct | w2g128 | 18.02 | 57.19 | 3.8 | Link |
| Llama-3-70B-Instruct | fp16 | 5.33 | 73.78 | 137.8 | - |
| Llama-3-70B-Instruct | w4g128 | 5.77 | 73.52 | 38.9 | Link |
| Llama-3-70B-Instruct | w3g128 | 7.25 | 69.80 | 32.2 | Link |
| Llama-3-70B-Instruct | w2g64 | 12.48 | 65.60 | 23.2 | Link |
| Llama-3-70B-Instruct | w2g128 | 13.48 | 61.75 | 22.0 | Link |
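
The quantization tags above can be read as `w<bits>g<group size>`, so `w4g128` means 4-bit weights with a quantization group size of 128. The short sketch below parses such a tag and computes a compression ratio from the sizes listed for Llama-2-7B. The helper names `parse_quant_tag` and `compression_ratio` are hypothetical and exist only for this example.

```python
import re


def parse_quant_tag(tag):
    """Split a tag such as 'w4g128' into (weight_bits, group_size)."""
    match = re.fullmatch(r"w(\d+)g(\d+)", tag)
    if match is None:
        raise ValueError(f"unrecognized quantization tag: {tag!r}")
    return int(match.group(1)), int(match.group(2))


def compression_ratio(fp16_size_gb, quantized_size_gb):
    """Return how many times smaller the quantized checkpoint is."""
    return fp16_size_gb / quantized_size_gb


bits, group_size = parse_quant_tag("w4g128")
# Sizes taken from the Llama-2-7B rows above: fp16 = 13.2 GB, w4g128 = 3.7 GB.
ratio = compression_ratio(13.2, 3.7)
print(bits, group_size, round(ratio, 2))  # prints: 4 128 3.57
```

The ratio is below the naive 16/4 = 4x because the checkpoint also stores per-group quantization parameters and some tensors that remain in FP16.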

Usage

Please refer to https://github.com/OpenGVLab/EfficientQAT for details. These checkpoints can serve as the starting point for the subsequent E2E-QP phase, or be used directly for inference.
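
As a starting point, the checkpoint files can be fetched with `snapshot_download` from the `huggingface_hub` package, as in the minimal sketch below. The repository id `ChenMnZ/Llama-2-13b-BlockAP-w4g128` is the one named on this page. The helper `block_ap_repo_id` is hypothetical and only assumes that the other checkpoints follow the same naming pattern, so the Hub links in the table remain the authoritative source. Loading the downloaded weights into a model requires the EfficientQAT code and is not shown here.

```python
def block_ap_repo_id(model_name, bits, group_size, owner="ChenMnZ"):
    """Build a repo id by following the naming pattern of this repository.

    Assumption: other Block-AP checkpoints use the same pattern.
    """
    return f"{owner}/{model_name}-BlockAP-w{bits}g{group_size}"


def download_checkpoint(repo_id, local_dir=None):
    """Download every file of a Hub repository and return the local path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


repo_id = block_ap_repo_id("Llama-2-13b", bits=4, group_size=128)
print(repo_id)
# Uncomment to download (requires network access and `pip install huggingface_hub`):
# path = download_checkpoint(repo_id)
```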

Safetensors model size: 2.03B params; tensor types: I32, FP16

Collection including ChenMnZ/Llama-2-13b-BlockAP-w4g128