license: mit
datasets:
  - wikitext

gpt2-xl quantized to 4-bit using AutoGPTQ.

To use, first install AutoGPTQ:

pip install auto-gptq

Then load the model from the hub:

from auto_gptq import AutoGPTQForCausalLM

model_name = "smpanaro/gpt2-xl-AutoGPTQ-4bit-128g"
# Note: although this model was quantized with grouping only (desc_act=False),
# Triton still seems to be required.
model = AutoGPTQForCausalLM.from_quantized(model_name, use_triton=True)
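
Once the model is loaded, text generation might look like the sketch below. The helper name `generate_text`, the default `device`, and the choice of tokenizer are assumptions rather than part of this repo; the stock `gpt2-xl` tokenizer is suggested because quantization does not change the vocabulary. AutoGPTQ's `generate` wrapper takes keyword arguments only, so the encoded inputs are unpacked with `**`.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=30, device="cuda:0"):
    """Generate a continuation of `prompt` (hypothetical helper, not part of AutoGPTQ)."""
    # Tokenize the prompt and move the tensors to the device the model lives on.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Pass everything by keyword; the AutoGPTQ wrapper forwards kwargs to the underlying model.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode the first (and only) sequence in the batch back to text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, with `tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")` from `transformers` and the `model` loaded above, `generate_text(model, tokenizer, "The quick brown fox")` should return the prompt followed by up to 30 generated tokens.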

| Model | 4-Bit Perplexity | 16-Bit Perplexity | Delta |
| --- | --- | --- | --- |
| smpanaro/gpt2-AutoGPTQ-4bit-128g | 26.5000 | 25.1875 | 1.3125 |
| smpanaro/gpt2-medium-AutoGPTQ-4bit-128g | 19.1719 | 18.4739 | 0.6980 |
| smpanaro/gpt2-large-AutoGPTQ-4bit-128g | 16.6875 | 16.4541 | 0.2334 |
| smpanaro/gpt2-xl-AutoGPTQ-4bit-128g | 14.9297 | 14.7951 | 0.1346 |

Wikitext perplexity, measured as in the Hugging Face docs. Lower is better.
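
For reference, the sliding-window evaluation described in the Hugging Face docs can be sketched roughly as follows. This is an approximation under stated assumptions, not the exact script behind the numbers above: the window size (1024, GPT-2's context length), the stride (512), the `device` default, and both function names are illustrative choices.

```python
import math


def sliding_windows(seq_len, max_length, stride):
    """Yield (begin, end, n_scored) for overlapping evaluation windows.

    Only the last n_scored tokens of each window count toward the loss;
    the earlier tokens serve as context and were scored by a previous window.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break


def perplexity(model, input_ids, max_length=1024, stride=512, device="cuda:0"):
    """Approximate perplexity of a causal LM over one long token sequence.

    input_ids is a tensor of shape (1, seq_len). Averages the per-window
    losses, which is a simplification of exact token-level averaging.
    """
    import torch  # imported lazily so sliding_windows works without torch installed

    losses = []
    seq_len = input_ids.size(1)
    for begin, end, n_scored in sliding_windows(seq_len, max_length, stride):
        ids = input_ids[:, begin:end].to(device)
        labels = ids.clone()
        labels[:, :-n_scored] = -100  # -100 marks context-only tokens as ignored by the loss
        with torch.no_grad():
            losses.append(model(ids, labels=labels).loss.item())
    return math.exp(sum(losses) / len(losses))
```

To use it, tokenize the Wikitext test split as one long string (the Hugging Face docs join the examples with `"\n\n"`) and pass the resulting `input_ids` tensor together with the loaded model.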