---
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- gptq
- auto-gptq
- quantized
---
# stablelm-tuned-alpha-3b-gptq-4bit-128g
This is a quantized model saved with `auto-gptq`. At the time of writing, you cannot load such models directly from the Hub; you will need to clone this repo and load it from a local directory.
See the excerpt from the AutoGPTQ tutorial below for general usage details.
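
As a quick start for this particular model, here is a minimal loading sketch. It assumes the repo has been cloned (e.g. with `git clone` plus `git lfs`) into a local folder named `stablelm-tuned-alpha-3b-gptq-4bit-128g`; the folder name, the presence of tokenizer files, and the `use_safetensors` flag are assumptions, so adjust them to match your local copy:

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# local path to the cloned repo (assumed folder name; adjust as needed)
local_dir = "stablelm-tuned-alpha-3b-gptq-4bit-128g"

# tokenizer files are assumed to ship alongside the quantized weights
tokenizer = AutoTokenizer.from_pretrained(local_dir)

# set use_safetensors=False if the repo only ships a .bin checkpoint
model = AutoGPTQForCausalLM.from_quantized(
    local_dir,
    device="cuda:0",
    use_safetensors=True,
    use_triton=False,
)
```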
## Auto-GPTQ Quick Start
### Quick Installation
Starting from v0.0.4, you can install `auto-gptq` directly from PyPI using `pip`:

```shell
pip install auto-gptq
```
AutoGPTQ supports using `triton` to speed up inference, but triton currently only supports Linux. To integrate with triton, use:

```shell
pip install auto-gptq[triton]
```
If you want to try the newly supported `llama` type models in 🤗 Transformers without updating it to the latest version, use:

```shell
pip install auto-gptq[llama]
```
By default, the CUDA extension will be built at installation time if CUDA and pytorch are already installed.

To disable building the CUDA extension, you can use the following commands:

For Linux:

```shell
BUILD_CUDA_EXT=0 pip install auto-gptq
```

For Windows:

```shell
set BUILD_CUDA_EXT=0 && pip install auto-gptq
```
### Basic Usage
The full script for the basic usage demonstrated here is `examples/quantization/basic_usage.py`.

The two main classes currently used in AutoGPTQ are `AutoGPTQForCausalLM` and `BaseQuantizeConfig`.

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
```
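
For reference, `BaseQuantizeConfig` is where the quantization settings live. A minimal sketch matching this card's 4-bit / group-size-128 setup (only `bits` and `group_size` are set here; everything else is left at its defaults):

```python
quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4-bit
    group_size=128,  # group size used when quantizing
)
```

When quantizing a model, this object is passed to `AutoGPTQForCausalLM.from_pretrained`, and its values end up in the `quantize_config.json` saved alongside the quantized weights.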
#### Load quantized model and do inference
Instead of `.from_pretrained`, you should use `.from_quantized` to load a quantized model.

```python
quantized_model_dir = "opt-125m-4bit-128g"
device = "cuda:0"
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device=device, use_triton=False, use_safetensors=True)
```
This will first read and load `quantize_config.json` in the `opt-125m-4bit-128g` directory, then, based on the values of `bits` and `group_size` in it, load the `gptq_model-4bit-128g.bin` model file onto the first GPU.
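
If you want to check those values yourself, the file is plain JSON; a quick sketch (only the `bits` and `group_size` keys are assumed here, the file may contain other options too):

```python
import json

with open(f"{quantized_model_dir}/quantize_config.json") as f:
    cfg = json.load(f)

print(cfg["bits"], cfg["group_size"])  # e.g. 4 128
```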
Then you can initialize 🤗 Transformers' `TextGenerationPipeline` and do inference.

```python
from transformers import AutoTokenizer, TextGenerationPipeline

# tokenizer files are assumed to be available in the quantized model directory
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
print(pipeline("auto-gptq is")[0]["generated_text"])
```
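
If you prefer to skip the pipeline, a direct `generate` call works too (a minimal sketch; the prompt and the `max_new_tokens` value are just illustrative):

```python
inputs = tokenizer("auto-gptq is", return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```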
### Conclusion

Congrats! You have learned how to quickly install `auto-gptq` and integrate with it. In the next chapter, you will learn advanced loading strategies for pretrained or quantized models and some best practices for different situations.