---
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- gptq
- auto-gptq
- quantized
---
# stablelm-tuned-alpha-3b-gptq-4bit-128g
This is a quantized model saved with `auto-gptq`. At the time of writing, you cannot load such models directly from the Hub; you will need to clone this repo and load it from a local directory.
See the excerpt from the AutoGPTQ tutorial below for general usage details.
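
As a quick start for this particular model, here is a minimal loading sketch. It assumes the repo has been cloned (e.g. with `git clone` plus `git lfs`) into a local folder named `stablelm-tuned-alpha-3b-gptq-4bit-128g`; the folder name, the presence of tokenizer files, and the `use_safetensors` flag are assumptions, so adjust them to match your local copy:

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# local path to the cloned repo (assumed folder name; adjust as needed)
local_dir = "stablelm-tuned-alpha-3b-gptq-4bit-128g"

# tokenizer files are assumed to ship alongside the quantized weights
tokenizer = AutoTokenizer.from_pretrained(local_dir)

# set use_safetensors=False if the repo only ships a .bin checkpoint
model = AutoGPTQForCausalLM.from_quantized(
    local_dir,
    device="cuda:0",
    use_safetensors=True,
    use_triton=False,
)
```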
## Auto-GPTQ Quick Start
### Quick Installation
Starting from v0.0.4, you can install `auto-gptq` directly from PyPI using `pip`:

```shell
pip install auto-gptq
```
AutoGPTQ supports using `triton` to speed up inference, but triton currently only supports Linux. To integrate with triton, use:

```shell
pip install auto-gptq[triton]
```
If you want to try the newly supported `llama` type models in 🤗 Transformers without updating it to the latest version, use:

```shell
pip install auto-gptq[llama]
```
By default, the CUDA extension will be built at installation time if CUDA and pytorch are already installed.

To disable building the CUDA extension, you can use the following commands:

For Linux:

```shell
BUILD_CUDA_EXT=0 pip install auto-gptq
```

For Windows:

```shell
set BUILD_CUDA_EXT=0 && pip install auto-gptq
```
### Basic Usage
The full script for the basic usage demonstrated here is `examples/quantization/basic_usage.py`.

The two main classes currently used in AutoGPTQ are `AutoGPTQForCausalLM` and `BaseQuantizeConfig`.

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
```
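
For reference, `BaseQuantizeConfig` is where the quantization settings live. A minimal sketch matching this card's 4-bit / group-size-128 setup (only `bits` and `group_size` are set here; everything else is left at its defaults):

```python
quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4-bit
    group_size=128,  # group size used when quantizing
)
```

When quantizing a model, this object is passed to `AutoGPTQForCausalLM.from_pretrained`, and its values end up in the `quantize_config.json` saved alongside the quantized weights.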
#### Load quantized model and do inference
Instead of `.from_pretrained`, you should use `.from_quantized` to load a quantized model.

```python
quantized_model_dir = "opt-125m-4bit-128g"
device = "cuda:0"
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device=device, use_triton=False, use_safetensors=True)
```
This will first read and load `quantize_config.json` in the `opt-125m-4bit-128g` directory, then, based on the values of `bits` and `group_size` in it, load the `gptq_model-4bit-128g.bin` model file onto the first GPU.
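
If you want to check those values yourself, the file is plain JSON; a quick sketch (only the `bits` and `group_size` keys are assumed here, the file may contain other options too):

```python
import json

with open(f"{quantized_model_dir}/quantize_config.json") as f:
    cfg = json.load(f)

print(cfg["bits"], cfg["group_size"])  # e.g. 4 128
```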
Then you can initialize 🤗 Transformers' `TextGenerationPipeline` and do inference.

```python
from transformers import AutoTokenizer, TextGenerationPipeline

# tokenizer files are assumed to be available in the quantized model directory
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
print(pipeline("auto-gptq is")[0]["generated_text"])
```
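
If you prefer to skip the pipeline, a direct `generate` call works too (a minimal sketch; the prompt and the `max_new_tokens` value are just illustrative):

```python
inputs = tokenizer("auto-gptq is", return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```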
### Conclusion

Congrats! You have learned how to quickly install `auto-gptq` and integrate with it. In the next chapter, you will learn advanced loading strategies for pretrained or quantized models and some best practices for different situations.