---
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- gptq
- auto-gptq
- quantized
---

# stablelm-tuned-alpha-3b-gptq-4bit-128g

This is a quantized model saved with [auto-gptq](https://github.com/PanQiWei/AutoGPTQ). At the time of writing, you cannot load quantized models directly from the Hugging Face Hub; you need to clone this repo and load it from the local directory.

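
In practice that looks roughly like the sketch below. This is a minimal, unofficial example: the local directory name is a placeholder for wherever you cloned this repo, and it assumes the clone was made with `git lfs` so the weight files are actually present.

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Placeholder path: point this at your local clone of this repository.
local_dir = "./stablelm-tuned-alpha-3b-gptq-4bit-128g"

# Assumes the tokenizer files are included alongside the quantized weights.
tokenizer = AutoTokenizer.from_pretrained(local_dir)

# Load the 4-bit, group-size-128 GPTQ weights onto the first GPU.
model = AutoGPTQForCausalLM.from_quantized(local_dir, device="cuda:0", use_safetensors=True)
```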

See the [excerpt from the tutorial](https://github.com/PanQiWei/AutoGPTQ/blob/main/docs/tutorial/01-Quick-Start.md) below for details.

---

# Auto-GPTQ Quick Start

## Quick Installation

Starting from v0.0.4, you can install `auto-gptq` directly from PyPI using `pip`:

```shell
pip install auto-gptq
```

AutoGPTQ supports using `triton` to speed up inference, but `triton` currently **only supports Linux**. To install with triton support, use:

```shell
pip install auto-gptq[triton]
```

If you want to try the newly supported `llama` type models without updating 🤗 Transformers to the latest version, use:

```shell
pip install auto-gptq[llama]
```

By default, the CUDA extension is built at installation time if CUDA and PyTorch are already installed.

To disable building the CUDA extension, use the following commands:

For Linux:

```shell
BUILD_CUDA_EXT=0 pip install auto-gptq
```

For Windows:

```shell
set BUILD_CUDA_EXT=0 && pip install auto-gptq
```

## Basic Usage

*The full script for the basic usage demonstrated here is `examples/quantization/basic_usage.py`.*

The two main classes currently used in AutoGPTQ are `AutoGPTQForCausalLM` and `BaseQuantizeConfig`.

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
```
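
This excerpt jumps straight to loading an already-quantized model, but for context, `BaseQuantizeConfig` is what describes the quantization settings when you produce a quantized model yourself. The following is only a rough sketch of that flow: the base model name and the single calibration sentence are placeholders, and a real run needs a proper calibration set.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_name = "facebook/opt-125m"   # placeholder base model
quantized_model_dir = "opt-125m-4bit-128g"    # where the quantized weights will be saved

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
# Toy calibration data just to show the expected shape of the examples.
examples = [tokenizer("auto-gptq is an easy-to-use model quantization library.", return_tensors="pt")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)  # 4-bit weights, quantization groups of 128

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_name, quantize_config)
model.quantize(examples)  # run GPTQ calibration
model.save_quantized(quantized_model_dir, use_safetensors=True)
```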

### Load quantized model and do inference

Instead of `.from_pretrained`, you should use `.from_quantized` to load a quantized model.

```python
quantized_model_dir = "opt-125m-4bit-128g"  # local directory containing the quantized model
device = "cuda:0"
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device=device, use_triton=False, use_safetensors=True)
```

This first reads `quantize_config.json` from the `opt-125m-4bit-128g` directory, then, based on the values of `bits` and `group_size` in it, loads the `gptq_model-4bit-128g.bin` model file (or its `.safetensors` counterpart when `use_safetensors=True`) onto the first GPU.

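
If you want to check those values yourself, the config is plain JSON and can be inspected directly (a trivial example; the directory name is the tutorial's placeholder):

```python
import json

# Peek at the quantization settings stored next to the model weights.
with open("opt-125m-4bit-128g/quantize_config.json") as f:
    print(json.load(f))  # expect entries such as "bits": 4 and "group_size": 128
```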

Then you can initialize 🤗 Transformers' `TextGenerationPipeline` and run inference.

```python
from transformers import AutoTokenizer, TextGenerationPipeline

# Load the tokenizer (here assumed to be saved in the same directory as the quantized model).
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
print(pipeline("auto-gptq is")[0]["generated_text"])
```
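
Alternatively, you can skip the pipeline and call `generate` on the model directly, reusing the `model`, `tokenizer`, and `device` from above:

```python
# Tokenize a prompt, move it to the model's device, and generate a continuation.
inputs = tokenizer("auto-gptq is", return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```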

## Conclusion

Congrats! You have learned how to quickly install `auto-gptq` and integrate it into your code. In the next chapter, you will learn about advanced loading strategies for pretrained or quantized models and some best practices for different situations.