---
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- gptq
- auto-gptq
- quantized
---

# stablelm-tuned-alpha-3b-gptq-4bit-128g

This is a quantized model saved with [auto-gptq](https://github.com/PanQiWei/AutoGPTQ). At the time of writing, you cannot load auto-gptq models directly from the Hugging Face Hub; you need to clone this repo and load the model from the local directory.
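For example, a minimal sketch of fetching the files with `huggingface_hub` (this is just one way to get a local copy; `<namespace>` is a placeholder for the repo owner, and a plain `git clone` with `git-lfs` works as well):

```python
# Sketch only: "<namespace>" is a placeholder for the actual repo owner on the Hub.
from huggingface_hub import snapshot_download

# Download the repo files (weights, tokenizer, quantize_config.json) to a local directory.
local_dir = snapshot_download(repo_id="<namespace>/stablelm-tuned-alpha-3b-gptq-4bit-128g")
print(local_dir)  # pass this path to AutoGPTQForCausalLM.from_quantized (see below)
```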

See the below [excerpt from the tutorial](https://github.com/PanQiWei/AutoGPTQ/blob/main/docs/tutorial/01-Quick-Start.md) for details.

---

# Auto-GPTQ Quick Start

## Quick Installation

Starting from v0.0.4, you can install `auto-gptq` directly from PyPI using `pip`:
```shell
pip install auto-gptq
```

AutoGPTQ supports using `triton` to speed up inference, but it currently **only supports Linux**. To enable the triton integration, use:
```shell
pip install auto-gptq[triton]
```

If you want to try the newly supported `llama` type models in 🤗 Transformers without updating it to the latest version, use:
```shell
pip install auto-gptq[llama]
```

By default, the CUDA extension will be built at installation time if CUDA and PyTorch are already installed.

To disable building the CUDA extension, you can use the following commands:

For Linux
```shell
BUILD_CUDA_EXT=0 pip install auto-gptq
```
For Windows
```shell
set BUILD_CUDA_EXT=0 && pip install auto-gptq
```

## Basic Usage
*The full script for the basic usage demonstrated here is `examples/quantization/basic_usage.py` in the AutoGPTQ repo.*

The two main classes currently used in AutoGPTQ are `AutoGPTQForCausalLM` and `BaseQuantizeConfig`.
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
```
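For context, a minimal sketch of the configuration that corresponds to this model's `4bit-128g` quantization (only `bits` and `group_size` are shown; other fields keep their defaults):

```python
# 4-bit weights, quantization parameters shared across groups of 128 weights ("4bit-128g")
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
)
```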
### Load the quantized model and run inference

Instead of `.from_pretrained`, you should use `.from_quantized` to load a quantized model.
```python
# path to the locally cloned (or downloaded) repo containing the quantized weights
quantized_model_dir = "stablelm-tuned-alpha-3b-gptq-4bit-128g"

device = "cuda:0"
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, use_triton=False, use_safetensors=True)
```
This will first read and load `quantize_config.json` from `quantized_model_dir`, then, based on the values of `bits` and `group_size` in it, load the `gptq_model-4bit-128g` weights (a `.safetensors` file here, since `use_safetensors=True`) onto the first GPU.
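To see what the loader will read, you can inspect the config file yourself (a sketch; the path assumes a local clone of this repo, and only the `bits` and `group_size` keys are assumed to be present):

```python
import json

# quantize_config.json is read by from_quantized before the weights are loaded
with open("stablelm-tuned-alpha-3b-gptq-4bit-128g/quantize_config.json") as f:
    cfg = json.load(f)

print(cfg["bits"], cfg["group_size"])  # expected 4 and 128 for this 4bit-128g model
```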

Then you can initialize 🤗 Transformers' `TextGenerationPipeline` and do inference.
```python
from transformers import AutoTokenizer, TextGenerationPipeline

# load the tokenizer (assumed to ship in the same repo as the quantized weights)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
print(pipeline("auto-gptq is")[0]["generated_text"])
```
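Alternatively, you can skip the pipeline and call `generate` directly, along the lines of the tutorial's example (the prompt and the `max_new_tokens` value here are arbitrary):

```python
# encode a prompt, move it to the model's device, generate, and decode
inputs = tokenizer("auto-gptq is", return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0]))
```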

## Conclusion
Congrats! You have learned how to quickly install `auto-gptq` and integrate with it. In the next chapter, you will learn advanced loading strategies for pretrained or quantized models and some best practices for different situations.