pszemraj's picture
Update README.md
62b3563
|
raw
history blame
2.64 kB
---
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- gptq
- auto-gptq
- quantized
---
# stablelm-tuned-alpha-3b-gptq-4bit-128g
This is a quantized model saved with [auto-gptq](https://github.com/PanQiWei/AutoGPTQ). At time of writing, you cannot directly load models from the hub, but will need to clone the repo and load locally.
See the below [excerpt from the tutorial](https://github.com/PanQiWei/AutoGPTQ/blob/main/docs/tutorial/01-Quick-Start.md) for details.
---
# Auto-GPTQ Quick Start
## Quick Installation
Start from v0.0.4, one can install `auto-gptq` directly from pypi using `pip`:
```shell
pip install auto-gptq
```
AutoGPTQ supports using `triton` to speedup inference, but it currently **only supports Linux**. To integrate triton, using:
```shell
pip install auto-gptq[triton]
```
For some people who want to try the newly supported `llama` type models in 🤗 Transformers but not update it to the latest version, using:
```shell
pip install auto-gptq[llama]
```
By default, CUDA extension will be built at installation if CUDA and pytorch are already installed.
To disable building CUDA extension, you can use the following commands:
For Linux
```shell
BUILD_CUDA_EXT=0 pip install auto-gptq
```
For Windows
```shell
set BUILD_CUDA_EXT=0 && pip install auto-gptq
```
## Basic Usage
*The full script of basic usage demonstrated here is `examples/quantization/basic_usage.py`*
The two main classes currently used in AutoGPTQ are `AutoGPTQForCausalLM` and `BaseQuantizeConfig`.
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
```
### Load quantized model and do inference
Instead of `.from_pretrained`, you should use `.from_quantized` to load a quantized model.
```python
device = "cuda:0"
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, use_triton=False, use_safetensors=True)
```
This will first read and load `quantize_config.json` in `opt-125m-4bit-128g` directory, then based on the values of `bits` and `group_size` in it, load `gptq_model-4bit-128g.bin` model file into the first GPU.
Then you can initialize 🤗 Transformers' `TextGenerationPipeline` and do inference.
```python
from transformers import TextGenerationPipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
print(pipeline("auto-gptq is")[0]["generated_text"])
```
## Conclusion
Congrats! You learned how to quickly install `auto-gptq` and integrate with it. In the next chapter, you will learn the advanced loading strategies for pretrained or quantized model and some best practices on different situations.