---
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- gptq
- auto-gptq
- quantized
---

# stablelm-tuned-alpha-3b-gptq-4bit-128g

This is a quantized model saved with [auto-gptq](https://github.com/PanQiWei/AutoGPTQ). At the time of writing, you cannot load quantized models directly from the Hugging Face Hub; you need to clone this repo and load it from the local directory.

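
In practice that looks roughly like the sketch below. This is a minimal, unofficial example: the local directory name is a placeholder for wherever you cloned this repo, and it assumes the clone was made with `git lfs` so the weight files are actually present.

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Placeholder path: point this at your local clone of this repository.
local_dir = "./stablelm-tuned-alpha-3b-gptq-4bit-128g"

# Assumes the tokenizer files are included alongside the quantized weights.
tokenizer = AutoTokenizer.from_pretrained(local_dir)

# Load the 4-bit, group-size-128 GPTQ weights onto the first GPU.
model = AutoGPTQForCausalLM.from_quantized(local_dir, device="cuda:0", use_safetensors=True)
```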

See the [excerpt from the tutorial](https://github.com/PanQiWei/AutoGPTQ/blob/main/docs/tutorial/01-Quick-Start.md) below for details.

---

# Auto-GPTQ Quick Start

## Quick Installation

Starting from v0.0.4, you can install `auto-gptq` directly from PyPI using `pip`:

```shell
pip install auto-gptq
```

AutoGPTQ supports using `triton` to speed up inference, but `triton` currently **only supports Linux**. To install with triton support, use:

```shell
pip install auto-gptq[triton]
```

If you want to try the newly supported `llama` type models without updating 🤗 Transformers to the latest version, use:

```shell
pip install auto-gptq[llama]
```

By default, the CUDA extension is built at installation time if CUDA and PyTorch are already installed.

To disable building the CUDA extension, use the following commands:

For Linux:

```shell
BUILD_CUDA_EXT=0 pip install auto-gptq
```

For Windows:

```shell
set BUILD_CUDA_EXT=0 && pip install auto-gptq
```

## Basic Usage

*The full script for the basic usage demonstrated here is `examples/quantization/basic_usage.py`.*

The two main classes currently used in AutoGPTQ are `AutoGPTQForCausalLM` and `BaseQuantizeConfig`.

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
```
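
This excerpt jumps straight to loading an already-quantized model, but for context, `BaseQuantizeConfig` is what describes the quantization settings when you produce a quantized model yourself. The following is only a rough sketch of that flow: the base model name and the single calibration sentence are placeholders, and a real run needs a proper calibration set.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_name = "facebook/opt-125m"   # placeholder base model
quantized_model_dir = "opt-125m-4bit-128g"    # where the quantized weights will be saved

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
# Toy calibration data just to show the expected shape of the examples.
examples = [tokenizer("auto-gptq is an easy-to-use model quantization library.", return_tensors="pt")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)  # 4-bit weights, quantization groups of 128

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_name, quantize_config)
model.quantize(examples)  # run GPTQ calibration
model.save_quantized(quantized_model_dir, use_safetensors=True)
```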

### Load quantized model and do inference

Instead of `.from_pretrained`, you should use `.from_quantized` to load a quantized model.

```python
quantized_model_dir = "opt-125m-4bit-128g"  # local directory containing the quantized model
device = "cuda:0"
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device=device, use_triton=False, use_safetensors=True)
```

This first reads `quantize_config.json` from the `opt-125m-4bit-128g` directory, then, based on the values of `bits` and `group_size` in it, loads the `gptq_model-4bit-128g.bin` model file (or its `.safetensors` counterpart when `use_safetensors=True`) onto the first GPU.

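
If you want to check those values yourself, the config is plain JSON and can be inspected directly (a trivial example; the directory name is the tutorial's placeholder):

```python
import json

# Peek at the quantization settings stored next to the model weights.
with open("opt-125m-4bit-128g/quantize_config.json") as f:
    print(json.load(f))  # expect entries such as "bits": 4 and "group_size": 128
```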

Then you can initialize 🤗 Transformers' `TextGenerationPipeline` and run inference.

```python
from transformers import AutoTokenizer, TextGenerationPipeline

# Load the tokenizer (here assumed to be saved in the same directory as the quantized model).
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
print(pipeline("auto-gptq is")[0]["generated_text"])
```
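
Alternatively, you can skip the pipeline and call `generate` on the model directly, reusing the `model`, `tokenizer`, and `device` from above:

```python
# Tokenize a prompt, move it to the model's device, and generate a continuation.
inputs = tokenizer("auto-gptq is", return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```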

## Conclusion

Congrats! You have learned how to quickly install `auto-gptq` and integrate it into your code. In the next chapter, you will learn about advanced loading strategies for pretrained or quantized models and some best practices for different situations.