# GPTQ 4bit Inference

Support GPTQ 4bit inference with [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa).

1. Windows users: use the `old-cuda` branch.
2. Linux users: the `fastest-inference-4bit` branch is recommended.
## Install

Setup environment:

```bash
# cd /path/to/FastChat
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git repositories/GPTQ-for-LLaMa
cd repositories/GPTQ-for-LLaMa
# Windows users should use the `old-cuda` branch
git switch fastest-inference-4bit
# Install the `quant-cuda` package into FastChat's virtualenv
python3 setup_cuda.py install
pip3 install texttable
```
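
After installation, it can be worth verifying that the CUDA kernel extension imports cleanly before launching FastChat. A minimal sanity check, assuming the extension built by `setup_cuda.py` is exposed as the Python module `quant_cuda`:

```bash
# Quick sanity check: the quantization kernel should import without errors.
# (Assumption: the extension module is named `quant_cuda`.)
python3 -c "import quant_cuda; print('quant_cuda is installed')"
```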
Chat with the CLI:

```bash
python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-wbits 4 \
    --gptq-groupsize 128
```
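
Note that `--gptq-wbits` and `--gptq-groupsize` should match how the checkpoint was quantized; the model used here was quantized to 4 bits with a group size of 128, as the `4bit-128g` suffix in its name indicates.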
Start model worker:

```bash
# Download the quantized model from Hugging Face
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g models/vicuna-7B-1.1-GPTQ-4bit-128g

python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-wbits 4 \
    --gptq-groupsize 128

# You can also point to a specific quantized checkpoint and enable act-order
python3 -m fastchat.serve.model_worker \
    --model-path models/vicuna-7B-1.1-GPTQ-4bit-128g \
    --gptq-ckpt models/vicuna-7B-1.1-GPTQ-4bit-128g/vicuna-7B-1.1-GPTQ-4bit-128g.safetensors \
    --gptq-wbits 4 \
    --gptq-groupsize 128 \
    --gptq-act-order
```
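
The model worker serves requests through a FastChat controller rather than directly, so to chat with it from the web UI the controller and the Gradio web server also need to be running. A minimal sketch, assuming the standard FastChat entry points and default ports:

```bash
# Start the controller that workers register with (run this before the worker).
python3 -m fastchat.serve.controller

# In another terminal, start the Gradio web UI, which routes chats via the controller.
python3 -m fastchat.serve.gradio_web_server
```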
## Benchmark

| LLaMA-13B | branch                 | Bits | group-size | memory (MiB) | PPL (c4) | Median (s/token) | act-order | speed up |
| --------- | ---------------------- | ---- | ---------- | ------------ | -------- | ---------------- | --------- | -------- |
| FP16      | fastest-inference-4bit | 16   | -          | 26634        | 6.96     | 0.0383           | -         | 1x       |
| GPTQ      | triton                 | 4    | 128        | 8590         | 6.97     | 0.0551           | -         | 0.69x    |
| GPTQ      | fastest-inference-4bit | 4    | 128        | 8699         | 6.97     | 0.0429           | true      | 0.89x    |
| GPTQ      | fastest-inference-4bit | 4    | 128        | 8699         | 7.03     | 0.0287           | false     | 1.33x    |
| GPTQ      | fastest-inference-4bit | 4    | -1         | 8448         | 7.12     | 0.0284           | false     | 1.44x    |
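
Speed up is reported relative to the FP16 baseline (1x): rows with a lower median per-token latency than FP16 show a speed up above 1x.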