# MPT-7b-8k-chat

This model was originally released under CC-BY-NC-SA-4.0; the AWQ framework itself is MIT-licensed.

The original model can be found at [https://huggingface.co/mosaicml/mpt-7b-8k-chat](https://huggingface.co/mosaicml/mpt-7b-8k-chat).

## ⚡ 4-bit Inference Speed

Benchmarks were run on machines rented from RunPod; speed may vary depending on both the GPU and the CPU.

H100:
- CUDA 12.0, Driver 525.105.17: 92 tokens/s (10.82 ms/token)

RTX 4090 + Intel i9 13900K (2 different VMs):
- CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
- CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)

RTX 4090 + AMD EPYC 7-Series (3 different VMs):
- CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
- CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
- CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)

A6000 (2 different VMs):
- CUDA 12.0, Driver 525.105.17: 61 tokens/s (16.31 ms/token)
- CUDA 12.1, Driver 530.30.02: 46 tokens/s (21.79 ms/token)

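The two figures for each machine are reciprocals of one another: ms/token = 1000 / (tokens/s). As a quick sketch of that cross-check, the snippet below (my own illustration, with the tokens/s values hard-coded from the table above) rederives the latency column; small mismatches against the table are just rounding in the reported numbers:

```sh
# Derive ms/token from each reported tokens/s figure in the table above.
python3 - <<'EOF'
for tok_s in (92, 134, 117, 53, 56, 55, 61, 46):
    print(f"{tok_s:4d} tokens/s -> {1000 / tok_s:.2f} ms/token")
EOF
```
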
## How to run

Install [AWQ](https://github.com/mit-han-lab/llm-awq):

```sh
# Clone the AWQ repo, install the Python package, then build its CUDA kernels.
git clone https://github.com/mit-han-lab/llm-awq && \
cd llm-awq && \
pip3 install -e . && \
cd awq/kernels && \
python3 setup.py install && \
cd ../.. && \
pip3 install einops
```

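Before moving on, it can be worth confirming the install succeeded. The check below is my own suggestion rather than part of the AWQ docs; it only verifies that the `awq` package imports and that PyTorch can see a CUDA device:

```sh
# Sanity check (not from the AWQ docs): the awq package should import
# cleanly, and PyTorch should report at least one visible CUDA device.
python3 -c "import awq, torch; print('CUDA available:', torch.cuda.is_available())"
```
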
Run:

```sh
# Paths assume the llm-awq checkout lives at /workspace/llm-awq and that you
# run this from inside it; adjust to your own checkout location.
hfuser="casperhansen"
model_name="mpt-7b-8k-chat-awq"
group_size=128
repo_path="$hfuser/$model_name"
model_path="/workspace/llm-awq/$model_name"
quantized_model_path="/workspace/llm-awq/$model_name/$model_name-w4-g$group_size.pt"

# Download the quantized weights (requires git-lfs, otherwise the .pt
# checkpoint arrives as a small pointer file instead of the real weights).
git clone https://huggingface.co/$repo_path

python3 tinychat/demo.py --model_type mpt \
    --model_path $model_path \
    --q_group_size $group_size \
    --load_quant $quantized_model_path \
    --precision W4A16
```

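For context, `--precision W4A16` selects 4-bit weights with 16-bit (fp16) activations, and `--q_group_size` must match the group size the checkpoint was quantized with (encoded in its filename as `w4-g128`). As a rough back-of-the-envelope figure of my own, not from the model card: at 4 bits per weight, the 7B model needs about 7e9 parameters × 0.5 bytes ≈ 3.5 GB for weights alone, which is why it fits comfortably on every GPU in the benchmark table above.
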
## Citation

Please cite this model using the following format:

```
@online{MosaicML2023Introducing,
    author  = {MosaicML NLP Team},
    title   = {Introducing MPT-30B: Raising the bar for open-source foundation models},
    year    = {2023},
    url     = {https://www.mosaicml.com/blog/mpt-30b},
    note    = {Accessed: 2023-06-22},
    urldate = {2023-06-22}
}
```