---
{}
---
# MPT-7b-8k-chat
This model was originally released under the CC-BY-NC-SA-4.0 license; the AWQ framework is MIT licensed.
Original model can be found at [https://huggingface.co/mosaicml/mpt-7b-8k-chat](https://huggingface.co/mosaicml/mpt-7b-8k-chat).
## ⚡ 4-bit Inference Speed
This was tested on RunPod. Speed varies across machines; I have not been able to reproduce 117 tokens/s consistently on an RTX 4090 yet.
H100:
- CUDA 12.0, Driver 525.105.17: 92 tokens/s (10.82 ms/token)
RTX 4090 (4 different VMs):
- CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)
- CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
- CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
- CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)
A6000 (2 different VMs):
- CUDA 12.0, Driver 525.105.17: 61 tokens/s (16.31 ms/token)
- CUDA 12.1, Driver 530.30.02: 46 tokens/s (21.79 ms/token)
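Results vary with GPU, driver, and CUDA version. To compare your own environment against the runs above, you can query both with standard tooling:
```sh
# Print GPU name and driver version, then the CUDA toolkit version,
# to compare against the benchmark configurations listed above.
nvidia-smi --query-gpu=name,driver_version --format=csv
nvcc --version
```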
## How to run
Install [AWQ](https://github.com/mit-han-lab/llm-awq):
```sh
git clone https://github.com/mit-han-lab/llm-awq && \
cd llm-awq && \
pip3 install -e . && \
cd awq/kernels && \
python3 setup.py install && \
cd ../.. && \
pip3 install einops
```
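Before downloading the model, it can be worth verifying that both the Python package and the compiled CUDA kernels installed correctly. A minimal check, assuming the kernel extension is named `awq_inference_engine` as in the repo's `awq/kernels/setup.py` (adjust the import if the name differs in your checkout):
```sh
# Import the awq package (from `pip3 install -e .`) and the compiled kernel
# extension (built by `python3 setup.py install` in awq/kernels). The extension
# name is an assumption based on the repo's setup.py and may change upstream.
python3 -c "import awq, awq_inference_engine; print('llm-awq install OK')"
```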
Download the quantized model and run the TinyChat demo:
```sh
hfuser="casperhansen"
model_name="mpt-7b-8k-chat-awq"
group_size=128
repo_path="$hfuser/$model_name"
# Paths below assume the working directory is /workspace/llm-awq (cloned above).
model_path="/workspace/llm-awq/$model_name"
quantized_model_path="/workspace/llm-awq/$model_name/$model_name-w4-g$group_size.pt"

# The .pt checkpoint is stored with Git LFS, so git-lfs must be installed
# for the clone to pull the actual weights.
git clone https://huggingface.co/$repo_path
python3 tinychat/demo.py --model_type mpt \
--model_path $model_path \
--q_group_size $group_size \
--load_quant $quantized_model_path \
--precision W4A16
```
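If the demo fails to load the weights, first confirm that the checkpoint downloaded fully and is a readable PyTorch state dict. A minimal sketch, reusing the `$quantized_model_path` variable from the block above:
```sh
# Load the quantized checkpoint on CPU and print the first few entries.
# For AWQ W4 checkpoints these are typically packed integer weight tensors
# plus per-group scales and zeros (an assumption about the exact layout).
python3 -c "
import sys, torch
sd = torch.load(sys.argv[1], map_location='cpu')
print(len(sd), 'entries')
for name, t in list(sd.items())[:5]:
    print(name, tuple(t.shape), t.dtype)
" "$quantized_model_path"
```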
## Citation
Please cite this model using the following format:
```
@online{MosaicML2023Introducing,
    author  = {MosaicML NLP Team},
    title   = {Introducing MPT-30B: Raising the bar for open-source foundation models},
    year    = {2023},
    url     = {www.mosaicml.com/blog/mpt-30b},
    note    = {Accessed: 2023-06-22},
    urldate = {2023-06-22}
}
```