---
license: mit
language: en
tags:
- LLM
- Baichuan-7B
- Baichuan-13B-base
- Baichuan-13B-chat
- Baichuan2-7B-base
- Baichuan2-13B-base
- Baichuan2-7B-chat
- Baichuan2-13B-chat
---
## Model Card for lyraBaichuan
lyraBaichuan currently provides the **fastest inference** available for the Baichuan models (Baichuan-7B, Baichuan-13B, Baichuan2-7B, Baichuan2-13B). It reaches up to **4300+ tokens/s** on an A100, up to **2.4x** faster than the PyTorch version.
Among its main features are:
- device: NVIDIA GPUs with the Ampere architecture (A100 or higher) or the Volta architecture (V100); a quick capability check is sketched below.
- batch_size: compiled with dynamic batch size; the maximum depends on the device.
- MEMOPT mode: significantly reduced VRAM usage and increased speed.

We use the Baichuan2-7B-Base and Baichuan2-13B-Base models for the measurements below, but the optimized inference also applies to other Baichuan models, including Baichuan-7B and Baichuan-13B.
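Since the loader in the Uses section takes an `arch` argument ("Ampere" or "Volta"), one way to pick the right value is to read the GPU's compute capability with PyTorch. This is a minimal sketch, assuming a single visible GPU; it is not part of the lyraBaichuan API.

```python
import torch

# Hedged sketch: map the GPU's compute capability to the arch string expected by lyraBaichuan.
# Ampere GPUs report compute capability 8.x, Volta GPUs report 7.0.
major, minor = torch.cuda.get_device_capability(0)
if major >= 8:
    arch = "Ampere"
elif (major, minor) == (7, 0):
    arch = "Volta"
else:
    raise RuntimeError(f"Unsupported GPU: compute capability {major}.{minor}")
print(arch)
```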
## Speed
* Measured in tokens/s
* Tested on an A100 40G
* MEMOPT mode enabled
### Baichuan2-7B-Base
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 41.2 | 323.2 | 640.0 | 1256.8 | 2231.0 |
| lyraBaichuan | 125.9 | 948.1 | 1749.3 | 2974.0 | 4370.1 |
### Baichuan2-13B-Base
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 40.9 | 307.9 | 555.6 | 1010.4 | 1601.0 |
| lyraBaichuan | 80.0 | 568.2 | 1124.4 | 1942.6 | 2828.0 |
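These numbers can be reproduced with a simple wall-clock measurement: generate a fixed number of tokens for each prompt in a batch and divide the total generated tokens by the elapsed time. The sketch below is illustrative rather than the official benchmark script, and it assumes the `lyraBaichuan13B` loader and `generate` call shown in the Uses section further down.

```python
import time

from lyra_baichuan import lyraBaichuan13B

# Illustrative throughput check; loading arguments follow the Uses section below.
model = lyraBaichuan13B("./models/Baichuan2-13B-lyra",
                        tokenizer_path="./models/Baichuan2-13B-lyra",
                        dtype='fp16', memopt_mode=1,
                        arch="Ampere", cuda_version=12)

def tokens_per_second(prompt, batch_size, output_length=64):
    prompts = [prompt] * batch_size
    start = time.time()
    model.generate(prompts, output_length=output_length,
                   top_k=30, top_p=0.85, temperature=1.0,
                   repetition_penalty=1.0, do_sample=False)
    elapsed = time.time() - start
    # Count only the newly generated tokens across the whole batch.
    return batch_size * output_length / elapsed

print(tokens_per_second("登鹳雀楼->王之涣\n夜雨寄北->", batch_size=64))
```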
## Docker Environment Recommendation
- For CUDA 11.x: we recommend ```nvcr.io/nvidia/pytorch:22.12-py3```
- For CUDA 12.0: we recommend ```nvcr.io/nvidia/pytorch:23.02-py3```
```bash
# Pull the recommended image (CUDA 12.0 example)
docker pull nvcr.io/nvidia/pytorch:23.02-py3
# Start a container with GPU access and mount this repository into it
docker run --rm -it --gpus all -v ./:/lyraBaichuan nvcr.io/nvidia/pytorch:23.02-py3
# Inside the container: install dependencies and run the demo
cd /lyraBaichuan
pip install -r requirements.txt
python demo.py
```
## Uses
```python
from lyra_baichuan import lyraBaichuan7B, lyraBaichuan13B

model_path = "./models/Baichuan2-13B-lyra"
tokenizer_path = "./models/Baichuan2-13B-lyra"
inference_dtype = 'fp16'
prompt = "登鹳雀楼->王之涣\n夜雨寄北->"

memopt_mode = 1          # 1 enables MEMOPT mode (optimized VRAM usage)
max_output_length = 64
arch = "Ampere"          # "Ampere" or "Volta"
cuda_version = 12        # CUDA version; 11 and 12 are currently supported

# To use a 7B model, initialize with lyraBaichuan7B instead.
model = lyraBaichuan13B(model_path,
                        tokenizer_path=tokenizer_path,
                        dtype=inference_dtype,
                        memopt_mode=memopt_mode,
                        arch=arch,
                        cuda_version=cuda_version)

# Build a batch of prompts (batch size 1 here).
bs = 1
prompts = [prompt, ] * bs

output_texts = model.generate(
    prompts, output_length=max_output_length,
    top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0, do_sample=False)

print(output_texts)
```
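To approach the batched throughput reported in the Speed section, pass several prompts in a single `generate` call. A short follow-up sketch, reusing the `model`, `prompt`, and `max_output_length` defined above:

```python
# Reuse the model created above with a larger batch (size 16 chosen as an example).
bs = 16
prompts = [prompt] * bs

output_texts = model.generate(
    prompts, output_length=max_output_length,
    top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0, do_sample=False)

# One output is expected per input prompt (assumption based on the single-prompt demo above).
for text in output_texts:
    print(text)
```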
## Demo Outputs
### Baichuan2-13B-Base
#### input
登鹳雀楼->王之涣
夜雨寄北->
#### output
李商隐
望洞庭->刘禹锡
黄鹤楼送孟浩然之广陵->李白
登岳阳楼->杜甫
秋词->刘禹锡
枫桥夜泊->张继
饮湖上初晴后雨->苏轼
浪淘沙->刘禹锡
## TODO
1. Support for int4.
2. Inference for longer contexts.
3. Streaming inference mode.
## Citation
``` bibtex
@Misc{lyraBaichuan2023,
author = {Haoxiong Su and Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
title = {lyraBaichuan: Accelerating Baichuan models to 4300+ tokens/s},
howpublished = {\url{https://huggingface.co/TMElyralab/lyraBaichuan}},
year = {2023}
}
```
## Report bug
- Start a discussion at https://huggingface.co/TMElyralab/lyraBaichuan to report any bugs.
- Include a `[bug]` mark in the title of your report.