---
license: mit
language: en
tags:
- LLM
- Baichuan-7B
- Baichuan-13B-base
- Baichuan-13B-chat
- Baichuan2-7B-base
- Baichuan2-13B-base
- Baichuan2-7B-chat
- Baichuan2-13B-chat
---
## Model Card for lyraBaichuan
lyraBaichuan is currently the **fastest available** version of the Baichuan models (Baichuan-7B, Baichuan-13B, Baichuan2-7B, Baichuan2-13B), reaching up to **4300+ tokens/s** on an A100 and up to **2.4x** speedup over the torch version.

Among its main features are:

- device: NVIDIA GPUs with the Ampere architecture (A100 or newer) or the Volta architecture (V100).
- batch_size: compiled with a dynamic batch size; the maximum depends on the device.
- MEMOPT mode: significantly reduced VRAM usage and increased speed.

We use the Baichuan2-7B-Base and Baichuan2-13B-Base models for measurement, but the optimized inference also applies to other Baichuan models, including Baichuan-7B and Baichuan-13B.
## Speed
* Evaluated in tokens/s
* Tested on an A100 40GB
* MEMOPT mode
### Baichuan2-7B-Base
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 41.2 | 323.2 | 640.0 | 1256.8 | 2231.0 |
| lyraBaichuan | 125.9 | 948.1 | 1749.3 | 2974.0 | 4370.1 |
### Baichuan2-13B-Base
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 40.9 | 307.9 | 555.6 | 1010.4 | 1601.0 |
| lyraBaichuan | 80.0 | 568.2 | 1124.4 | 1942.6 | 2828.0 |
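
In the tables above, tokens/s counts the tokens generated across the whole batch per second of wall-clock time. The exact measurement script is not included in this card; the sketch below shows one way such numbers could be reproduced, assuming the `lyra_baichuan` API from the Uses section and that `generate` emits `output_length` tokens per prompt.

```python
import time
from lyra_baichuan import lyraBaichuan13B

# Same configuration as in the Uses section below.
model = lyraBaichuan13B("./models/Baichuan2-13B-lyra",
                        tokenizer_path="./models/Baichuan2-13B-lyra",
                        dtype="fp16", memopt_mode=1,
                        arch="Ampere", cuda_version=12)

batch_size, output_length = 64, 64
prompts = ["登鹳雀楼->王之涣\n夜雨寄北->"] * batch_size

start = time.time()
model.generate(prompts, output_length=output_length,
               top_k=30, top_p=0.85, temperature=1.0,
               repetition_penalty=1.0, do_sample=False)
elapsed = time.time() - start

# Throughput = generated tokens across the batch / wall-clock seconds.
print(f"{batch_size * output_length / elapsed:.1f} tokens/s")
```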
## Docker Environment Recommendation
- For CUDA 11.x: we recommend ```nvcr.io/nvidia/pytorch:22.12-py3```
- For CUDA 12.0: we recommend ```nvcr.io/nvidia/pytorch:23.02-py3```
```bash
docker pull nvcr.io/nvidia/pytorch:23.02-py3
# Mount the repository into the container (bind mounts need an absolute path)
docker run --rm -it --gpus all -v $(pwd):/lyraBaichuan nvcr.io/nvidia/pytorch:23.02-py3

# Inside the container:
cd /lyraBaichuan
pip install -r requirements.txt
python demo.py
```
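
Once inside the container, it can help to confirm that PyTorch sees the GPUs before installing the requirements. A minimal check (torch ships preinstalled in the NGC images):

```python
import torch

# Confirms that the GPUs passed through with `--gpus all` are visible.
print(torch.cuda.is_available())      # expected: True
print(torch.cuda.device_count())      # number of visible GPUs
print(torch.cuda.get_device_name(0))  # e.g. an A100 or V100
```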
## Uses
```python
from lyra_baichuan import lyraBaichuan7B, lyraBaichuan13B

model_path = "./models/Baichuan2-13B-lyra"
tokenizer_path = "./models/Baichuan2-13B-lyra"
inference_dtype = 'fp16'
# Few-shot prompt of the form "poem title -> poet"; the model is expected to
# complete the author of 夜雨寄北.
prompt = "登鹳雀楼->王之涣\n夜雨寄北->"

memopt_mode = 1
max_output_length = 64
arch = "Ampere"     # Ampere or Volta
cuda_version = 12   # CUDA version, we currently support 11 and 12

# To use a 7B model, initialize with lyraBaichuan7B instead.
model = lyraBaichuan13B(model_path,
                        tokenizer_path=tokenizer_path,
                        dtype=inference_dtype,
                        memopt_mode=memopt_mode,
                        arch=arch,
                        cuda_version=cuda_version)

bs = 1
prompts = [prompt] * bs
output_texts = model.generate(
    prompts, output_length=max_output_length,
    top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0, do_sample=False)

print(output_texts)
```
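
The `arch` argument has to match the GPU generation. If you prefer not to hard-code it, the compute capability reported by PyTorch can be mapped onto the two supported values; a small sketch (the helper name and the mapping to the "Ampere"/"Volta" strings are assumptions based on the example above):

```python
import torch

def detect_arch() -> str:
    # Compute capability 8.x and above covers Ampere-class GPUs such as the A100;
    # 7.0 is Volta (V100). Other GPUs are not covered by this card.
    major, minor = torch.cuda.get_device_capability(0)
    if major >= 8:
        return "Ampere"
    if (major, minor) == (7, 0):
        return "Volta"
    raise RuntimeError(f"Unsupported compute capability {major}.{minor}")

arch = detect_arch()
```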
## Demo Outputs
### Baichuan2-13B-Base
#### Input
```
登鹳雀楼->王之涣
夜雨寄北->
```
#### Output
```
李商隐
望洞庭->刘禹锡
黄鹤楼送孟浩然之广陵->李白
登岳阳楼->杜甫
秋词->刘禹锡
枫桥夜泊->张继
饮湖上初晴后雨->苏轼
浪淘沙->刘禹锡
```
## TODO
1. Support for int4 quantization
2. Inference for longer-context scenarios
3. Streaming inference mode
## Citation
```bibtex
@Misc{lyraBaichuan2023,
  author =       {Haoxiong Su and Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
  title =        {lyraBaichuan: Accelerating Baichuan models to 4300+ tokens/s},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraBaichuan}},
  year =         {2023}
}
```
## Report Bugs
- Start a discussion at https://huggingface.co/TMElyralab/lyraBaichuan to report any bugs.
- Include a `[bug]` mark in the title.