|
--- |
|
license: mit |
|
language: en |
|
tags: |
|
- LLM |
|
- Baichuan-7B |
|
- Baichuan-13B-base |
|
- Baichuan-13B-chat |
|
- Baichuan2-7B-base |
|
- Baichuan2-13B-base |
|
- Baichuan2-7B-chat |
|
- Baichuan2-13B-chat |
|
--- |
|
## Model Card for lyraBaichuan |
|
|
|
lyraBaichuan currently provides the **fastest implementations of the Baichuan models** (Baichuan-7B, Baichuan-13B, Baichuan2-7B, Baichuan2-13B) available. lyraBaichuan reaches an inference speed of up to **4300+ tokens/s** on A100, up to a **2.4x** speedup over the PyTorch version.
|
|
|
Among its main features are: |
|
- device: Nvidia GPU with Ampere architecture or Volta architecture (A100 or higher, V100).

- batch_size: compiled with dynamic batch size; the maximum depends on device memory.

- MEMOPT mode: significantly reduced VRAM usage and increased speed.
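If you are unsure which architecture flag your device corresponds to, the compute capability reported by PyTorch can serve as a quick check. A minimal sketch (illustrative only, not part of lyraBaichuan itself):

```python
import torch

# Ampere GPUs report compute capability 8.x; Volta reports 7.0.
major, minor = torch.cuda.get_device_capability(0)
if major >= 8:
    arch = "Ampere"
elif (major, minor) == (7, 0):
    arch = "Volta"
else:
    raise RuntimeError(f"Unsupported compute capability {major}.{minor}")
print(f"Detected compute capability {major}.{minor} -> arch: {arch}")
```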
|
|
|
We use the Baichuan2-7B-Base and Baichuan2-13B-Base models for measurement, but the optimized inference also applies to other Baichuan models, including Baichuan-7B and Baichuan-13B.
|
|
|
## Speed |
|
|
|
* Evaluated in tokens/s

* Tested on A100 40GB

* MEMOPT mode enabled
|
|
|
### Baichuan2-7B-Base |
|
|
|
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 | |
|
| --- | --- | --- | --- | --- | --- | |
|
| Torch 2.0.1 | 41.2 | 323.2 | 640.0 | 1256.8 | 2231.0 | |
|
| lyraBaichuan | 125.9 | 948.1 | 1749.3 | 2974.0 | 4370.1 | |
|
|
|
### Baichuan2-13B-Base |
|
|
|
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 | |
|
| --- | --- | --- | --- | --- | --- | |
|
| Torch 2.0.1 | 40.9 | 307.9 | 555.6 | 1010.4 | 1601.0 | |
|
| lyraBaichuan | 80.0 | 568.2 | 1124.4 | 1942.6 | 2828.0 | |
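A minimal sketch of how throughput numbers like these can be measured (illustrative only; the warm-up policy and timing loop are assumptions, not the exact benchmark script):

```python
import time

def tokens_per_second(model, prompt, batch_size, output_length=64):
    """Rough throughput estimate: generated tokens / wall-clock time."""
    prompts = [prompt] * batch_size
    # Warm-up run to exclude one-time initialization costs.
    model.generate(prompts, output_length=output_length, do_sample=False)
    start = time.time()
    model.generate(prompts, output_length=output_length, do_sample=False)
    elapsed = time.time() - start
    # Assumes every sequence in the batch generates output_length tokens.
    return batch_size * output_length / elapsed
```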
|
|
|
## Docker Environment Recommendation |
|
|
|
- For CUDA 11.x: we recommend ```nvcr.io/nvidia/pytorch:22.12-py3```

- For CUDA 12.0: we recommend ```nvcr.io/nvidia/pytorch:23.02-py3```
|
|
|
```bash
docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v $(pwd):/lyraBaichuan nvcr.io/nvidia/pytorch:23.02-py3

# inside the container
cd /lyraBaichuan
pip install -r requirements.txt
python demo.py
```
|
|
|
## Uses |
|
|
|
```python
from lyra_baichuan import lyraBaichuan7B, lyraBaichuan13B

model_path = "./models/Baichuan2-13B-lyra"
tokenizer_path = "./models/Baichuan2-13B-lyra"
inference_dtype = 'fp16'
prompt = "登鹳雀楼->王之涣\n夜雨寄北->"

memopt_mode = 1           # 1 enables MEMOPT mode (lower VRAM usage, higher speed)
max_output_length = 64
arch = "Ampere"           # "Ampere" or "Volta"
cuda_version = 12         # CUDA version; 11 and 12 are currently supported

# To use the 7B model, initialize with lyraBaichuan7B instead.
model = lyraBaichuan13B(model_path,
                        tokenizer_path=tokenizer_path,
                        dtype=inference_dtype,
                        memopt_mode=memopt_mode,
                        arch=arch,
                        cuda_version=cuda_version)

# Replicate the prompt to form a batch.
bs = 1
prompts = [prompt, ] * bs
output_texts = model.generate(
    prompts, output_length=max_output_length,
    top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0, do_sample=False)

print(output_texts)
```
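Continuing from the snippet above, sampled (non-greedy) decoding can presumably be enabled by setting `do_sample=True`; whether the sampling parameters are honored exactly this way in that mode is an assumption on our part, not something confirmed by the upstream documentation:

```python
# Assumption: do_sample=True switches generate() from greedy decoding
# to top-k / top-p sampling with the parameters passed below.
sampled_texts = model.generate(
    prompts, output_length=max_output_length,
    top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0,
    do_sample=True)
print(sampled_texts)
```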
|
|
|
## Demo Outputs |
|
|
|
### Baichuan2-13B-Base |
|
#### input |
|
|
|
登鹳雀楼->王之涣 |
|
|
|
夜雨寄北-> |
|
|
|
#### output |
|
|
|
李商隐 |
|
|
|
望洞庭->刘禹锡 |
|
|
|
黄鹤楼送孟浩然之广陵->李白 |
|
|
|
登岳阳楼->杜甫 |
|
|
|
秋词->刘禹锡 |
|
|
|
枫桥夜泊->张继 |
|
|
|
饮湖上初晴后雨->苏轼 |
|
|
|
浪淘沙->刘禹锡 |
|
|
|
## TODO |
|
1. Support for int4 quantization

2. Inference for longer contexts

3. Streaming inference mode
|
|
|
## Citation |
|
```bibtex
|
@Misc{lyraBaichuan2023, |
|
author = {Haoxiong Su and Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
|
title = {lyraBaichuan: Accelerating Baichuan models to 4300+ tokens/s}, |
|
howpublished = {\url{https://huggingface.co/TMElyralab/lyraBaichuan}}, |
|
year = {2023} |
|
} |
|
``` |
|
|
|
## Report Bugs

- Start a discussion at https://huggingface.co/TMElyralab/lyraBaichuan to report any bugs.

- Mark the title with a `[bug]` tag.
|
|