
Special Notes

We have released 7 lyraBaichuan models including lyraBaichuan-7B, lyraBaichuan-13B-Base, lyraBaichuan-13B-Chat, lyraBaichuan2-7B-Base, lyraBaichuan2-7B-Chat, lyraBaichuan2-13B-Base and lyraBaichuan2-13B-Chat.

These highly optimized Baichuan models run on Ampere (A100/A10) as well as Volta (V100) GPUs.

If you like our work and are considering joining us, feel free to drop a line at benbinwu@tencent.com.

Model Card for lyraBaichuan

lyraBaichuan is currently the fastest available version of the Baichuan models (Baichuan-7B, Baichuan-13B, Baichuan2-7B, Baichuan2-13B). It reaches an inference speed of up to 4300+ tokens/s on A100, up to a 2.4x speedup over the Torch version.

Among its main features are:

  • device: Nvidia GPU with Ampere or Volta architecture (A100 or higher, V100).
  • batch_size: compiled with dynamic batch size; the maximum depends on the device.
  • MEMOPT mode: significantly reduced VRAM usage and increased speed.

We use the Baichuan2-7B-Base and Baichuan2-13B-Base models for measurement, but the optimized inference also applies to other Baichuan models, including Baichuan-7B and Baichuan-13B.

Speed

  • Measured in tokens/s (total number of input and output tokens divided by inference time)
  • Tested on A100 40G
  • MEMOPT mode enabled
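
The metric above can be written out as a small helper. This is a minimal sketch of the stated definition, not the benchmark script used for the tables; `tokens_per_second`, `measure_throughput`, and the `run_inference` callable are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Tuple


def tokens_per_second(num_input_tokens: int, num_output_tokens: int, elapsed_s: float) -> float:
    """Throughput as defined above: (input tokens + output tokens) / inference time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (num_input_tokens + num_output_tokens) / elapsed_s


def measure_throughput(run_inference: Callable[[], Tuple[int, int]]) -> float:
    """Time a single call to `run_inference`.

    `run_inference` is an assumed callable that performs one generation
    and returns (input_token_count, output_token_count) summed over the batch.
    """
    start = time.perf_counter()
    n_in, n_out = run_inference()
    elapsed = time.perf_counter() - start
    return tokens_per_second(n_in, n_out, elapsed)
```

For example, a batch with 36 input tokens and 64 output tokens that finishes in 2.0 seconds gives `tokens_per_second(36, 64, 2.0) == 50.0`.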

Baichuan2-7B-Base

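| Version      | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
|--------------|--------------|--------------|---------------|---------------|---------------|
| Torch 2.0.1  | 41.2         | 323.2        | 640.0         | 1256.8        | 2231.0        |
| lyraBaichuan | 125.9        | 948.1        | 1749.3        | 2974.0        | 4370.1        |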

Baichuan2-13B-Base

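| Version      | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
|--------------|--------------|--------------|---------------|---------------|---------------|
| Torch 2.0.1  | 40.9         | 307.9        | 555.6         | 1010.4        | 1601.0        |
| lyraBaichuan | 80.0         | 568.2        | 1124.4        | 1942.6        | 2828.0        |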
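
To compare the two versions at a given batch size, divide the lyraBaichuan throughput by the Torch throughput. The sketch below does this for the figures in the two tables above; the dictionary layout and the `speedups` helper are illustrative only.

```python
# Throughput in tokens/s, copied from the tables above (A100 40G, MEMOPT mode).
BATCH_SIZES = [1, 8, 16, 32, 64]

THROUGHPUT = {
    "Baichuan2-7B-Base": {
        "torch": [41.2, 323.2, 640.0, 1256.8, 2231.0],
        "lyra": [125.9, 948.1, 1749.3, 2974.0, 4370.1],
    },
    "Baichuan2-13B-Base": {
        "torch": [40.9, 307.9, 555.6, 1010.4, 1601.0],
        "lyra": [80.0, 568.2, 1124.4, 1942.6, 2828.0],
    },
}


def speedups(model_name: str) -> dict:
    """Return {batch_size: lyraBaichuan throughput / Torch throughput}."""
    rows = THROUGHPUT[model_name]
    return {
        bs: lyra / torch
        for bs, torch, lyra in zip(BATCH_SIZES, rows["torch"], rows["lyra"])
    }


if __name__ == "__main__":
    for name in THROUGHPUT:
        for bs, ratio in speedups(name).items():
            print(f"{name} batch_size={bs}: {ratio:.2f}x")
```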

Docker Environment Recommendation

  • For CUDA 11.X: we recommend nvcr.io/nvidia/pytorch:22.12-py3
  • For CUDA 12.0: we recommend nvcr.io/nvidia/pytorch:23.02-py3
docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v "$(pwd)":/lyraBaichuan nvcr.io/nvidia/pytorch:23.02-py3

pip install -r requirements.txt
python demo.py

Uses

from lyra_baichuan import lyraBaichuan7B, lyraBaichuan13B

model_path = "./models/Baichuan2-13B-lyra"
tokenizer_path = "./models/Baichuan2-13B-lyra"
inference_dtype = 'fp16'
prompt = "登鹳雀楼->王之涣\n夜雨寄北->"

memopt_mode = 1
max_output_length = 64
arch = "Ampere" # Ampere or Volta
cuda_version = 12 # cuda version, we currently support 11 and 12

# To use 7B model, initialize with lyraBaichuan7B
model = lyraBaichuan13B(model_path, 
                        tokenizer_path = tokenizer_path, 
                        dtype = inference_dtype,
                        memopt_mode = memopt_mode,
                        arch = arch,
                        cuda_version = cuda_version)

bs = 1
prompts = [prompt, ] * bs
output_texts = model.generate(
        prompts, output_length=max_output_length,
        top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0, do_sample=False)

print(output_texts)
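
The comments in the example above limit `arch` to Ampere or Volta and `cuda_version` to 11 or 12. A minimal sketch of checking these values before constructing the model follows; `check_lyra_config` is a hypothetical helper, not part of `lyra_baichuan`, and the allowed sets are taken only from those comments.

```python
# Allowed values as stated in the usage example's comments (assumption:
# the library accepts exactly these spellings).
SUPPORTED_ARCHS = {"Ampere", "Volta"}
SUPPORTED_CUDA_VERSIONS = {11, 12}


def check_lyra_config(arch: str, cuda_version: int) -> None:
    """Raise ValueError early if the settings fall outside the documented options."""
    if arch not in SUPPORTED_ARCHS:
        raise ValueError(
            f"arch must be one of {sorted(SUPPORTED_ARCHS)}, got {arch!r}"
        )
    if cuda_version not in SUPPORTED_CUDA_VERSIONS:
        raise ValueError(
            f"cuda_version must be one of {sorted(SUPPORTED_CUDA_VERSIONS)}, got {cuda_version!r}"
        )
```

Calling `check_lyra_config(arch, cuda_version)` before `lyraBaichuan13B(...)` reports a mistyped value with a readable message before any model loading starts.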

Demo Outputs

Baichuan2-13B-Base

input

登鹳雀楼->王之涣

夜雨寄北->

output

李商隐

望洞庭->刘禹锡

黄鹤楼送孟浩然之广陵->李白

登岳阳楼->杜甫

秋词->刘禹锡

枫桥夜泊->张继

饮湖上初晴后雨->苏轼

浪淘沙->刘禹锡

TODO

  1. Support for int4.
  2. Inference with longer contexts.
  3. Streaming inference mode.

Citation

@Misc{lyraBaichuan2023,
  author =       {Haoxiong Su and Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
  title =        {lyraBaichuan: Accelerating Baichuan models to 4300+ tokens/s},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraBaichuan}},
  year =         {2023}
}

Report bug
