---
license: mit
language: en
tags:
- LLM
- Baichuan-7B
- Baichuan-13B-base
- Baichuan-13B-chat
- Baichuan2-7B-base
- Baichuan2-13B-base
- Baichuan2-7B-chat
- Baichuan2-13B-chat
---

## Model Card for lyraBaichuan

lyraBaichuan is currently the **fastest available implementation of the Baichuan models** (Baichuan-7B, Baichuan-13B, Baichuan2-7B, Baichuan2-13B).

Its inference speed reaches up to **4300+ tokens/s** on A100, up to **2.4x** faster than the Torch version.

Among its main features are:

- device: Nvidia GPU with Ampere or Volta architecture (A100 or higher, V100).
- batch_size: compiled with dynamic batch size; the maximum depends on the device.
- MEMOPT mode: significantly reduced VRAM usage and increased speed.

We use the Baichuan2-7B-Base and Baichuan2-13B-Base models for measurement, but the optimized inference also applies to other Baichuan models, including Baichuan-7B and Baichuan-13B.

## Speed

* Measured in tokens/s
* Tested on A100 40G
* MEMOPT mode

### Baichuan2-7B-Base

| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 41.2 | 323.2 | 640.0 | 1256.8 | 2231.0 |
| lyraBaichuan MEMOPT | 125.9 | 948.1 | 1749.3 | 2974.0 | 4370.1 |

### Baichuan2-13B-Base

| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 40.9 | 307.9 | 555.6 | 1010.4 | 1601.0 |
| lyraBaichuan MEMOPT | 80.0 | 568.2 | 1124.4 | 1942.6 | 2828.0 |

## Docker Environment Recommendation

- For CUDA 11.X: we recommend `nvcr.io/nvidia/pytorch:22.12-py3`
- For CUDA 12.0: we recommend `nvcr.io/nvidia/pytorch:23.02-py3`

```bash
docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v ./:/lyraBaichuan nvcr.io/nvidia/pytorch:23.02-py3

# Then, inside the container:
pip install -r requirements.txt
python demo.py
```

## Uses

```python
from lyra_baichuan import lyraBaichuan7B, lyraBaichuan13B

model_path = "./models/Baichuan2-13B-lyra"
tokenizer_path = "./models/Baichuan2-13B-lyra"
inference_dtype = 'fp16'
prompt = "登鹳雀楼->王之涣\n夜雨寄北->"

memopt_mode = 1
max_output_length = 64
arch = "Ampere"  # Ampere or Volta
cuda_version = 12  # CUDA version; 11 and 12 are currently supported

# To use the 7B model, initialize with lyraBaichuan7B instead
model = lyraBaichuan13B(model_path,
                        tokenizer_path=tokenizer_path,
                        dtype=inference_dtype,
                        memopt_mode=memopt_mode,
                        arch=arch,
                        cuda_version=cuda_version)

# Build a batch by repeating the same prompt bs times
bs = 1
prompts = [prompt, ] * bs
output_texts = model.generate(
    prompts,
    output_length=max_output_length,
    top_k=30,
    top_p=0.85,
    temperature=1.0,
    repetition_penalty=1.0,
    do_sample=False)

print(output_texts)
```

## Demo Outputs

### Baichuan2-13B-Base

#### input

登鹳雀楼->王之涣
夜雨寄北->

#### output

李商隐
望洞庭->刘禹锡
黄鹤楼送孟浩然之广陵->李白
登岳阳楼->杜甫
秋词->刘禹锡
枫桥夜泊->张继
饮湖上初晴后雨->苏轼
浪淘沙->刘禹锡

## TODO

1. Support for int4.
2. Inference for longer contexts.
3. Streaming inference mode.

## Citation

```bibtex
@Misc{lyraBaichuan2023,
  author =       {Haoxiong Su and Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
  title =        {lyraBaichuan: Accelerating Baichuan models to 4300+ tokens/s},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraBaichuan}},
  year =         {2023}
}
```

## Report bug

- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraBaichuan
- Mark bug reports with `[bug]` in the title.
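
## Appendix: speedup ratios from the Speed tables

The Speed tables list raw throughput only, so it can help to turn them into per-batch-size speedup ratios. The following is a minimal sketch, not part of the lyraBaichuan package: the values are copied from the Baichuan2-7B-Base table, and the names `batch_sizes`, `torch_tps`, `lyra_tps`, and `speedups` are illustrative assumptions.

```python
# Throughput in tokens/s, copied from the Baichuan2-7B-Base table
# (A100 40G, MEMOPT mode). Variable names are illustrative only.
batch_sizes = [1, 8, 16, 32, 64]
torch_tps = [41.2, 323.2, 640.0, 1256.8, 2231.0]
lyra_tps = [125.9, 948.1, 1749.3, 2974.0, 4370.1]


def speedups(baseline, optimized):
    """Return optimized/baseline throughput ratios, element by element."""
    return [opt / base for base, opt in zip(baseline, optimized)]


ratios = speedups(torch_tps, lyra_tps)
for bs, ratio in zip(batch_sizes, ratios):
    print(f"batch size {bs:>2}: {ratio:.2f}x")
```

The same helper works for the Baichuan2-13B-Base table by swapping in its two rows.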