Model Card for lyraLLMs

Introduction

We have released lyraLLMs, a highly optimized and easy-to-use inference engine for LLMs.

lyraLLMs runs on the following NVIDIA GPU architectures:

Volta (V100)
Turing (T4)
Ampere (A100/A10)
Ada Lovelace (RTX 4090, etc.)
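
As a rough aid for checking a machine against the list above, the sketch below maps CUDA compute capabilities to the architectures named there. The capability numbers are NVIDIA's published values for the example GPUs. The lookup table and the function name are illustrative assumptions, not part of lyraLLMs, and other GPUs of the same architectures are not covered by this table.

```python
# Compute capabilities of the example GPUs named in the list above.
# This table is an illustration only; it is not a check performed by lyraLLMs.
ARCH_BY_CAPABILITY = {
    (7, 0): "Volta",         # V100
    (7, 5): "Turing",        # T4
    (8, 0): "Ampere",        # A100
    (8, 6): "Ampere",        # A10
    (8, 9): "Ada Lovelace",  # RTX 4090
}


def listed_architecture(major, minor):
    """Return the architecture name if the capability matches a GPU listed above, else None."""
    return ARCH_BY_CAPABILITY.get((major, minor))


print(listed_architecture(8, 0))
print(listed_architecture(9, 0))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair for the current device, which can be passed to the helper.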

lyraLLMs supports many popular HuggingFace models, including:

BELLE
ChatGLM
LLaMA
LLaMA 2
XVERSE
Baichuan 1 & 2

lyraLLMs is fast, memory-efficient, and easy to use. It offers:

State-of-the-art throughput (up to 7K tokens/s for LLaMA 13B)
Efficient memory usage of attention with FlashAttention2
Quantization: MEMOPT mode (W8A16, W4A16), KVCache Int8
Easy-to-use Python API to serve LLMs
Streaming outputs
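
To make the W8A16 notation concrete (8-bit weights, 16-bit activations), here is a minimal sketch of weight-only quantization in pure Python. It assumes symmetric per-row scaling, which is one common scheme; this document does not describe how lyraLLMs quantizes weights, so treat it as a generic illustration rather than the engine's kernel. All function names are hypothetical, and Python floats stand in for fp16 activations.

```python
def quantize_row_int8(weights):
    """Symmetric per-row quantization: returns (list of ints in [-127, 127], scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def linear_w8a16(x, q_rows, scales):
    """Compute y = W x with W stored as int8 rows plus one float scale per row.

    The activations x stay in floating point; only the weights are quantized.
    """
    return [
        scale * sum(qv * xv for qv, xv in zip(q, x))
        for q, scale in zip(q_rows, scales)
    ]


# Tiny example matrix and input (made-up values for illustration).
W = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
x = [1.0, 2.0, 3.0]

quantized = [quantize_row_int8(row) for row in W]
q_rows = [q for q, _ in quantized]
scales = [s for _, s in quantized]

y_exact = [sum(w * xv for w, xv in zip(row, x)) for row in W]
y_quant = linear_w8a16(x, q_rows, scales)
print(y_exact, y_quant)
```

The quantized result differs from the exact one by a small rounding error, bounded per row by half the scale times the sum of the absolute activations. Storing 8-bit instead of 16-bit weights halves weight memory; W4A16 halves it again.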

If you like our work and are interested in joining us, feel free to drop us a line at benbinwu@tencent.com.

Speed

Settings

Measured in tokens/s (input + output)
Tested on A100 40G, CUDA 12.0
MEMOPT mode and KVCache Int8 enabled
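
The metric in these settings can be sketched as follows. This is a minimal illustration that assumes token counts are summed over the whole batch and that the timer wraps the full generation call; the helper names and the example numbers are hypothetical and are not taken from the benchmark scripts.

```python
import time


def throughput_tokens_per_s(input_tokens, output_tokens, elapsed_s):
    """Tokens per second, counting both input and output tokens."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (input_tokens + output_tokens) / elapsed_s


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical run: 64 prompts, 20 input tokens and 150 output tokens each, 2.5 s total.
batch_size = 64
example = throughput_tokens_per_s(batch_size * 20, batch_size * 150, 2.5)
print(example)
```

On a GPU, the measured time is only meaningful if the generation call has finished all device work before the timer stops.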

Throughputs

XVERSE-13B-Chat

Input

北京的景点：故宫、天坛、万里长城等。\n深圳的景点：

Version	Batch Size 1	Batch Size 64	Batch Size 128	Batch Size 256	Batch Size 512
Torch 2.1.0	52.9	2308.1	OOM
lyraXVERSE	200.4	4624.8	5759.7	6075.6	5733

Baichuan2-7B-Base

Input

北京的景点：登鹳雀楼->王之涣\n夜雨寄北->

Version	Batch Size 1	Batch Size 8	Batch Size 16	Batch Size 32	Batch Size 64
Torch 2.0.1	41.2	323.2	640.0	1256.8	2231.0
lyraBaichuan	125.9	948.1	1749.3	2974.0	4370.1

Baichuan2-13B-Base

Input

北京的景点：登鹳雀楼->王之涣\n夜雨寄北->

Version	Batch Size 1	Batch Size 8	Batch Size 16	Batch Size 32	Batch Size 64
Torch 2.0.1	40.9	307.9	555.6	1010.4	1601.0
lyraBaichuan	80.0	568.2	1124.4	1942.6	2828.0

Yi-6B

Input

# write the quick sort algorithm

Version	Batch Size 1	Batch Size 8	Batch Size 16	Batch Size 32	Batch Size 64
Torch 2.1.0	31.4	247.5	490.4	987.2	1796.3
lyraLLaMA	93.8	735.6	2339.8	3020.9	4630.8

Yi-34B

Due to VRAM limitations, we could not profile the Torch throughput of Yi-34B on an A100 40G.
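
A back-of-the-envelope estimate shows why 16-bit weights alone do not fit. The sketch assumes a nominal 34e9 parameters (taken from the model name, not an exact count) and ignores the KV cache, activations, and quantization scales, so real usage is higher than these figures.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


NOMINAL_PARAMS = 34e9  # assumption: nominal size implied by the name "Yi-34B"

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(NOMINAL_PARAMS, bits):.1f} GiB")
```

Under these assumptions, 16-bit weights need roughly 63 GiB, well above 40 GB, while 8-bit and 4-bit weights need about 32 GiB and 16 GiB respectively.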

Input

Let me tell you an interesting story about cat Tom and mouse Jerry,

Version	Batch Size 1	Batch Size 8	Batch Size 16	Batch Size 32	Batch Size 64
lyraLLaMA	52.5	399.4	753.0	1138.2	1926.2
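
The speedups implied by the tables above can be computed directly. The sketch below copies the batch-size-1 and batch-size-64 columns for the four models that have a Torch baseline; the dictionary layout and function name are illustrative choices.

```python
# Throughputs (tokens/s) copied from the tables above.
# Layout: model -> {batch_size: (torch_tokens_per_s, lyra_tokens_per_s)}
THROUGHPUT = {
    "XVERSE-13B-Chat": {1: (52.9, 200.4), 64: (2308.1, 4624.8)},
    "Baichuan2-7B-Base": {1: (41.2, 125.9), 64: (2231.0, 4370.1)},
    "Baichuan2-13B-Base": {1: (40.9, 80.0), 64: (1601.0, 2828.0)},
    "Yi-6B": {1: (31.4, 93.8), 64: (1796.3, 4630.8)},
}


def speedup(baseline, optimized):
    """Ratio of optimized throughput to baseline throughput."""
    return optimized / baseline


SPEEDUPS = {
    model: {bs: round(speedup(*pair), 2) for bs, pair in rows.items()}
    for model, rows in THROUGHPUT.items()
}

for model, by_batch in SPEEDUPS.items():
    print(model, by_batch)
```

For these entries the ratio falls between roughly 1.8x and 3.8x, with the largest gains at batch size 1.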

Usage

Environment (Docker recommended)

For CUDA 11.X: we recommend nvcr.io/nvidia/pytorch:22.12-py3
For CUDA 12.0: we recommend nvcr.io/nvidia/pytorch:23.02-py3

docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v "$(pwd)":/lyraLLMs nvcr.io/nvidia/pytorch:23.02-py3

pip install -r requirements.txt

Convert Models

We have released multiple optimized models converted from the original HuggingFace models:

ChatGLM-6B
XVERSE-13B-Chat
LLaMA-Ziya-13B
Baichuan-7B, Baichuan-13B-Base, Baichuan-13B-Chat, Baichuan2-7B-Base, Baichuan2-7B-Chat, Baichuan2-13B-Base, and Baichuan2-13B-Chat
Yi-6B, Yi-34B

Feel free to contact us if you would like to convert a fine-tuned LLM.

Inference

See README.md for how to run inference on converted models with lyraLLMs.

Python Demo

from lyra_llama import lyraLlama

model_path = 'XXX' # directory containing the converted model weights, config, and tokenizer files
data_type = 'fp16'
memopt_mode = 0 # set memopt_mode=1 to run inference in MEMOPT mode

model = lyraLlama(model_path, data_type, memopt_mode)

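# Prompt in English: "List 3 different machine learning algorithms and explain where each applies."
prompts = '列出3个不同的机器学习算法，并说明它们的适用范围.'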
prompts = [prompts,] * 64

output_texts = model.generate(prompts, output_length=150, do_sample=False, top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0)
print(output_texts)
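
When there are more prompts than fit in one batch, the call in the demo can be wrapped in a small batching loop. This is a sketch under two assumptions drawn from the demo: the generate function takes a list of prompt strings, and it returns one output string per prompt in the same order. The helper name and the stand-in generate function are hypothetical, and the stand-in exists only so the example runs without a GPU.

```python
def generate_in_batches(generate_fn, prompts, batch_size=64, **generate_kwargs):
    """Run generate_fn over prompts in chunks of batch_size and keep the output order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    outputs = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        outputs.extend(generate_fn(batch, **generate_kwargs))
    return outputs


def fake_generate(batch, output_length=150, **_ignored):
    """Stand-in for model.generate: returns one string per prompt."""
    return [f"{prompt} -> <{output_length} tokens>" for prompt in batch]


prompts_list = [f"prompt {i}" for i in range(130)]
results = generate_in_batches(fake_generate, prompts_list, batch_size=64, output_length=150)
print(len(results), results[0])
```

With a real model, `fake_generate` would be replaced by `model.generate` and the keyword arguments from the demo (`output_length`, `do_sample`, `top_k`, and so on) passed through unchanged.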

Citation

@Misc{lyraLLMs2024,
  author =       {Kangjian Wu, Zhengtao Wang, Yibo Lu, Haoxiong Su, Bin Wu},
  title =        {lyraLLMs: A highly optimized and easy-to-use inference engine for LLMs},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraLLMs}},
  year =         {2024}
}

Reporting Bugs

To report a bug, start a discussion at https://huggingface.co/TMElyralab/lyraLLMs/discussions
Please include a [bug] mark in the title.