
metadata

license: creativeml-openrail-m
language: en
tags:
  - LLM
  - ChatGLM6B

Breaking news!

We know what you want, and here it is!

The newly released lyraChatGLM model supports Ampere (A100/A10) as well as Volta (V100) GPUs.
lyraChatGLM has been further optimized and now reaches 9000 tokens/s on A100 and 3900 tokens/s on V100, about 5.5x faster than the original version (2023/6/1).
Memory usage has been optimized as well: batch_size can now be set as high as 256 on A100!

Note that the code has been fully updated as well, so you need to use the new API; see Uses below.

Model Card for lyraChatGLM

lyraChatGLM is currently the fastest ChatGLM-6B available. To the best of our knowledge, it is the first accelerated version of ChatGLM-6B.

The inference speed of lyraChatGLM has achieved 300x acceleration over the early original version. We are still working hard to further improve performance.

Among its main features are:

weights: original ChatGLM-6B weights released by THUDM.
device: NVIDIA GPU with Ampere or Volta architecture (A100, A10, V100...).
batch_size: compiled with dynamic batch size, maximum depends on device.
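Because the supported devices fall into two architecture families, it can be handy to derive the architecture name from the GPU model. The helper below is a minimal sketch, not part of lyraChatGLM: `arch_for_gpu` and `SUPPORTED_ARCHS` are hypothetical names, and the mapping only covers the three devices this README lists.

```python
# Hypothetical helper (not part of lyraChatGLM): maps a GPU product name
# to the architecture family named in this README ("Ampere" or "Volta").
SUPPORTED_ARCHS = {
    "A100": "Ampere",
    "A10": "Ampere",
    "V100": "Volta",
}


def arch_for_gpu(gpu_name: str) -> str:
    """Return "Ampere" or "Volta" for a name such as "NVIDIA A100-SXM4-40GB"."""
    upper = gpu_name.upper()
    # Check longer model names first so "A100" is matched before "A10".
    for model in sorted(SUPPORTED_ARCHS, key=len, reverse=True):
        if model in upper:
            return SUPPORTED_ARCHS[model]
    raise ValueError(f"GPU {gpu_name!r} is not among the devices listed in this README")
```

The returned string is intended to be used as the architecture argument when constructing the model; devices outside the list raise an error rather than guessing.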

Speed

original version (fixed batch infer): commit id 1d240ba

test on A100 40G

version	max_batch_size	max_speed
original	1	30 tokens/s
original (fixed batch infer)	192	1638.52 tokens/s
lyraChatGLM(current)	256	9082.60+ tokens/s

test on V100

version	max_batch_size	max_speed
original	1	17.83 tokens/s
original (fixed batch infer)	128	992.20 tokens/s
lyraChatGLM(current)	192	3911.45+ tokens/s
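The speedup figures quoted in this README can be rechecked from the tables. The sketch below only divides the published peak throughputs; the dictionary keys and the `speedup` function are illustrative names. On A100 the ratios come out at roughly 5.5x over the fixed-batch version and roughly 300x over the batch-size-1 original; on V100 the same ratios are roughly 3.9x and 219x.

```python
# Peak throughput in tokens/s, copied from the tables in this README.
a100 = {"original": 30.0, "fixed_batch": 1638.52, "current": 9082.60}
v100 = {"original": 17.83, "fixed_batch": 992.20, "current": 3911.45}


def speedup(results: dict, baseline: str) -> float:
    """Ratio of the current version's peak throughput to a baseline's."""
    return results["current"] / results[baseline]


for name, results in (("A100", a100), ("V100", v100)):
    print(
        f"{name}: {speedup(results, 'fixed_batch'):.2f}x over fixed-batch, "
        f"{speedup(results, 'original'):.0f}x over batch-size-1"
    )
```

Note that these are ratios of peak throughput at each version's own maximum batch size, so they compare best-case numbers rather than equal-batch runs.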

Model Sources

Repository: https://huggingface.co/THUDM/chatglm-6b

Docker Environment

A Docker image is available at https://hub.docker.com/repository/docker/bigmoyan/lyrallm/general. Pull the image with:

docker pull bigmoyan/lyrallm:v0.1

Uses

from lyraChatGLM import LyraChatGLM6B

# Paths to the model weights and the tokenizer files
model_path = "./models/1-gpu-fp16.h5"
tokenizer_path = "./models"
data_type = "fp16"
int8_mode = 0
max_output_length = 150
arch = "Ampere"  # "Ampere" or "Volta"

model = LyraChatGLM6B(model_path, tokenizer_path, data_type, int8_mode, arch)
# Prompt (in Chinese): "List 3 different machine learning algorithms and explain where each applies."
prompt = "列出3个不同的机器学习算法，并说明它们的适用范围."
test_batch_size = 256

# A single prompt is used here; to fill a whole batch you could use [prompt] * test_batch_size
prompts = [prompt, ]

# If you want different outputs within the same batch, set do_sample to True
output_texts = model.generate(
    prompts,
    output_length=max_output_length,
    top_k=30,
    top_p=0.85,
    temperature=0.35,
    repetition_penalty=1.2,
    do_sample=False,
)

print(output_texts)
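Since the maximum batch size depends on the device, a long prompt list may need to be split into chunks before generation. The following is a minimal sketch under that assumption: `generate_in_batches` is a hypothetical helper, not part of the lyraChatGLM API, and it takes any callable that maps a list of prompts to a list of outputs.

```python
from typing import Callable, List, Sequence


def generate_in_batches(
    generate: Callable[[List[str]], List[str]],
    prompts: Sequence[str],
    max_batch_size: int,
) -> List[str]:
    """Run `generate` over `prompts` in chunks of at most `max_batch_size`."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    outputs: List[str] = []
    for start in range(0, len(prompts), max_batch_size):
        batch = list(prompts[start:start + max_batch_size])
        # Outputs are appended in order, so they line up with the input prompts.
        outputs.extend(generate(batch))
    return outputs


# Possible usage with the model from the example above (assumes `model`,
# `prompts` and `max_output_length` are already defined):
# outputs = generate_in_batches(
#     lambda batch: model.generate(batch, output_length=max_output_length, do_sample=False),
#     prompts,
#     max_batch_size=256,
# )
```

Passing the generation call as a callable keeps the helper independent of the model object, so it can be exercised with a stub before running on a GPU.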

Demo output

input

列出3个不同的机器学习算法，并说明它们的适用范围.
(English: List 3 different machine learning algorithms and explain where each applies.)

output

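以下是三个常见的机器学习算法及其适用范围:

决策树(Decision Tree):决策树是一种基于分类和回归问题的朴素贝叶斯模型。它通过构建一系列逐步分裂的分支来预测结果。适用于那些具有简单特征、大量数据且数据集大小在可接受范围内的情况。
随机森林(Random Forest):随机森林是一种集成学习算法,由多个决策树组成。它的优点是能够处理大规模数据和高维度的特征。适用于需要对多个变量进行建模的场景,例如医疗诊断、金融风险评估等。
支持向量机(Support Vector Machine):支持向量机是一种监督学习方法,通常用于分类问题。它可以处理高维数据,并且具有较高的准确性。适用于需要对高维数据进行分类或回归的问题,例如图像识别、自然语言处理等。

English translation of the model output, reproduced as generated:

Here are three common machine learning algorithms and where they apply:

Decision Tree: A decision tree is a naive Bayes model based on classification and regression problems. It predicts results by building a series of progressively splitting branches. It suits cases with simple features, large amounts of data, and a dataset size within an acceptable range.
Random Forest: Random forest is an ensemble learning algorithm made up of multiple decision trees. Its advantage is the ability to handle large-scale data and high-dimensional features. It suits scenarios that require modeling multiple variables, such as medical diagnosis and financial risk assessment.
Support Vector Machine: A support vector machine is a supervised learning method, usually used for classification problems. It can handle high-dimensional data and has high accuracy. It suits problems that require classification or regression on high-dimensional data, such as image recognition and natural language processing.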

Citation

@Misc{lyraChatGLM2023,
  author =       {Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
  title =        {lyraChatGLM: Accelerating ChatGLM by 5.5x+},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraChatGLM}},
  year =         {2023}
}

Report bug

Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraChatGLM/discussions
Mark bug reports with [bug] in the title.