	---
	license: creativeml-openrail-m
	language: en
	tags:
	- LLM
	- ChatGLM6B

	---


	## Breaking news!

	We know what you want, and here it is!

	- Newly released lyraChatGLM model, suitable for Ampere (A100/A10) as well as Volta (V100) GPUs.
	- lyraChatGLM has been further optimized and now reaches 9000 tokens/s on A100 and 3900 tokens/s on V100. On A100 that is about 5.5x faster than the original version (2023/6/1).
	- Memory usage has been optimized too, so batch_size can now be set up to 256 on A100!

	Note that the code has been fully updated as well, so you need to use the new API; see `Uses` below.


	## Model Card for lyraChatGLM

	lyraChatGLM is currently the fastest ChatGLM-6B available. To the best of our knowledge, it is the first accelerated version of ChatGLM-6B.

	The inference speed of lyraChatGLM is now about 300x that of the early original version. We are still working hard to further improve performance.

	Among its main features are:

	- weights: original ChatGLM-6B weights released by THUDM.
	- device: NVIDIA GPU with Ampere or Volta architecture (A100, A10, V100...).
	- batch_size: compiled with dynamic batch size, maximum depends on device.
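The device requirement can be turned into a small lookup for the `arch` string that the API expects. This is a minimal sketch, not part of lyraChatGLM: the helper `arch_for_gpu` and the table `_ARCH_BY_GPU` are hypothetical names, and the mapping covers only the GPU models this README names.

```python
# Hypothetical helper (not part of lyraChatGLM): map a GPU model name to the
# `arch` string. Only the GPU models named in this README are covered.
_ARCH_BY_GPU = {
    "A100": "Ampere",
    "A10": "Ampere",
    "V100": "Volta",
}


def arch_for_gpu(device_name: str) -> str:
    """Return "Ampere" or "Volta" for a name such as "NVIDIA A100-SXM4-40GB"."""
    upper = device_name.upper()
    # Try longer model names first so "A100" is matched before "A10".
    for model in sorted(_ARCH_BY_GPU, key=len, reverse=True):
        if model in upper:
            return _ARCH_BY_GPU[model]
    raise ValueError(
        f"No arch mapping for {device_name!r}; known models: {sorted(_ARCH_BY_GPU)}"
    )
```

If PyTorch is installed, the device name could be obtained with `torch.cuda.get_device_name(0)` and passed to this helper. Unknown devices raise an error instead of guessing an architecture.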

	## Speed

	- original version (fixed batch infer): commit id 1d240ba

	### test on A100 40G

	|version|max_batch_size|max_speed|
	|:-:|:-:|:-:|
	|original|1|30 tokens/s|
	|original (fixed batch infer)|192|1638.52 tokens/s|
	|lyraChatGLM (current)|256|9082.60+ tokens/s|

	### test on V100

	|version|max_batch_size|max_speed|
	|:-:|:-:|:-:|
	|original|1|17.83 tokens/s|
	|original (fixed batch infer)|128|992.20 tokens/s|
	|lyraChatGLM (current)|192|3911.45+ tokens/s|
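The speedup figures quoted in this README can be checked directly from the tables above. The sketch below only divides the reported peak speeds. The dictionary keys (`original`, `fixed_batch`, `current`) are labels chosen here for illustration. Against the fixed-batch original, the ratios come out to roughly 5.5x on A100 and 3.9x on V100. Against the batch-size-1 original, they are roughly 300x and 220x.

```python
# Peak throughput in tokens/s, copied from the speed tables in this README.
SPEEDS = {
    "A100 40G": {"original": 30.0, "fixed_batch": 1638.52, "current": 9082.60},
    "V100": {"original": 17.83, "fixed_batch": 992.20, "current": 3911.45},
}


def speedups(speeds: dict) -> dict:
    """Ratio of the current max_speed to each earlier version, per device."""
    return {
        device: {
            "vs_original": row["current"] / row["original"],
            "vs_fixed_batch": row["current"] / row["fixed_batch"],
        }
        for device, row in speeds.items()
    }


if __name__ == "__main__":
    for device, ratios in speedups(SPEEDS).items():
        print(
            f"{device}: {ratios['vs_original']:.1f}x vs original, "
            f"{ratios['vs_fixed_batch']:.2f}x vs fixed batch"
        )
```

Note that these ratios compare peak speeds measured at different maximum batch sizes, so they describe best-case throughput and not per-request latency.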

	## Model Sources

	- Repository: https://huggingface.co/THUDM/chatglm-6b

	## Docker Environment

	- A docker image is available at https://hub.docker.com/repository/docker/bigmoyan/lyrallm/general. Pull the image with:

	```
	docker pull bigmoyan/lyrallm:v0.1
	```

	## Uses

	```python
	from lyraChatGLM import LyraChatGLM6B

	model_path = "./models/1-gpu-fp16.h5"
	tokenizer_path = "./models"
	data_type = "fp16"
	int8_mode = 0
	max_output_length = 150
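	arch = "Ampere"  # "Ampere" or "Volta", matching your GPU

	model = LyraChatGLM6B(model_path, tokenizer_path, data_type, int8_mode, arch)
	# Prompt (Chinese): "List three different machine learning algorithms and describe where each applies."
	prompt = "列出3个不同的机器学习算法，并说明它们的适用范围."
	# Not used below; to fill a whole batch you could repeat the prompt,
	# e.g. prompts = [prompt] * test_batch_size
	test_batch_size = 256

	prompts = [prompt, ]

	# Set do_sample=True if you want different outputs within the same batch.
	output_texts = model.generate(
	    prompts,
	    output_length=max_output_length,
	    top_k=30,
	    top_p=0.85,
	    temperature=0.35,
	    repetition_penalty=1.2,
	    do_sample=False,
	)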

	print(output_texts)

	```
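When there are more prompts than the device's maximum batch size, they have to be split into batches. The following is a minimal sketch under stated assumptions. The helpers `chunk`, `generate_in_batches`, and `tokens_per_second` are hypothetical and not part of lyraChatGLM. The `generate` argument stands in for a call such as `lambda batch: model.generate(batch, output_length=150)`. The throughput formula (sequences times output length, divided by elapsed time) is an assumption, since this README does not state how its tokens/s figures were measured.

```python
import time
from typing import Callable, List, Sequence


def chunk(prompts: Sequence[str], max_batch_size: int) -> List[List[str]]:
    """Split prompts into consecutive batches of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be >= 1")
    return [
        list(prompts[i:i + max_batch_size])
        for i in range(0, len(prompts), max_batch_size)
    ]


def generate_in_batches(
    generate: Callable[[List[str]], List[str]],
    prompts: Sequence[str],
    max_batch_size: int,
) -> List[str]:
    """Run `generate` batch by batch and return outputs in prompt order."""
    outputs: List[str] = []
    for batch in chunk(prompts, max_batch_size):
        outputs.extend(generate(batch))
    return outputs


def tokens_per_second(num_sequences: int, output_length: int, elapsed_s: float) -> float:
    """Assumed metric: generated tokens (sequences * output_length) per second."""
    return num_sequences * output_length / elapsed_s


if __name__ == "__main__":
    # Stub standing in for the real model call, so the sketch runs without a GPU.
    def fake_generate(batch: List[str]) -> List[str]:
        return [p.upper() for p in batch]

    prompts = [f"prompt {i}" for i in range(600)]
    start = time.perf_counter()
    outputs = generate_in_batches(fake_generate, prompts, max_batch_size=256)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    rate = tokens_per_second(len(outputs), 150, elapsed)
    print(f"{len(outputs)} outputs, {rate:.0f} tokens/s (stub, not a real benchmark)")
```

The batch size of 256 here mirrors the A100 maximum reported in the speed table; use a smaller cap on devices with less memory.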
	## Demo output

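	### input
	列出3个不同的机器学习算法，并说明它们的适用范围.

	(English: List three different machine learning algorithms and describe where each applies.)

	### output
	(Translated from the model's Chinese output.)

	Here are three common machine learning algorithms and where they apply:

	1. Decision Tree: A decision tree is a naive Bayes model based on classification and regression problems. It predicts results by building a series of progressively splitting branches. It suits cases with simple features, large amounts of data, and a dataset size within an acceptable range.

	2. Random Forest: A random forest is an ensemble learning algorithm composed of multiple decision trees. Its advantage is that it can handle large-scale data and high-dimensional features. It suits scenarios that require modeling multiple variables, such as medical diagnosis and financial risk assessment.

	3. Support Vector Machine: A support vector machine is a supervised learning method, typically used for classification problems. It can handle high-dimensional data and has relatively high accuracy. It suits problems that require classification or regression on high-dimensional data, such as image recognition and natural language processing.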

	## Citation
	``` bibtex
	@Misc{lyraChatGLM2023,
	author = {Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
	title = {lyraChatGLM: Accelerating ChatGLM by 5.5x+},
	howpublished = {\url{https://huggingface.co/TMElyralab/lyraChatGLM}},
	year = {2023}
	}
	```

	## Report bug
	- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraChatGLM/discussions
	- Mark bug reports with `[bug]` in the title.