---
license: mit
language: en
tags:
- LLM
- Baichuan-7B
- Baichuan-13B-base
- Baichuan-13B-chat
- Baichuan2-7B-base
- Baichuan2-13B-base
- Baichuan2-7B-chat
- Baichuan2-13B-chat
---
## Model Card for lyraBaichuan
lyraBaichuan is currently the **fastest available** version of the Baichuan models (Baichuan-7B, Baichuan-13B, Baichuan2-7B, Baichuan2-13B), reaching up to **4300+ tokens/s** on an A100 and up to **2.4x** speedup over the torch version.

Among its main features are:

- device: NVIDIA GPUs with the Ampere architecture (A100 or newer) or the Volta architecture (V100).
- batch_size: compiled with a dynamic batch size; the maximum depends on the device.
- MEMOPT mode: significantly reduced VRAM usage and increased speed.

We use the Baichuan2-7B-Base and Baichuan2-13B-Base models for measurement, but the optimized inference also applies to other Baichuan models, including Baichuan-7B and Baichuan-13B.
## Speed
* Evaluated in tokens/s
* Tested on an A100 40GB
* MEMOPT mode
### Baichuan2-7B-Base
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 41.2 | 323.2 | 640.0 | 1256.8 | 2231.0 |
| lyraBaichuan | 125.9 | 948.1 | 1749.3 | 2974.0 | 4370.1 |
### Baichuan2-13B-Base
| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 40.9 | 307.9 | 555.6 | 1010.4 | 1601.0 |
| lyraBaichuan | 80.0 | 568.2 | 1124.4 | 1942.6 | 2828.0 |
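
In the tables above, tokens/s counts the tokens generated across the whole batch per second of wall-clock time. The exact measurement script is not included in this card; the sketch below shows one way such numbers could be reproduced, assuming the `lyra_baichuan` API from the Uses section and that `generate` emits `output_length` tokens per prompt.

```python
import time
from lyra_baichuan import lyraBaichuan13B

# Same configuration as in the Uses section below.
model = lyraBaichuan13B("./models/Baichuan2-13B-lyra",
                        tokenizer_path="./models/Baichuan2-13B-lyra",
                        dtype="fp16", memopt_mode=1,
                        arch="Ampere", cuda_version=12)

batch_size, output_length = 64, 64
prompts = ["登鹳雀楼->王之涣\n夜雨寄北->"] * batch_size

start = time.time()
model.generate(prompts, output_length=output_length,
               top_k=30, top_p=0.85, temperature=1.0,
               repetition_penalty=1.0, do_sample=False)
elapsed = time.time() - start

# Throughput = generated tokens across the batch / wall-clock seconds.
print(f"{batch_size * output_length / elapsed:.1f} tokens/s")
```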
## Docker Environment Recommendation
- For CUDA 11.x: we recommend ```nvcr.io/nvidia/pytorch:22.12-py3```
- For CUDA 12.0: we recommend ```nvcr.io/nvidia/pytorch:23.02-py3```
```bash
docker pull nvcr.io/nvidia/pytorch:23.02-py3
# Mount the repository into the container (bind mounts need an absolute path)
docker run --rm -it --gpus all -v $(pwd):/lyraBaichuan nvcr.io/nvidia/pytorch:23.02-py3

# Inside the container:
cd /lyraBaichuan
pip install -r requirements.txt
python demo.py
```
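
Once inside the container, it can help to confirm that PyTorch sees the GPUs before installing the requirements. A minimal check (torch ships preinstalled in the NGC images):

```python
import torch

# Confirms that the GPUs passed through with `--gpus all` are visible.
print(torch.cuda.is_available())      # expected: True
print(torch.cuda.device_count())      # number of visible GPUs
print(torch.cuda.get_device_name(0))  # e.g. an A100 or V100
```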
## Uses
```python
from lyra_baichuan import lyraBaichuan7B, lyraBaichuan13B

model_path = "./models/Baichuan2-13B-lyra"
tokenizer_path = "./models/Baichuan2-13B-lyra"
inference_dtype = 'fp16'
# Few-shot prompt of the form "poem title -> poet"; the model is expected to
# complete the author of 夜雨寄北.
prompt = "登鹳雀楼->王之涣\n夜雨寄北->"

memopt_mode = 1
max_output_length = 64
arch = "Ampere"     # Ampere or Volta
cuda_version = 12   # CUDA version, we currently support 11 and 12

# To use a 7B model, initialize with lyraBaichuan7B instead.
model = lyraBaichuan13B(model_path,
                        tokenizer_path=tokenizer_path,
                        dtype=inference_dtype,
                        memopt_mode=memopt_mode,
                        arch=arch,
                        cuda_version=cuda_version)

bs = 1
prompts = [prompt] * bs
output_texts = model.generate(
    prompts, output_length=max_output_length,
    top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0, do_sample=False)

print(output_texts)
```
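
The `arch` argument has to match the GPU generation. If you prefer not to hard-code it, the compute capability reported by PyTorch can be mapped onto the two supported values; a small sketch (the helper name and the mapping to the "Ampere"/"Volta" strings are assumptions based on the example above):

```python
import torch

def detect_arch() -> str:
    # Compute capability 8.x and above covers Ampere-class GPUs such as the A100;
    # 7.0 is Volta (V100). Other GPUs are not covered by this card.
    major, minor = torch.cuda.get_device_capability(0)
    if major >= 8:
        return "Ampere"
    if (major, minor) == (7, 0):
        return "Volta"
    raise RuntimeError(f"Unsupported compute capability {major}.{minor}")

arch = detect_arch()
```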
## Demo Outputs
### Baichuan2-13B-Base
#### Input
```
登鹳雀楼->王之涣
夜雨寄北->
```
#### Output
```
李商隐
望洞庭->刘禹锡
黄鹤楼送孟浩然之广陵->李白
登岳阳楼->杜甫
秋词->刘禹锡
枫桥夜泊->张继
饮湖上初晴后雨->苏轼
浪淘沙->刘禹锡
```
## TODO
1. Support for int4 quantization
2. Inference for longer-context scenarios
3. Streaming inference mode
## Citation
```bibtex
@Misc{lyraBaichuan2023,
  author =       {Haoxiong Su and Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
  title =        {lyraBaichuan: Accelerating Baichuan models to 4300+ tokens/s},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraBaichuan}},
  year =         {2023}
}
```
## Report Bugs
- Start a discussion at https://huggingface.co/TMElyralab/lyraBaichuan to report any bugs.
- Include a `[bug]` mark in the title.