	---
	license: mit
	language: en
	tags:
	- LLM
	- XVERSE-13B-Chat
	---
	## Model Card for lyraXVERSE

	We have collaborated with XVERSE to launch lyraXVERSE, currently the fastest XVERSE-13B implementation available. Its inference speed reaches 3900+ tokens/s on an A100, up to a 2.7x speedup over the Torch version.

	Among its main features are:
	- device: NVIDIA GPU with Ampere or Volta architecture (A10, A100 or higher, V100).
	- batch_size: compiled with dynamic batch size; the maximum depends on the device.
	- MEMOPT mode: significantly reduced VRAM usage and increased speed.
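Because the maximum batch size depends on the device, a caller with many prompts may need to split them into smaller batches first. The following is a minimal sketch under that assumption; the helper `chunk_prompts` and its `max_batch_size` parameter are hypothetical and not part of lyraXVERSE, and a suitable limit has to be found per device.

```python
from typing import Iterator, List


def chunk_prompts(prompts: List[str], max_batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `prompts`, each at most `max_batch_size` long."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(prompts), max_batch_size):
        yield prompts[start:start + max_batch_size]
```

Each yielded list could then be passed to a single batched generation call.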

	We use the XVERSE-13B-Chat model for measurement, but this optimized inference also applies to the XVERSE-13B model.

	## Speed

	* Measured in tokens/s
	* Tested on an A100 40G
	* MEMOPT mode
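As a rough illustration of how a tokens/s figure can be obtained, the sketch below times one batched generation call and divides the number of generated tokens by the elapsed time. It assumes that every sequence in the batch produces exactly `output_length` tokens and that throughput is summed over the whole batch. The helper name `measure_throughput` and its injectable `clock` argument are hypothetical; the benchmarking procedure actually used for this model card is not described here.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(
    generate_fn: Callable[..., Sequence[str]],
    prompts: List[str],
    output_length: int,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return generated tokens per second for one batched call.

    Assumption: each of the len(prompts) sequences yields exactly
    `output_length` new tokens, so the batch produces
    len(prompts) * output_length tokens in total.
    """
    start = clock()
    generate_fn(prompts, output_length=output_length)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return len(prompts) * output_length / elapsed
```

With a loaded model this might be called as `measure_throughput(model.generate, [prompt] * 8, 512)`. On a GPU it is usually worth running one untimed warm-up call first so that one-time initialization does not distort the result.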

	### XVERSE-13B-Chat

	| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
	| --- | --- | --- | --- | --- | --- |
	| Torch | 34.8 | 249.2 | 470.1 | 878.6 | 1478.9 |
	| lyraXVERSE | 96.6 | 725.5 | 1359.3 | 2415.6 | 3923.2 |
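The speedup at each batch size can be read off the table by dividing the lyraXVERSE throughput by the Torch throughput. The short script below does that arithmetic with the numbers copied from the table; the variable names are chosen for illustration only.

```python
# Tokens/s figures copied from the table above (A100 40G, MEMOPT mode).
batch_sizes = [1, 8, 16, 32, 64]
torch_tps = [34.8, 249.2, 470.1, 878.6, 1478.9]
lyra_tps = [96.6, 725.5, 1359.3, 2415.6, 3923.2]

# Speedup = lyraXVERSE throughput / Torch throughput, rounded to 2 decimals.
speedups = {
    bs: round(lyra / base, 2)
    for bs, base, lyra in zip(batch_sizes, torch_tps, lyra_tps)
}

for bs, ratio in speedups.items():
    print(f"batch size {bs:>2}: {ratio:.2f}x")
```

With these figures the ratio stays between roughly 2.65x and 2.91x across the measured batch sizes.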

	## Docker Environment Recommendation

	- For CUDA 11.x: we recommend `nvcr.io/nvidia/pytorch:22.12-py3`
	- For CUDA 12.0: we recommend `nvcr.io/nvidia/pytorch:23.02-py3`

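	```bash
	docker pull nvcr.io/nvidia/pytorch:23.02-py3
	# Mount the current directory (the cloned repository) at /lyraXVERSE in the container
	docker run --rm -it --gpus all -v "$(pwd)":/lyraXVERSE nvcr.io/nvidia/pytorch:23.02-py3

	# Inside the container
	cd /lyraXVERSE
	pip install -r requirements.txt
	python demo.py
	```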

	## Uses

	```python
	from lyra_xverse import lyraXVERSE

	model_path = "./models/"
	tokenizer_path = "./models/"
	inference_dtype = 'fp16'
	prompt = "讲个故事:"  # "Tell a story:"
	memopt_mode = 1
	max_output_length = 512
	arch = "Ampere"  # Ampere or Volta
	cuda_version = 12  # CUDA major version; 11 and 12 are currently supported

	model = lyraXVERSE(model_path,
	                   tokenizer_path=tokenizer_path,
	                   dtype=inference_dtype,
	                   memopt_mode=memopt_mode,
	                   arch=arch,
	                   cuda_version=cuda_version)

	bs = 1
	prompts = [prompt, ] * bs
	output_texts = model.generate(
	    prompts, output_length=max_output_length,
	    top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0, do_sample=False)

	print(output_texts)
	```
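The `arch` and `cuda_version` arguments are hard-coded in the example above. As a sketch, they could instead be derived from the GPU's compute capability and the CUDA version string. The helpers `arch_from_capability` and `cuda_major` below are hypothetical and not part of lyraXVERSE. The mapping assumes compute capability 7.0 for Volta (V100) and 8.0 or 8.6 for Ampere (A100, A10); whether lyraXVERSE accepts other architectures is not stated here, so anything else is rejected.

```python
from typing import Tuple


def arch_from_capability(capability: Tuple[int, int]) -> str:
    """Map a CUDA compute capability to the `arch` string used above."""
    major, minor = capability
    if (major, minor) == (7, 0):
        return "Volta"  # e.g. V100
    if major == 8 and minor in (0, 6):
        return "Ampere"  # e.g. A100 (8.0), A10 (8.6)
    raise ValueError(f"unsupported compute capability: {major}.{minor}")


def cuda_major(version: str) -> int:
    """Extract the major CUDA version (11 or 12) from a string such as '12.0'."""
    major = int(version.split(".")[0])
    if major not in (11, 12):
        raise ValueError(f"unsupported CUDA version: {version}")
    return major
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple and `torch.version.cuda` holds a string such as `"12.0"` (or `None` on CPU-only builds), so the two values could be fed into these helpers.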

	## Demo Outputs

	### XVERSE-13B-Chat
	#### input

	讲个故事: ("Tell a story:")

	#### output

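	(Translated from the model's Chinese output.)

	One day, a young painter came to a remote village. With his extraordinary skill, he painted a beautiful picture for the villagers. In it, the village was surrounded by lush green forest, a clear stream flowed through it, the villagers were at work, and children played in the fields. Looking at the painting, the villagers were full of praise for the painter.

	The village leader saw the painting and thought: "This painting will make our village more beautiful; we should let the villagers know about it." So he took the painter to every corner of the village so that every villager could see the painting.

	Watching the look in the villagers' eyes as they viewed the painting, the painter realized his own worth. He realized that he was not only a painter but also someone who could help people see hope. His painting was not merely a work of art; it was a bridge connecting people with hope.

	This story tells us that a painter's value lies not only in painting skill, but in the emotion and hope their work brings to people. A painter's value does not depend on how expensive or how unique their paintings are, but on whether they can use their work to open people's hearts and let them see hope and the beauty of life.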

	## Citation
	```bibtex
	@Misc{lyraXVERSE2023,
	  author = {Haoxiong Su and Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
	  title = {lyraXVERSE: Accelerating XVERSE-13B-Chat(fp16) to 3000+ tokens/s},
	  howpublished = {\url{https://huggingface.co/TMElyralab/lyraXVERSE}},
	  year = {2023}
	}
	```

	## Report bugs
	- Start a discussion at https://huggingface.co/TMElyralab/lyraXVERSE to report any bugs.
	- Mark bug reports with `[bug]` in the title.