---
license: mit
language: en
tags:
- LLM
- XVERSE-13B-Chat
---

## Model Card for lyraXVERSE

We have collaborated with XVERSE and launched lyraXVERSE, currently the **fastest XVERSE-13b** available.

The inference speed of lyraXVERSE reaches up to **3900+ tokens/s** on A100, up to a **2.7x** speedup over the Torch version.

Among its main features are:
- device: Nvidia GPU with Ampere or Volta architecture (A10, A100 or higher, V100).
- batch_size: compiled with dynamic batch size; the maximum depends on the device.
- MEMOPT mode: significantly reduced VRAM usage and increased speed.

We use the XVERSE-13B-Chat model for measurement, but this optimized inference also applies to the XVERSE-13B model.

## Speed
* Measured in tokens/s
* Tested on A100 40G
* MEMOPT mode

### XVERSE-13B-Chat

| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch | 34.8 | 249.2 | 470.1 | 878.6 | 1478.9 |
| lyraXVERSE | 96.6 | 725.5 | 1359.3 | 2415.6 | 3923.2 |

## Docker Environment Recommendation

- For CUDA 11.X: we recommend `nvcr.io/nvidia/pytorch:22.12-py3`
- For CUDA 12.0: we recommend `nvcr.io/nvidia/pytorch:23.02-py3`

```bash
docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v ./:/lyraXVERSE nvcr.io/nvidia/pytorch:23.02-py3

# Inside the container, from the mounted /lyraXVERSE directory:
pip install -r requirements.txt
python demo.py
```

## Uses

```python
from lyra_xverse import lyraXVERSE

model_path = "./models/"
tokenizer_path = "./models/"
inference_dtype = 'fp16'
prompt = "讲个故事:"  # "Tell a story:"

memopt_mode = 1  # 1 enables MEMOPT mode
max_output_length = 512
arch = "Ampere"  # Ampere or Volta
cuda_version = 12  # CUDA version; 11 and 12 are currently supported

model = lyraXVERSE(model_path,
                   tokenizer_path = tokenizer_path,
                   dtype = inference_dtype,
                   memopt_mode = memopt_mode,
                   arch = arch,
                   cuda_version = cuda_version)

bs = 1
prompts = [prompt, ] * bs  # repeat the prompt to fill the batch
output_texts = model.generate(
    prompts, output_length=max_output_length,
    top_k=30, top_p=0.85, temperature=1.0,
    repetition_penalty=1.0, do_sample=False)

print(output_texts)
```

## Demo Outputs

### XVERSE-13B-Chat

#### input
讲个故事:

#### output
有一天,一位年轻的画家来到了一个偏远的村庄。他以其超凡的绘画技巧,为村民画了一幅美丽的图画。图画里,村庄的周围是翠绿的森林,清澈的溪流在其中流淌,村民们正在劳作,孩子们在田野里嬉戏。村民们看着这幅画,都对这位画家赞不绝口。

村庄的领袖看到了这幅画,他想:“这幅画将会让我们的村庄更加美丽,我们应该让村民们知道这幅画。”于是,他带着画家去村庄的各个角落,让每一个村民都看到了这幅画。

画家看着村民们看画的眼神,他意识到了自己的价值。他意识到,他不仅仅是一个画家,他也是一个能让人们看见希望的人。他的画不仅仅是艺术品,它是连接人们与希望的一座桥梁。

这个故事告诉我们,画家的价值不只是他们的绘画技巧,而是他们的画作带给人们的感动和希望。画家的价值并不在于他们的画有多么昂贵,有多么独特,而在于他们能用画作打开人们的心扉,让人们看见希望,看见生活的美好。

## Citation
```bibtex
@Misc{lyraXVERSE2023,
  author =       {Haoxiong Su and Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
  title =        {lyraXVERSE: Accelerating XVERSE-13B-Chat(fp16) to 3000+ tokens/s},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraXVERSE}},
  year =         {2023}
}
```

## Report bugs
- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraXVERSE
- Mark bug reports with `[bug]` in the title.
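
## Appendix: recomputing the speedup

The per-batch-size speedup of lyraXVERSE over the Torch version can be recomputed from the figures in the Speed table. The snippet below is a minimal sketch for readers who want to check the numbers themselves; the variable names are illustrative and not part of the `lyra_xverse` package, and the throughput values are copied by hand from the table.

```python
# Throughput in tokens/s, copied from the Speed table (A100 40G, MEMOPT mode).
batch_sizes = [1, 8, 16, 32, 64]
torch_tps = [34.8, 249.2, 470.1, 878.6, 1478.9]
lyra_tps = [96.6, 725.5, 1359.3, 2415.6, 3923.2]

# Speedup = lyraXVERSE throughput divided by Torch throughput at the same batch size.
speedups = {
    bs: lyra / base
    for bs, base, lyra in zip(batch_sizes, torch_tps, lyra_tps)
}

for bs, ratio in speedups.items():
    print(f"batch size {bs:>2}: {ratio:.2f}x")
```

Absolute throughput grows with batch size, while the ratio to the Torch baseline varies from one batch size to the next, so it is worth checking the ratio at the batch size you plan to serve.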