---
license: mit
language: en
tags:
- LLM
- LLaMA-13b
---

## Special Notes

Due to LLaMA's license restrictions, **we are not allowed to release the accelerated parameters directly**. We hope to discuss with the community to find **a legal way** to share lyraLLaMA. If you have any suggestions, please feel free to drop us a line at benbinwu@tencent.com.

## Model Card for lyraLLaMA

lyraLLaMA is currently the **fastest LLaMA-13b** implementation available. It reaches an inference speed of **3000+ tokens/s** on an A100, up to **6x** faster than the PyTorch version.

Among its main features are:

- device: Nvidia GPU with Ampere architecture (A100 or higher) or Volta architecture (V100).
- batch_size: compiled with dynamic batch size; the maximum depends on the device.

We use the LLaMA-13B model for measurement, but this optimized inference is applicable to LLaMA models of other sizes as well.

## Speed

* Evaluated in tokens/s
* Tested on an A100 40G
* fp16 and MEMOPT precision

### LLaMA-Ziya-13B

| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch LLaMA | 31.74 | 289.2 | 521.37 | 775.69 | OOM |
| lyraLLaMA fp16 | 73.2 | 565.6 | 1179.59 | 1795.63 | 3061.27 |
| lyraLLaMA MEMOPT | 104 | 770.5 | 1389.9 | 2390.4 | 3782.1 |

### LLaMA-Vicuna-13B

| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch LLaMA | 24.65 | 167.3 | 322.97 | 407.99 | OOM |
| lyraLLaMA fp16 | 53.67 | 421.38 | 804.31 | 1519.28 | 2679.82 |
| lyraLLaMA MEMOPT | 79.81 | 603.15 | 1117.27 | 1966.52 | 3200.32 |

## Docker Environment Recommendation

- For CUDA 11.X: we recommend `nvcr.io/nvidia/pytorch:22.12-py3`
- For CUDA 12.0: we recommend `nvcr.io/nvidia/pytorch:23.02-py3`

```bash
docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v ./:/lyraLLaMA nvcr.io/nvidia/pytorch:23.02-py3

pip install -r requirements.txt
python demo.py
```

## Uses

```python
from lyra_llama import lyraLLaMA

model_path = "./models/lamma-13b-1-gpu-fp16.bin"
tokenizer_path = "./models/"
dtype = 'fp16'
# Prompt (Chinese): "It is about 25 degrees today, with light rain and some wind.
# I want to take a walk outdoors; what clothes, trousers, and shoes should I wear?"
prompt = "今天天气大概 25度,有点小雨,吹着风,我想去户外散步,应该穿什么样的衣服 裤子鞋子搭配"
max_output_length = 512
memopt_mode = 0  # To use MEMOPT mode, set memopt_mode=1

model = lyraLLaMA(model_path, tokenizer_path, dtype, memopt_mode)

# Wrap the prompt in the Ziya conversation template.
prompt = '<human>:' + prompt.strip() + '\n<bot>:'

bs = 1
prompts = [prompt, ] * bs
output_texts = model.generate(
    prompts, output_length=max_output_length,
    top_k=30, top_p=0.85, temperature=1.0,
    repetition_penalty=1.0, do_sample=False)

print(output_texts)
```

## Demo Outputs

### LLaMA-Ziya-13B

#### input

今天天气大概 25度,有点小雨,吹着风,我想去户外散步,应该穿什么样的衣服 裤子鞋子搭配

(Translation: "It is about 25 degrees today, with light rain and some wind. I want to take a walk outdoors; what clothes, trousers, and shoes should I wear?")

#### output

(Translated from Chinese.) It is about 25 degrees today, with light rain and wind. I suggest wearing light sportswear, for example:

- Top: choose a breathable sports T-shirt or short-sleeved athletic shirt; darker colors help avoid sun damage.
- Trousers: sports or casual trousers, ideally in a moisture-wicking material so the walk stays comfortable.
- Shoes: sneakers or light casual shoes that are easy to walk in, with good anti-slip soles to prevent falls.
- Outer layer: if the rain gets heavier, add a light raincoat or rain jacket to keep dry.

In short, wear light, comfortable sportswear with suitable shoes and an outer layer for your outdoor walk.

### LLaMA-Vicuna-13B

#### input

List 3 different machine learning algorithms and explain where they are applicable.

#### output

1. Linear Regression: Linear regression is a simple, yet powerful algorithm that is used to predict the value of a continuous variable based on one or more input features. It is commonly used in prediction and forecasting tasks, such as predicting the price of a stock or the sales of a product.
2. Decision Trees: Decision Trees are a type of supervised learning algorithm that can be used for both classification and regression tasks. They work by partitioning the feature space into smaller subspaces, with each subspace corresponding to a leaf node in the tree. Decision Trees are commonly used in applications such as credit risk assessment and customer segmentation.
3. Support Vector Machines (SVMs): SVMs are a type of supervised learning algorithm that can be used for both classification and regression tasks. They work by finding the best hyperplane that separates the data into different classes.
SVMs are commonly used in applications such as image classification and natural language processing.

## TODO

1. Support for int4
2. Inference for longer-context situations
3. Streaming inference mode

## Citation

```bibtex
@Misc{lyraLLaMA2023,
  author =       {Kangjian Wu, Zhengtao Wang, Yibo Lu, Bin Wu},
  title =        {lyraLLaMA: Accelerating LLaMA-13b(fp16) to 3000+ tokens/s},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraLLaMA}},
  year =         {2023}
}
```

## Report bug

- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraLLaMA/discussions
- Report bugs with a `[bug]` mark in the title.
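## Appendix: measuring tokens/s

The Speed tables above report throughput in generated tokens per second. As a rough, hypothetical sketch of how such a figure can be computed (the `measure_throughput` helper and `fake_generate` stand-in below are illustrative only and not part of the lyraLLaMA API; real measurements should call `model.generate` and count the actually generated tokens):

```python
import time

def measure_throughput(generate_fn, prompts, output_length):
    """Generated tokens per second for one batch: total tokens / wall time."""
    start = time.perf_counter()
    outputs = generate_fn(prompts, output_length)  # list of token-id lists
    elapsed = time.perf_counter() - start
    return sum(len(ids) for ids in outputs) / elapsed

# Toy stand-in for model.generate(): emits `output_length` token ids per prompt.
def fake_generate(prompts, output_length):
    return [[0] * output_length for _ in prompts]

print(measure_throughput(fake_generate, ["prompt"] * 8, output_length=64))
```

Batch size matters here: as the tables show, throughput scales with batch size until the device runs out of memory, so tokens/s should always be quoted together with the batch size used.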