---
license: apache-2.0
inference: true
tags:
- vicuna
---

![demo](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_08.png)

**This model is an 8-bit quantization of Vicuna 13B.**

- 13B parameters
- Group size: 128
- wbits: 8
- true-sequential: yes
- act-order: yes
- 8-bit quantized
- Read more about this here: https://github.com/ggerganov/llama.cpp/pull/951
- Conversion process: Llama 13B -> Llama 13B HF -> Vicuna13B-v1.1 HF -> Vicuna13B-v1.1-8bit-128g
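For reference, a quantization command along these lines would reproduce the settings above with qwopqwop200/GPTQ-for-LLaMa's `llama.py`. This is a sketch, not the exact command used for this model: the input path, calibration dataset (`c4`), and output filename are assumptions.

```
# Sketch only — input path, calibration dataset and output name are assumptions:
python llama.py ./Vicuna13B-v1.1-HF c4 \
  --wbits 8 \
  --groupsize 128 \
  --true-sequential \
  --act-order \
  --save vicuna13b-v1.1-8bit-128g.pt
```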

# Basic installation procedure

- It was a nightmare; I will only briefly detail what you'll need. WSL was quite painful to sort out.
- I will not provide installation support, sorry.
- You can certainly use llama.cpp or other loaders that support 8-bit quantization; I just chose oobabooga/text-generation-webui.
- You will likely face many bugs until text-generation-webui loads, ranging from missing PATH or environment variables to having to manually pip uninstall/install packages.
- The notes below will likely become outdated once both text-generation-webui and GPTQ-for-LLaMa receive the appropriate bug fixes.
- If this model produces very slow answers (1 token/s), you are not using CUDA for bitsandbytes or your hardware needs an upgrade.
- If this model produces answers with weird characters, you are not using the correct version of qwopqwop200/GPTQ-for-LLaMa, as mentioned below.
- If this model produces answers that are off topic or if it talks to itself, you are not using the correct checkout 508de42 of qwopqwop200/GPTQ-for-LLaMa, as mentioned below.

Cuda (Slow tokens/s):

```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda # Make sure you obtain the qwopqwop200 version, not the oobabooga one! (because "act-order: yes")
cd GPTQ-for-LLaMa
pip install -r requirements.txt
python setup_cuda.py install
```

Triton (Fast tokens/s) - Works on Windows with WSL (what I've used) or Linux:

```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
git fetch origin pull/1229/head:triton # This is the version that supports Triton - https://github.com/oobabooga/text-generation-webui/pull/1229
git checkout triton
pip install -r requirements.txt
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git # -b cuda
cd GPTQ-for-LLaMa
git checkout 508de42 # Before qwopqwop200 broke everything... - https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/183
pip install -r requirements.txt
```
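With either setup in place, you can point text-generation-webui at the quantized model using its standard GPTQ flags. A minimal launch sketch, assuming the weights were placed in a folder under `models/` (the folder name below is an assumption, adjust it to match yours):

```
# Folder name under models/ is an assumption — adjust to wherever you placed the weights:
python server.py --model Vicuna13B-v1.1-8bit-128g --wbits 8 --groupsize 128 --model_type llama --chat
```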

# Testbench detail and results

- Latest version of oobabooga + https://github.com/oobabooga/text-generation-webui/pull/1229
- NVIDIA RTX 3090
- 32GB DDR4
- i9-7980XE OC @4.6GHz
- 11 tokens/s on average with Triton
- Preliminary observations: better results than --load-in-8bit (To Be Confirmed) - see the comparison sketch after the screenshots
- Tested and working in both chat mode and text generation mode

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_01.png)
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_02.png)
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_03.png)
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_04.png)
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_05.png)
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_06.png)
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_07.png)
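For the `--load-in-8bit` comparison above, bitsandbytes loads the unquantized HF model rather than the GPTQ file, so that launch would look like this (a sketch; the folder name is an assumption):

```
# bitsandbytes 8-bit loading of the fp16 HF model, for comparison with the GPTQ build above:
python server.py --model Vicuna13B-v1.1-HF --load-in-8bit --chat
```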

# Vicuna Model Card

## Model details

**Model type:**
Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. It is an auto-regressive language model, based on the transformer architecture.

**Model date:**
Vicuna was trained between March 2023 and April 2023.

**Organizations developing the model:**
The Vicuna team with members from UC Berkeley, CMU, Stanford, and UC San Diego.

**Paper or resources for more information:**
https://vicuna.lmsys.org/

**License:**
Apache License 2.0

**Where to send questions or comments about the model:**
https://github.com/lm-sys/FastChat/issues

## Intended use

**Primary intended uses:**
The primary use of Vicuna is research on large language models and chatbots.

**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

## Training dataset

70K conversations collected from ShareGPT.com.

## Evaluation dataset

A preliminary evaluation of the model quality is conducted by creating a set of 80 diverse questions and utilizing GPT-4 to judge the model outputs. See https://vicuna.lmsys.org/ for more details.

## Major updates of weights v1.1

- Refactor the tokenization and separator. In Vicuna v1.1, the separator has been changed from `"###"` to the EOS token `"</s>"`. This change makes it easier to determine the generation stop criteria and enables better compatibility with other libraries.
- Fix the supervised fine-tuning loss computation for better model quality.
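For illustration, here is what a conversation looks like when assembled with FastChat's default Vicuna v1.1 template (the system prompt wording is FastChat's default, quoted here from memory, so treat it as an assumption); note the `</s>` separator closing each assistant turn:

```
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hello! ASSISTANT: Hi! How can I help you today?</s>USER: What is 2+2? ASSISTANT:
```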