---
license: apache-2.0
inference: true
tags:
- vicuna
---

![demo](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_08.png)

**This model is an 8-bit quantization of Vicuna 13B.**

- 13B parameters
- Group size: 128
- wbits: 8
- true-sequential: yes
- act-order: yes
- 8-bit quantized
- Read more about this here: https://github.com/ggerganov/llama.cpp/pull/951
- Conversion process: Llama 13B -> Llama 13B HF -> Vicuna13B-v1.1 HF -> Vicuna13B-v1.1-8bit-128g
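For reference, a quantization command along these lines would reproduce the settings above with qwopqwop200/GPTQ-for-LLaMa's `llama.py`. This is a sketch, not the exact command used for this model: the input path, calibration dataset (`c4`), and output filename are assumptions.

```
# Sketch only — input path, calibration dataset and output name are assumptions:
python llama.py ./Vicuna13B-v1.1-HF c4 \
  --wbits 8 \
  --groupsize 128 \
  --true-sequential \
  --act-order \
  --save vicuna13b-v1.1-8bit-128g.pt
```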

# Basic installation procedure

- It was a nightmare; I will only briefly detail what you'll need. WSL was quite painful to sort out.
- I will not provide installation support, sorry.
- You can certainly use llama.cpp or other loaders that support 8-bit quantization; I just chose oobabooga/text-generation-webui.
- You will likely face many bugs until text-generation-webui loads, ranging from missing PATH or environment variables to having to manually pip uninstall/install packages.
- The notes below will likely become outdated once both text-generation-webui and GPTQ-for-LLaMa receive the appropriate bug fixes.
- If this model produces very slow answers (1 token/s), you are not using CUDA for bitsandbytes or your hardware needs an upgrade.
- If this model produces answers with weird characters, you are not using the correct version of qwopqwop200/GPTQ-for-LLaMa, as mentioned below.
- If this model produces answers that are off topic or if it talks to itself, you are not using the correct checkout 508de42 of qwopqwop200/GPTQ-for-LLaMa, as mentioned below.

Cuda (Slow tokens/s):

```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda # Make sure you obtain the qwopqwop200 version, not the oobabooga one! (because "act-order: yes")
cd GPTQ-for-LLaMa
pip install -r requirements.txt
python setup_cuda.py install
```

Triton (Fast tokens/s) - Works on Windows with WSL (what I've used) or Linux:

```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
git fetch origin pull/1229/head:triton # This is the version that supports Triton - https://github.com/oobabooga/text-generation-webui/pull/1229
git checkout triton
pip install -r requirements.txt
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git # -b cuda
cd GPTQ-for-LLaMa
git checkout 508de42 # Before qwopqwop200 broke everything... - https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/183
pip install -r requirements.txt
```
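With either setup in place, you can point text-generation-webui at the quantized model using its standard GPTQ flags. A minimal launch sketch, assuming the weights were placed in a folder under `models/` (the folder name below is an assumption, adjust it to match yours):

```
# Folder name under models/ is an assumption — adjust to wherever you placed the weights:
python server.py --model Vicuna13B-v1.1-8bit-128g --wbits 8 --groupsize 128 --model_type llama --chat
```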

# Testbench detail and results

- Latest version of oobabooga + https://github.com/oobabooga/text-generation-webui/pull/1229
- NVIDIA RTX 3090
- 32GB DDR4
- i9-7980XE OC @4.6GHz
- 11 tokens/s on average with Triton
- Preliminary observations: better results than --load-in-8bit (To Be Confirmed) - see the comparison sketch after the screenshots
- Tested and working in both chat mode and text generation mode

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_01.png)
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_02.png)
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_03.png)
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_04.png)
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_05.png)
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_06.png)
![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_07.png)
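For the `--load-in-8bit` comparison above, bitsandbytes loads the unquantized HF model rather than the GPTQ file, so that launch would look like this (a sketch; the folder name is an assumption):

```
# bitsandbytes 8-bit loading of the fp16 HF model, for comparison with the GPTQ build above:
python server.py --model Vicuna13B-v1.1-HF --load-in-8bit --chat
```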

# Vicuna Model Card

## Model details

**Model type:**
Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. It is an auto-regressive language model, based on the transformer architecture.

**Model date:**
Vicuna was trained between March 2023 and April 2023.

**Organizations developing the model:**
The Vicuna team with members from UC Berkeley, CMU, Stanford, and UC San Diego.

**Paper or resources for more information:**
https://vicuna.lmsys.org/

**License:**
Apache License 2.0

**Where to send questions or comments about the model:**
https://github.com/lm-sys/FastChat/issues

## Intended use

**Primary intended uses:**
The primary use of Vicuna is research on large language models and chatbots.

**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

## Training dataset

70K conversations collected from ShareGPT.com.

## Evaluation dataset

A preliminary evaluation of the model quality is conducted by creating a set of 80 diverse questions and utilizing GPT-4 to judge the model outputs. See https://vicuna.lmsys.org/ for more details.

## Major updates of weights v1.1

- Refactor the tokenization and separator. In Vicuna v1.1, the separator has been changed from `"###"` to the EOS token `"</s>"`. This change makes it easier to determine the generation stop criteria and enables better compatibility with other libraries.
- Fix the supervised fine-tuning loss computation for better model quality.
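For illustration, here is what a conversation looks like when assembled with FastChat's default Vicuna v1.1 template (the system prompt wording is FastChat's default, quoted here from memory, so treat it as an assumption); note the `</s>` separator closing each assistant turn:

```
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hello! ASSISTANT: Hi! How can I help you today?</s>USER: What is 2+2? ASSISTANT:
```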