---
license: other
inference: true
tags:
- vicuna
---
![demo](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_08.png)

This is an 8bit GPTQ quantization (not to be confused with 8bit RTN) of Vicuna 13B v1.1 HF.

Q. Why quantize in 8bit instead of 4bit?
A. For evaluation purposes. In theory, an 8bit quantized model should give slightly better perplexity than a 4bit quantized version, although the difference may not be noticeable (to be evaluated). If you have more than 15GB of GPU VRAM available, you may want to try this out.
Note that quantization in 8bit does not mean loading the model in 8bit precision. Loading your model in 8bit precision (`--load-in-8bit`) comes with noticeable quality (perplexity) degradation.
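
As a rough sanity check on the VRAM figure above, the weight footprint can be estimated from the bit width and group size. This is only a back-of-the-envelope sketch: the per-group overhead (one 16-bit scale plus one zero point of `wbits` bits per group) is an assumption made for illustration, and real checkpoints also keep some tensors in fp16 and need extra memory for activations and cache.

```python
def quantized_weight_gib(n_params, wbits, group_size=128, scale_bits=16):
    """Rough size of group-quantized weights in GiB.

    Assumption for illustration: each group of `group_size` weights stores
    one scale (`scale_bits` bits) and one zero point (`wbits` bits).
    """
    bits_per_weight = wbits + (scale_bits + wbits) / group_size
    return n_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    n = 13e9  # nominal parameter count of a 13B model
    for wbits in (4, 8):
        print(f"{wbits}bit: ~{quantized_weight_gib(n, wbits):.1f} GiB of weights")
    print(f"fp16: ~{n * 16 / 8 / 2**30:.1f} GiB of weights")
```

The estimate covers weights only, so the memory actually used at inference time will be higher.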

Refs:
- https://github.com/ggerganov/llama.cpp/pull/951
- https://news.ycombinator.com/item?id=35148542
- https://github.com/ggerganov/llama.cpp/issues/53
- https://arxiv.org/abs/2210.17323
- https://arxiv.org/abs/2105.03536
- https://arxiv.org/abs/2212.09720
- https://arxiv.org/abs/2301.00774
- https://github.com/IST-DASLab/gptq

<br>

**This model is an 8bit quantization of Vicuna 13B v1.1.**  
- 13B parameters
- Group size: 128
- wbits: 8
- true-sequential: yes
- act-order: yes
- 8-bit GPTQ
- Calibration dataset: c4
- Conversion process: LLaMa 13B -> LLaMa 13B HF -> Vicuna13B-v1.1 HF -> Vicuna13B-v1.1-8bit-128g
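
To make `wbits` and the group size concrete, here is a minimal sketch of plain group-wise round-to-nearest (RTN) quantization in pure Python. It is not GPTQ: GPTQ chooses the rounded values differently to compensate for quantization error, and this sketch does not do that. The function names are hypothetical, and storing a per-group scale and minimum is an assumption made for readability.

```python
import random


def quantize_group(weights, wbits):
    """Round one group of floats to integers in [0, 2**wbits - 1]."""
    maxq = 2**wbits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / maxq if hi > lo else 1.0  # constant group: any scale works
    q = [min(maxq, max(0, round((w - lo) / scale))) for w in weights]
    return q, scale, lo


def dequantize_group(q, scale, lo):
    """Map the stored integers back to approximate float weights."""
    return [lo + scale * v for v in q]


def max_error(weights, wbits, group_size=128):
    """Largest absolute reconstruction error over all groups."""
    worst = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        q, scale, lo = quantize_group(group, wbits)
        restored = dequantize_group(q, scale, lo)
        worst = max(worst, max(abs(a - b) for a, b in zip(group, restored)))
    return worst


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 0.02) for _ in range(1024)]
    for wbits in (4, 8):
        print(f"wbits={wbits}: max error {max_error(weights, wbits):.6f}")
```

With the same group size, the 8bit grid has 256 levels per group instead of 16, which is why the reconstruction error is much smaller.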

<br>
<br>

# Benchmarks

Using https://github.com/qwopqwop200/GPTQ-for-LLaMa/. Best results in **bold**.

*`--benchmark 2048 --check` results:*
| Model  | wikitext2 PPL  | ptb PPL  | c4 PPL  | VRAM Utilization (MiB)  |
|---|---|---|---|---|
| 4bit-GPTQ - TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g  | 8.517391204833984  | 20.888103485107422  | **7.058407783508301**  | **8670.26953125**  |
| 8bit-GPTQ - Thireus/Vicuna13B-v1.1-8bit-128g  | **8.508771896362305**  | **20.75649070739746**  | 7.105874538421631  | 14840.26171875  |

*`--eval` results:*
| Model  | wikitext2 PPL  | ptb PPL  | c4 PPL  |
|---|---|---|---|
| 4bit-GPTQ - TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g  | 7.119165420532227 | 25.692861557006836 | 9.06746768951416  |
| 8bit-GPTQ - Thireus/Vicuna13B-v1.1-8bit-128g  | **6.988043308258057**  | **24.882535934448242**  | **8.991846084594727**  |

*`--new-eval --eval` results:*
| Model  | wikitext2 PPL  | ptb-new PPL  | c4-new PPL  |
|---|---|---|---|
| 4bit-GPTQ - TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g  | 7.119165420532227  | 35.637290954589844  | 9.550592422485352  |
| 8bit-GPTQ - Thireus/Vicuna13B-v1.1-8bit-128g  | **6.988043308258057**  | **34.264320373535156**  | **9.426002502441406**  |

PPL = Perplexity (lower is better) - https://huggingface.co/docs/transformers/perplexity
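
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows only that formula; the helper name is hypothetical and it is not the evaluation code used to produce the tables above.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), using natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 1/4 to every token is as "perplexed"
    # as a uniform choice among 4 options.
    print(perplexity([math.log(0.25)] * 10))
```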

<br>
<br>

# Basic installation procedure

- It was a nightmare, so I will only briefly detail what you'll need. WSL was quite painful to sort out.
- I will not provide installation support, sorry.
- You can certainly use llama.cpp and other loaders that support 8bit quantization, I just chose oobabooga/text-generation-webui.
- You will likely face many bugs until text-generation-webui loads, ranging from missing PATH or env variables to having to manually pip uninstall/install packages.
- The notes below will likely become outdated once both text-generation-webui and GPTQ-for-LLaMa receive the appropriate bug fixes.
- If this model produces very slow answers (1 token/s), it means you are not using CUDA for bitsandbytes or that your hardware needs an upgrade.
- If this model produces answers with weird characters, it means you are using a broken commit of qwopqwop200/GPTQ-for-LLaMa.
- If this model produces answers that are off topic or if it talks to itself, it means you are using a broken commit of qwopqwop200/GPTQ-for-LLaMa.
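
When answers are very slow, a quick first check is whether PyTorch can see the GPU at all and how much VRAM it reports. The sketch below checks only PyTorch's view of the device (not bitsandbytes itself); the function name and the 15 GiB threshold, taken from the VRAM suggestion earlier in this card, are illustrative choices.

```python
def cuda_report(min_vram_gib=15.0):
    """Describe whether PyTorch sees a CUDA device with enough VRAM."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch cannot see a CUDA device"
    props = torch.cuda.get_device_properties(0)
    vram_gib = props.total_memory / 2**30
    verdict = "enough" if vram_gib >= min_vram_gib else "likely too little"
    return f"{props.name}: {vram_gib:.1f} GiB VRAM ({verdict} for this model)"


if __name__ == "__main__":
    print(cuda_report())
```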

RECOMMENDED - Triton (Fast tokens/s) - Works on Windows with WSL (what I've used) or Linux:
```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
# Triton support (https://github.com/oobabooga/text-generation-webui/pull/1229) has since been merged,
# so fetching and checking out the PR branch should no longer be needed:
#git fetch origin pull/1229/head:triton
#git checkout triton
pip install -r requirements.txt

mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git # -b cuda
cd GPTQ-for-LLaMa
#git checkout 508de42 # Since been fixed # Before qwopqwop200 broke everything... - https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/183
git checkout 210c379 # Optional - This is a commit I have verified, you may want to try the latest commit instead, if the latest commit doesn't work revert to an older one such as this one
pip install -r requirements.txt
```

DISCOURAGED - CUDA (Slow tokens/s) - Has output issues, see https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/128:
```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt

mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda # Make sure you obtain the qwopqwop200 version, not the oobabooga one! (because "act-order: yes")
cd GPTQ-for-LLaMa
git checkout 505c2c7 # Optional - This is a commit I have verified, you may want to try the latest commit instead, if the latest commit doesn't work revert to an older one such as this one
pip install -r requirements.txt
python setup_cuda.py install
```

<br>
<br>

# Testbench detail and demo

- Latest version of oobabooga + https://github.com/oobabooga/text-generation-webui/pull/1229

- NVIDIA RTX 3090
- 32GB DDR4
- i9-7980XE OC @4.6GHz

- 11 tokens/s on average with Triton
- Tokens/s equivalent to the 4bit version
- Preliminary observation: better quality results than 8bit RTN / `--load-in-8bit` (To Be Confirmed)
- Observation: better quality results than 4bit GPTQ (cf. PPL benchmarks above) at the cost of +71% VRAM usage.
- Tested and working in both chat mode and text generation mode
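
The +71% VRAM figure and the size of the perplexity gain can be reproduced from the benchmark tables with a few lines of arithmetic. The numbers are copied from the `--benchmark 2048 --check` and `--eval` tables; the helper name is hypothetical.

```python
def relative_change(new, old):
    """Percentage change from `old` to `new`."""
    return (new - old) / old * 100.0


# Values copied from the benchmark tables in this card.
vram = {"4bit": 8670.26953125, "8bit": 14840.26171875}
wikitext2_ppl = {"4bit": 7.119165420532227, "8bit": 6.988043308258057}

vram_change = relative_change(vram["8bit"], vram["4bit"])
ppl_change = relative_change(wikitext2_ppl["8bit"], wikitext2_ppl["4bit"])

if __name__ == "__main__":
    print(f"VRAM: {vram_change:+.1f}%")
    print(f"wikitext2 PPL (--eval): {ppl_change:+.2f}%")
```

In other words, on wikitext2 the 8bit model lowers perplexity by a bit under 2% in exchange for roughly 71% more VRAM.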

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_01.png)

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_02.png)

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_03.png)

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_04.png)

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_05.png)

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_06.png)

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_07.png)

<br>
<br>

# License

Research only - non-commercial research purposes - other restrictions apply. See inherited LICENSE file from LLaMa.

LLaMA-13B converted to work with Transformers/HuggingFace is under a special license, please see the LICENSE file for details.

https://www.reddit.com/r/LocalLLaMA/comments/12kl68j/comment/jg31ufe/

<br>
<br>

# Vicuna Model Card

## Model details

**Model type:**
Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT.
It is an auto-regressive language model, based on the transformer architecture.

**Model date:**
Vicuna was trained between March 2023 and April 2023.

**Organizations developing the model:**
The Vicuna team with members from UC Berkeley, CMU, Stanford, and UC San Diego.

**Paper or resources for more information:**
https://vicuna.lmsys.org/

**License:**
Apache License 2.0

**Where to send questions or comments about the model:**
https://github.com/lm-sys/FastChat/issues

## Intended use
**Primary intended uses:**
The primary use of Vicuna is research on large language models and chatbots.

**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

## Training dataset
70K conversations collected from ShareGPT.com.

## Evaluation dataset
A preliminary evaluation of the model quality is conducted by creating a set of 80 diverse questions and utilizing GPT-4 to judge the model outputs. See https://vicuna.lmsys.org/ for more details.

## Major updates of weights v1.1
- Refactor the tokenization and separator. In Vicuna v1.1, the separator has been changed from `"###"` to the EOS token `"</s>"`. This change makes it easier to determine the generation stop criteria and enables better compatibility with other libraries.
- Fix the supervised fine-tuning loss computation for better model quality.
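
One way to see why the separator change simplifies stop handling: when the stop marker is an ordinary string such as `"###"`, it can also occur in legitimate output (for example a markdown heading), whereas a dedicated EOS token does not. The sketch below is only an illustration with a hypothetical helper; it is not FastChat's actual stopping logic.

```python
def cut_at_stop(generated, stop="</s>"):
    """Return the text before the first occurrence of the stop string."""
    head, _, _ = generated.partition(stop)
    return head.rstrip()


if __name__ == "__main__":
    # v1.1 style: the EOS token marks the end of the answer.
    print(repr(cut_at_stop("Use a heading like ### Title.</s>")))
    # v1.0 style: the same answer is cut short by its own content.
    print(repr(cut_at_stop("Use a heading like ### Title.###", stop="###")))
```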