Thireus committed on
Commit b767fe7
1 Parent(s): d75d38b

Readme Init

Files changed (1): README.md (+129 -0)
---
license: apache-2.0
inference: true
tags:
- vicuna
---
![demo](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_08.png)

**This model is an 8-bit quantization of Vicuna 13B.**
- 13B parameters
- Group size: 128
- wbits: 8
- true-sequential: yes
- act-order: yes
- 8-bit quantized - Read more about this here: https://github.com/ggerganov/llama.cpp/pull/951
- Conversion process: Llama 13B -> Llama 13B HF -> Vicuna13B-v1.1 HF -> Vicuna13B-v1.1-8bit-128g (a sketch of the corresponding quantization command follows below)
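
For reference, the settings above presumably correspond to a GPTQ-for-LLaMa run along these lines. This is only a hedged sketch, not the exact command used for this upload: the input model directory, the calibration dataset choice (c4), and the output filename are placeholders.

```
# Hedged sketch of a GPTQ-for-LLaMa quantization run matching the settings above
# (8-bit weights, group size 128, true-sequential, act-order). Paths and names are placeholders.
CUDA_VISIBLE_DEVICES=0 python llama.py ./Vicuna13B-v1.1-HF c4 \
  --wbits 8 --groupsize 128 --true-sequential --act-order \
  --save vicuna13b-v1.1-8bit-128g.pt
```
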
<br>
<br>

# Basic installation procedure

- It was a nightmare, so I will only briefly detail what you'll need. WSL was quite painful to sort out.
- I will not provide installation support, sorry.
- You can certainly use llama.cpp or other loaders that support 8-bit quantization; I just chose oobabooga/text-generation-webui.
- You will likely face many bugs until text-generation-webui loads, ranging from missing PATH or environment variables to having to manually pip uninstall/install packages.
- The notes below will likely become outdated once both text-generation-webui and GPTQ-for-LLaMa receive the appropriate bug fixes.
- If this model produces very slow answers (1 token/s), it means you are not using CUDA for bitsandbytes or that your hardware needs an upgrade (a quick check is sketched after this list).
- If this model produces answers with weird characters, it means you are not using the correct version of qwopqwop200/GPTQ-for-LLaMa as mentioned below.
- If this model produces answers that are off topic or if it talks to itself, it means you are not using the correct checkout 508de42 of qwopqwop200/GPTQ-for-LLaMa as mentioned below.
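
As a quick sanity check for the slow-answer case above, it is worth confirming that PyTorch actually sees a CUDA device before debugging the loaders. This is a generic check, not something specific to this model:

```
# If this prints False (or a CUDA error), generation falls back to CPU and crawls at ~1 token/s.
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```
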

Cuda (Slow tokens/s):
```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt

mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda # Make sure you obtain the qwopqwop200 version, not the oobabooga one! (because "act-order: yes")
cd GPTQ-for-LLaMa
pip install -r requirements.txt
python setup_cuda.py install
```
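
If the Cuda route built successfully, the compiled kernel should import cleanly. This is an assumption-level check: to my knowledge the extension built by setup_cuda.py in that branch of GPTQ-for-LLaMa is named `quant_cuda`.

```
# Optional check that the CUDA kernel from setup_cuda.py installed; an ImportError means the build failed.
python -c "import quant_cuda; print('quant_cuda OK')"
```
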

Triton (Fast tokens/s) - Works on Windows with WSL (what I've used) or Linux:
```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
git fetch origin pull/1229/head:triton # This is the version that supports Triton - https://github.com/oobabooga/text-generation-webui/pull/1229
git checkout triton
pip install -r requirements.txt

mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git # -b cuda
cd GPTQ-for-LLaMa
git checkout 508de42 # Before qwopqwop200 broke everything... - https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/183
pip install -r requirements.txt
```
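
Once either route is installed, loading this model in text-generation-webui should look roughly like the following. The Hugging Face repository id and the launch flags are my assumptions based on this card's file naming, so adjust them to whatever you actually download:

```
# From the text-generation-webui directory: fetch the quantized weights, then launch.
# The repo id below is assumed from this card's naming; download-model.py saves it under models/.
python download-model.py Thireus/Vicuna13B-v1.1-8bit-128g
python server.py --model Thireus_Vicuna13B-v1.1-8bit-128g \
  --wbits 8 --groupsize 128 --model_type llama --chat
```
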

<br>
<br>

# Testbench detail and results

- Latest version of oobabooga/text-generation-webui plus https://github.com/oobabooga/text-generation-webui/pull/1229

- NVIDIA RTX 3090
- 32GB DDR4
- i9-7980XE OC @ 4.6 GHz

- 11 tokens/s on average with Triton
- Preliminary observations: better results than --load-in-8bit (to be confirmed; a comparison launch command is sketched after this list)
- Tested and working in both chat mode and text generation mode
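
For context on the --load-in-8bit comparison above: that baseline loads an unquantized HF Vicuna checkpoint and quantizes it to 8-bit at load time via bitsandbytes, so a comparison run would presumably be launched along these lines (the model folder name is a placeholder):

```
# bitsandbytes 8-bit baseline for comparison; "vicuna-13b-v1.1-hf" is a placeholder
# for an unquantized Vicuna-13B v1.1 HF checkpoint placed under models/.
python server.py --model vicuna-13b-v1.1-hf --load-in-8bit --chat
```
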

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_01.png)

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_02.png)

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_03.png)

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_04.png)

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_05.png)

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_06.png)

![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_07.png)

<br>
<br>

# Vicuna Model Card

## Model details

**Model type:**
Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT.
It is an auto-regressive language model, based on the transformer architecture.

**Model date:**
Vicuna was trained between March 2023 and April 2023.

**Organizations developing the model:**
The Vicuna team with members from UC Berkeley, CMU, Stanford, and UC San Diego.

**Paper or resources for more information:**
https://vicuna.lmsys.org/

**License:**
Apache License 2.0

**Where to send questions or comments about the model:**
https://github.com/lm-sys/FastChat/issues

## Intended use
**Primary intended uses:**
The primary use of Vicuna is research on large language models and chatbots.

**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

## Training dataset
70K conversations collected from ShareGPT.com.

## Evaluation dataset
A preliminary evaluation of the model quality is conducted by creating a set of 80 diverse questions and utilizing GPT-4 to judge the model outputs. See https://vicuna.lmsys.org/ for more details.

## Major updates of weights v1.1
- Refactor the tokenization and separator. In Vicuna v1.1, the separator has been changed from `"###"` to the EOS token `"</s>"`. This change makes it easier to determine the generation stop criteria and enables better compatibility with other libraries.
- Fix the supervised fine-tuning loss computation for better model quality.