Thireus's picture
Readme Init
b767fe7
|
raw
history blame
No virus
5.19 kB
metadata
license: apache-2.0
inference: true
tags:
  - vicuna

demo

This model is a 8bit quantization of Vicuna 13B.

  • 13B parameters
  • Group size: 128
  • wbits: 8
  • true-sequential: yes
  • act-order: yes
  • 8-bit quantized - Read more about this here: https://github.com/ggerganov/llama.cpp/pull/951
  • Conversion process: Llama 13B -> Llama 13B HF -> Vicuna13B-v1.1 HF -> Vicuna13B-v1.1-8bit-128g


Basic installation procedure

  • It was a nightmare, I will only detail briefly what you'll need. WSL was quite painful to sort out.
  • I will not provide installation support, sorry.
  • You can certainly use llama.cpp and other loaders that support 8bit quantization, I just chose oobabooga/text-generation-webui.
  • You will likely face many bugs until text-generation-webui loads, ranging between missing PATH or env variables to having to manually pip uninstall/install packages.
  • The notes below will likely become outdated once both text-generation-webui and GPTQ-for-LLaMa receive the appropriate bug fixes.
  • If this model produces very slow answers (1 token/s), it means you are not using Cuda for bitsandbytes or that your hardware needs an upgrade.
  • If this model produces answers with weird characters, it means you are not using the correct version of qwopqwop200/GPTQ-for-LLaMa as mentioned below.
  • If this model produces answers that are out of topic or if it talks to itself, it means you are not using the correct checkout 508de42 of qwopqwop200/GPTQ-for-LLaMa as mentioned below.

Cuda (Slow tokens/s):

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt

mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda # Make sure you obtain the qwopqwop200 version, not the oobabooga one! (because "act-order: yes")
cd GPTQ-for-LLaMa
pip install -r requirements.txt
python setup_cuda.py install

Triton (Fast tokens/s) - Works on Windows with WSL (what I've used) or Linux:

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
git fetch origin pull/1229/head:triton # This is the version that supports Triton - https://github.com/oobabooga/text-generation-webui/pull/1229
git checkout triton
pip install -r requirements.txt

mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git # -b cuda
cd GPTQ-for-LLaMa
git checkout 508de42 # Before qwopqwop200 broke everything... - https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/183
pip install -r requirements.txt


Testbench detail and results

  • Latest version of oobabooga + https://github.com/oobabooga/text-generation-webui/pull/1229

  • NVIDIA GTX 3090

  • 32BG DDR4

  • i9-7980XE OC @4.6Ghz

  • 11 tokens/s on average with Triton

  • Preliminary observations: better results than --load-in-8bits (To Be Confirmed)

  • Tested and working in both chat mode and text generation mode

screenshot

screenshot

screenshot

screenshot

screenshot

screenshot

screenshot



Vicuna Model Card

Model details

Model type: Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. It is an auto-regressive language model, based on the transformer architecture.

Model date: Vicuna was trained between March 2023 and April 2023.

Organizations developing the model: The Vicuna team with members from UC Berkeley, CMU, Stanford, and UC San Diego.

Paper or resources for more information: https://vicuna.lmsys.org/

License: Apache License 2.0

Where to send questions or comments about the model: https://github.com/lm-sys/FastChat/issues

Intended use

Primary intended uses: The primary use of Vicuna is research on large language models and chatbots.

Primary intended users: The primary intended users of the model are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

Training dataset

70K conversations collected from ShareGPT.com.

Evaluation dataset

A preliminary evaluation of the model quality is conducted by creating a set of 80 diverse questions and utilizing GPT-4 to judge the model outputs. See https://vicuna.lmsys.org/ for more details.

Major updates of weights v1.1

  • Refactor the tokenization and separator. In Vicuna v1.1, the separator has been changed from "###" to the EOS token "</s>". This change makes it easier to determine the generation stop criteria and enables better compatibility with other libraries.
  • Fix the supervised fine-tuning loss computation for better model quality.