README.md · jeremy-costello/vicuna-13b-v1.1-4bit-128g at main

4-bit quantization of the vicuna-13b-v1.1 model.

The Vicuna delta was added to the original LLaMA weights using FastChat.
Quantization and inference were done with GPTQ-for-LLaMa (commit 58c8ab4).
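
To illustrate what adding the delta means: the delta checkpoint holds, for each tensor, the difference between the Vicuna weights and the base LLaMA weights, so the merged model is the element-wise sum of the two. The sketch below uses plain Python lists in place of real tensors. The function name `apply_delta` and the tensor name `layer0.weight` are hypothetical, and this is not FastChat's own code.

```python
def apply_delta(base, delta):
    """Return merged weights, computed as base[name] + delta[name] element-wise."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same tensor names")
    merged = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"size mismatch for tensor {name}")
        # Each merged weight is the base weight plus its delta.
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


# Toy data standing in for real checkpoints (hypothetical values).
base = {"layer0.weight": [1.0, 2.0, 3.0]}
delta = {"layer0.weight": [0.5, -1.0, 0.0]}
merged = apply_delta(base, delta)
print(merged)  # {'layer0.weight': [1.5, 1.0, 3.0]}
```

For the real model, use the delta-application tool that ships with FastChat (linked below) rather than a hand-rolled merge.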

Quantization args: $MODEL_DIRECTORY, c4, wbits 4, true-sequential, act-order, groupsize 128.
Inference args: $MODEL_DIRECTORY, wbits 4, groupsize 128, load $CHECKPOINT_FILE.
Add the arg device=0 to run inference on a GPU. You may need to adjust min_length and max_length to get better outputs.
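
The argument lists above could be turned into command lines roughly as follows. This is a sketch under assumptions: the script names `llama.py` and `llama_inference.py` and the `--flag` spellings (including `--text`, `--device`, `--min_length`, `--max_length`) follow the upstream GPTQ-for-LLaMa README as I understand it, and may differ at commit 58c8ab4. The paths are hypothetical. The code only builds the commands and does not run them.

```python
import shlex


def quantize_command(model_dir, script="llama.py"):
    """Build the quantization command from the args listed in this README."""
    # Script name and flag spellings are assumptions, not confirmed for the pinned commit.
    return [
        "python", script, model_dir, "c4",
        "--wbits", "4",
        "--true-sequential",
        "--act-order",
        "--groupsize", "128",
    ]


def inference_command(model_dir, checkpoint_file, text=None, device=None,
                      min_length=None, max_length=None,
                      script="llama_inference.py"):
    """Build the inference command; optional args are added only when given."""
    cmd = [
        "python", script, model_dir,
        "--wbits", "4",
        "--groupsize", "128",
        "--load", checkpoint_file,
    ]
    if text is not None:
        cmd += ["--text", text]  # prompt flag: assumed, not listed in this README
    if device is not None:
        cmd += ["--device", str(device)]  # device=0 selects the first GPU
    if min_length is not None:
        cmd += ["--min_length", str(min_length)]
    if max_length is not None:
        cmd += ["--max_length", str(max_length)]
    return cmd


# Hypothetical paths, for illustration only.
print(shlex.join(quantize_command("models/vicuna-13b-v1.1")))
print(shlex.join(inference_command(
    "models/vicuna-13b-v1.1", "vicuna-13b-4bit-128g.pt",
    text="Human: Hello</s>Assistant:", device=0, max_length=256,
)))
```

This README does not say how the quantized checkpoint is written to disk, so the sketch leaves that out. Check the script's help output at the pinned commit before relying on any flag.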

The separator has been changed to </s>. A simple prompt is "Human: $REQUEST</s>Assistant:".
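
A minimal sketch of building that prompt in Python. The names `SEPARATOR` and `build_prompt` are hypothetical helpers, not part of FastChat or GPTQ-for-LLaMa.

```python
SEPARATOR = "</s>"  # separator described in this README


def build_prompt(request):
    """Format a single-turn prompt as "Human: <request></s>Assistant:"."""
    return f"Human: {request}{SEPARATOR}Assistant:"


prompt = build_prompt("What is 4-bit quantization?")
print(prompt)  # Human: What is 4-bit quantization?</s>Assistant:
```

The model's reply is expected to follow the trailing "Assistant:".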

Delta: https://huggingface.co/lmsys/vicuna-13b-delta-v1.1
FastChat: https://github.com/lm-sys/FastChat
GPTQ-for-LLaMa: https://github.com/qwopqwop200/GPTQ-for-LLaMa