jeremy-costello / vicuna-13b-v1.1-4bit-128g

---
inference: false
---
4-bit GPTQ quantization (group size 128) of the vicuna-13b-v1.1 model.

The delta was added to the original LLaMA weights using FastChat. \
Quantization and inference were done with GPTQ-for-LLaMa (commit 58c8ab4).
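
Conceptually, applying a delta means adding each delta value to the base value stored under the same parameter name. The following is only a toy sketch of that idea, using plain Python lists in place of real tensors. The function name `apply_delta` and the parameter names are hypothetical, and this is not FastChat's code.

```python
def apply_delta(base, delta):
    """Return target weights where target[name] = base[name] + delta[name].

    Toy illustration: `base` and `delta` map parameter names to flat lists
    of floats. Real checkpoints hold tensors, not lists.
    """
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same parameter names")
    target = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"size mismatch for {name!r}")
        # Element-wise sum of the base weights and the delta.
        target[name] = [b + d for b, d in zip(base_values, delta_values)]
    return target


# Hypothetical two-parameter "model", for illustration only.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
delta = {"layer.weight": [0.25, -0.5], "layer.bias": [0.5]}
target = apply_delta(base, delta)
```

For the actual 13B weights, use FastChat's own tooling; the sketch only shows the arithmetic.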

Quantization args: $MODEL_DIRECTORY, c4, wbits 4, true-sequential, act-order, groupsize 128. \
Inference args: $MODEL_DIRECTORY, wbits 4, groupsize 128, load $CHECKPOINT_FILE. \
Add the argument device=0 if using a GPU for inference. You may need to adjust min_length and max_length to get better outputs.
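
The argument lists above can be assembled into command lines roughly as follows. This is a sketch under assumptions: the script names `llama.py` and `llama_inference.py` and the `--flag` spellings are not stated in this card and should be checked against GPTQ-for-LLaMa at commit 58c8ab4. The helper names are hypothetical. The code only builds the command strings; it does not run anything.

```python
import shlex


def quantize_args(model_dir):
    """Quantization arguments listed in this card (flag spelling assumed)."""
    return [model_dir, "c4", "--wbits", "4", "--true-sequential",
            "--act-order", "--groupsize", "128"]


def inference_args(model_dir, checkpoint_file, device=None,
                   min_length=None, max_length=None):
    """Inference arguments listed in this card (flag spelling assumed)."""
    args = [model_dir, "--wbits", "4", "--groupsize", "128",
            "--load", checkpoint_file]
    # Optional settings mentioned in the card; added only when given.
    if device is not None:
        args += ["--device", str(device)]
    if min_length is not None:
        args += ["--min_length", str(min_length)]
    if max_length is not None:
        args += ["--max_length", str(max_length)]
    return args


# Script names are assumptions; the paths are placeholders.
quant_cmd = shlex.join(["python", "llama.py"] + quantize_args("MODEL_DIRECTORY"))
infer_cmd = shlex.join(
    ["python", "llama_inference.py"]
    + inference_args("MODEL_DIRECTORY", "CHECKPOINT_FILE", device=0)
)
```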

The separator has been changed to \</s\>. A simple prompt is "Human: $REQUEST\</s\>Assistant:".
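
A minimal sketch of building that single-turn prompt in Python. The names `SEPARATOR` and `build_prompt` are hypothetical, and multi-turn formatting is not covered because this card does not describe it.

```python
SEPARATOR = "</s>"  # separator stated in this card


def build_prompt(request):
    """Format one user request using the simple prompt from this card."""
    return f"Human: {request}{SEPARATOR}Assistant:"


prompt = build_prompt("What is 4-bit quantization?")
```

The resulting string is what would be passed to the inference script as the input text.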

Delta: https://huggingface.co/lmsys/vicuna-13b-delta-v1.1 \
FastChat: https://github.com/lm-sys/FastChat \
GPTQ-for-LLaMa: https://github.com/qwopqwop200/GPTQ-for-LLaMa