
How to run QLoRA with Stable Vicuna?

#28
by Andyrasika - opened

Hey @TheBloke,
Thanks for the amazing model. Is there a way to quantize this model to run it in 4-bit?
Looking forward to hearing from you.
Thanks,
Andy

Well yes - this repo IS a 4-bit quantisation :) So you can use this repo with AutoGPTQ or (very soon) Transformers, when it adds native GPTQ support.
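
For example, a minimal AutoGPTQ sketch (the repo ID and prompt template below are assumptions based on the usual stable-vicuna naming; check the model card for the exact values):

```python
# Sketch: load the pre-quantised 4-bit GPTQ weights with AutoGPTQ
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/stable-vicuna-13B-GPTQ"  # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# Loads the already-quantised GPTQ weights straight onto the GPU
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
)

prompt = "### Human: What is 4-bit quantisation?\n### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```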

Or you can use my stable-vicuna-13B-HF repo with bitsandbytes and load_in_4bit=True for on-the-fly 4-bit quantisation.
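
Roughly like this (a sketch assuming a recent transformers with bitsandbytes installed):

```python
# Sketch: on-the-fly 4-bit quantisation of the fp16 HF repo via bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/stable-vicuna-13B-HF"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# load_in_4bit quantises the weights as they are loaded; needs bitsandbytes installed
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    device_map="auto",
)

prompt = "### Human: Hello!\n### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```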

Or you can use one of the files in my stable-vicuna-13B-GGML repo with llama-cpp-python or ctransformers - that repo offers multiple quantisation sizes, including 4-bit as well as larger and smaller options.
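
For example, with llama-cpp-python (a sketch; the filename below is an assumption, so use whichever quantisation you downloaded, and note that newer llama-cpp-python releases expect GGUF, so an older release may be needed for GGML files):

```python
# Sketch: CPU (or partially GPU-offloaded) inference on a downloaded GGML file
from llama_cpp import Llama

llm = Llama(
    model_path="./stable-vicuna-13B.ggmlv3.q4_0.bin",  # assumed filename
    n_ctx=2048,       # Llama 1 context length
    n_gpu_layers=0,   # raise this to offload layers if built with GPU support
)

output = llm(
    "### Human: What is GGML?\n### Assistant:",
    max_tokens=128,
    stop=["### Human:"],
)
print(output["choices"][0]["text"])
```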

All that said, this is an old model and much better models have been released since then. All the recent Llama models I've uploaded are based on Llama 2, so they are commercially licensed, have a larger context size (4096 vs 2048), and their base models were trained on twice as many tokens (2T vs 1T). So I no longer recommend this much older Llama 1 model.
