Converted to HF format with transformers 4.30.0.dev0, then quantized to 4-bit with GPTQ (group size 32):

```
python llama.py ../llama-65b-hf c4 --wbits 4 --true-sequential --act-order --groupsize 32 --save_safetensors 4bit-32g.safetensors
```
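
For reference, the HF conversion step mentioned above is normally done with transformers' LLaMA conversion script; a minimal sketch, assuming the original Meta weights sit in `../llama` and the converted model should land in `../llama-65b-hf`:

```
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir ../llama --model_size 65B --output_dir ../llama-65b-hf
```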

Perplexity (PPL) should be marginally better than with group size 128, at the cost of more VRAM. An A6000 (48 GB) should still be able to fit it all at the full 2048-token context.
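
As a rough sanity check on that claim, here is a back-of-the-envelope sketch of the quantized weight footprint (an approximation only: it ignores activations, the KV cache, and framework overhead, and treats all ~65B parameters as quantized):

```python
# Approximate size of the packed 4-bit weights plus the per-group fp16 scale
# and packed 4-bit zero-point that group-wise GPTQ stores alongside them.
N_PARAMS = 65e9  # ~65B weights
BITS = 4

def weight_footprint_gb(group_size: int) -> float:
    weights_bits = N_PARAMS * BITS
    group_bits = (N_PARAMS / group_size) * (16 + 4)  # scale + zero-point per group
    return (weights_bits + group_bits) / 8 / 1e9

print(f"group size 32:  ~{weight_footprint_gb(32):.1f} GB")   # ~37.6 GB
print(f"group size 128: ~{weight_footprint_gb(128):.1f} GB")  # ~33.8 GB
```

The ~4 GB gap between the two group sizes is the extra VRAM referred to above; the remaining headroom on a 48 GB card goes to the KV cache and activations at 2048 context.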


Note that this model was quantized with GPTQ's CUDA branch, which means it should work with 0cc4m's KoboldAI fork: https://github.com/0cc4m/KoboldAI
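
Outside of KoboldAI, one script-based way to try the checkpoint is AutoGPTQ. The sketch below is an assumption, not something verified against this file: the local path is a placeholder, the BaseQuantizeConfig values are inferred from the command above, and compatibility with CUDA-branch checkpoints that combine --act-order and --groupsize is not guaranteed.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Hypothetical local directory holding the HF config, tokenizer and 4bit-32g.safetensors.
model_dir = "./llama-65b-4bit-32g"

# Mirror the quantization settings used above: 4-bit, group size 32, act-order.
quantize_config = BaseQuantizeConfig(bits=4, group_size=32, desc_act=True)

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    model_basename="4bit-32g",   # safetensors filename, minus the extension
    use_safetensors=True,
    quantize_config=quantize_config,
    device="cuda:0",
)

prompt = "The LLaMA 65B model is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```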
