Meta AI's LLaMA, quantized to 4-bit with the GPTQ algorithm v2.

Conversion process (with --groupsize 128):

CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-13b c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors ./q4/llama13b-4bit-ts-ao-g128-v2.safetensors

Note: This model will fail to load with the current GPTQ-for-LLaMa implementation.

Conversion process (without --groupsize):

CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-13b c4 --wbits 4 --true-sequential --act-order --save_safetensors ./q4/llama13b-4bit-v2.safetensors
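
To check what a conversion run actually wrote, the header of the resulting .safetensors file can be read without loading any weights. The sketch below is not part of GPTQ-for-LLaMa; it relies only on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header), and the helper names `read_safetensors_header` and `summarize` are made up for illustration.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header size as an unsigned
        # little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # The next header_len bytes are a UTF-8 JSON object that maps
        # tensor names to their dtype, shape, and data offsets.
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize(header):
    """Map tensor name -> (dtype, shape), skipping the optional metadata entry."""
    return {
        name: (info["dtype"], tuple(info["shape"]))
        for name, info in header.items()
        if name != "__metadata__"
    }
```

For example, `summarize(read_safetensors_header("./q4/llama13b-4bit-v2.safetensors"))` would list the tensors stored by the second command above, which may help when comparing the two files or diagnosing a loader that rejects one of them.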