How to quantise the model?

#2
by szbigcat - opened

Hi, I learned how to convert the original format to HF format from your documentation and converted the 13B model successfully. But I cannot load the HF model onto my GPU, since it takes over 39 GB of VRAM. I think quantizing the model might solve my problem, but I don't know how to do it.

Performing quantization also requires loading the model into VRAM. Instead, you could try an already-quantized model.

For example, this GPTQ one.
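
Loading a pre-quantized GPTQ checkpoint through transformers might look roughly like the sketch below. The repo id is only a placeholder (not necessarily the model linked above), and it assumes the accelerate, optimum, and auto-gptq packages are installed alongside transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id: substitute whichever GPTQ model you actually pick
model_id = "TheBloke/Llama-2-13B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The already-quantized 4-bit weights are loaded straight onto the GPU,
# so the full FP16 model never has to fit in VRAM
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))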

Thank you. I realize I have to give up doing it on my own PC for now. But I still want to know how to quantize a model; maybe I will try it on a smaller model like Llama-2 7B later.

Is the code below the answer to my question?

import torch

# model_fp32 is the full-precision model loaded earlier (e.g. the converted HF checkpoint).
# Dynamic quantization converts the Linear layers' weights to INT8; note that
# PyTorch dynamic quantization runs on the CPU, not the GPU.
quantized_model = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8, inplace=False
)

# Cast the remaining floating-point modules (embeddings, norms, etc.) to FP16.
# This is half precision, not 4-bit; the Linear weights above are already INT8.
quantized_model = quantized_model.to(torch.float16)

# Save the quantized model's weights
torch.save(quantized_model.state_dict(), "quantized_model.pth")
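
As a side note, quantize_dynamic produces INT8 weights and runs on the CPU, so it will not give 4-bit weights or reduce GPU memory use. For actual 4-bit loading, a common route is bitsandbytes through transformers; a minimal sketch is below, where the model path is only a placeholder and the bitsandbytes and accelerate packages plus a CUDA GPU are assumed:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4-bit while loading
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in FP16
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "path/to/hf-llama-2-7b",               # placeholder path to the converted HF checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)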
