Confusion about size on disk

#1
by laiviet - opened

What is the reason the params are stored as float16 and int32 in the .safetensors file?
Why is it not stored in another format such as int4 to shrink the size on disk by a factor of 4-8?

@laiviet It's stored in 4-bit? The original model file is slightly over 13 GB (https://huggingface.co/huggyllama/llama-7b/tree/main).

This quantized version is less than 4 GB.

@YaTharThShaRma999
here is the data I loaded from the .safetensors file:
Total params: 1,128,828,928
Total bytes: 3,889,307,648
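For context, here is roughly how I pulled those totals out of the file. This is a minimal sketch; "model.safetensors" is just a placeholder for whatever the shard is actually named:

```python
# Minimal sketch: sum parameter counts and on-disk bytes from a safetensors file.
# "model.safetensors" is a placeholder name, not the actual shard name.
from safetensors import safe_open

total_params = 0
total_bytes = 0
with safe_open("model.safetensors", framework="pt") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        total_params += t.numel()
        total_bytes += t.numel() * t.element_size()

print(f"Total params: {total_params:,}")
print(f"Total bytes: {total_bytes:,}")
```

Note that this counts each packed I32 element as one "param", which is why the total is much smaller than 7B.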

A (4096, 4096) linear layer weight is decomposed into the tensors below, which are significantly smaller (see the rough size check after the listing).
They are all stored as either I32 (int32) or F16 (fp16):

model.layers.0.self_attn.v_proj.qweight [4096, 512] I32
model.layers.0.self_attn.v_proj.qzeros [32, 512] I32
model.layers.0.self_attn.v_proj.scales [32, 4096] F16
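As a sanity check on the sizes, here is the byte math for this one layer, using the shapes and dtypes above. The group size of 128 is just what the 32 x 4096 scales shape implies, so treat it as an assumption:

```python
# Back-of-the-envelope size check for one (4096, 4096) linear weight,
# using the shapes/dtypes listed above. Group size 128 is implied by
# 32 groups covering 4096 input features (assumption).
fp16_weight = 4096 * 4096 * 2        # original fp16 weight: 32 MiB
qweight = 4096 * 512 * 4             # int4 weights packed into int32: 8 MiB
qzeros = 32 * 512 * 4                # packed zero points: 64 KiB
scales = 32 * 4096 * 2               # fp16 scales: 256 KiB

quantized = qweight + qzeros + scales
print(fp16_weight / quantized)       # ~3.85x smaller than fp16 for this layer
```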

My hypothesis is that the safetensors format doesn't have an int4 dtype, so they save the weights as I32, and each I32 can store 8 I4 params.
So actually:
model.layers.0.self_attn.v_proj.qweight [4096, 512] in I32 is [4096, 4096] in I4
model.layers.0.self_attn.v_proj.qzeros [32, 512] in I32 is [32, 4096] in I4
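If that hypothesis is right, unpacking is just bit-shifting. Something like this sketch, where the nibble order (low bits first) is my assumption and may not match the packing order the quantizer actually used:

```python
# Sketch: unpack 8 4-bit values out of each int32 along the last dimension.
# Assumes low-nibble-first order, which may differ from the real packing.
import torch

def unpack_int4(qweight: torch.Tensor) -> torch.Tensor:
    shifts = torch.arange(0, 32, 4)                    # 0, 4, 8, ..., 28
    # (rows, cols, 1) >> (8,) -> (rows, cols, 8); keep only the low 4 bits
    nibbles = (qweight.unsqueeze(-1) >> shifts) & 0xF
    return nibbles.reshape(qweight.shape[0], -1)

packed = torch.randint(-2**31, 2**31 - 1, (4096, 512), dtype=torch.int32)
print(unpack_int4(packed).shape)                       # torch.Size([4096, 4096])
```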
