Sorry for asking a dumb question, but what's the difference between 4-bit and 8-bit?
With 4-bit quantization, each model weight is stored using 4 bits, which gives a much smaller model but potentially lower accuracy than the full-precision original. 8-bit quantization stores each weight in 8 bits, striking a balance: it typically retains more accuracy than 4-bit while still offering significant compression over the full-precision model.
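For a concrete sense of the size difference, here's a back-of-the-envelope calculation (a rough sketch; the 7B parameter count is just an example, and the numbers ignore quantization overhead like scales/zero-points, activations, and the KV cache):

```python
# Approximate weight-storage footprint of a 7B-parameter model at different precisions.
params = 7_000_000_000

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    gib = params * bits / 8 / 1024**3  # bits -> bytes -> GiB
    print(f"{name}: ~{gib:.1f} GiB")

# fp16: ~13.0 GiB
# int8: ~6.5 GiB
# int4: ~3.3 GiB
```

So going from 8-bit to 4-bit roughly halves the memory needed for the weights again, at the cost of representing each weight with only 16 possible levels instead of 256.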
In short, the choice between 4-bit and 8-bit quantization comes down to how much model size and inference speed matter versus how much accuracy loss your application and hardware can tolerate.
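If you're loading models through Hugging Face transformers with bitsandbytes, switching between the two is mostly a config change. A minimal sketch, assuming transformers and bitsandbytes are installed and a GPU is available; the model name is just a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM on the Hub

# 8-bit: better accuracy retention, roughly half the memory of fp16
config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit: smallest footprint, usually a bit more accuracy loss
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute still runs in higher precision
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=config_4bit,  # swap in config_8bit to compare
    device_map="auto",
)
```

A reasonable rule of thumb: try 4-bit first if you're VRAM-constrained, and move up to 8-bit (or full precision) if the quality drop is noticeable for your use case.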