
X-bit Question

#6
by shadowthecat1918 - opened

If I understand correctly, 4-bit, 8-bit, and X-bit systems are developed to run on hardware more accessible to the general public. Is that correct? What is GGML and is it also meant to boost performance for the general public?

Yes, that's correct.

GGML is a model format developed by Georgi Gerganov. It's built on C/C++ code, rather than the Python code that powers Hugging Face Transformers, GPTQ, and most other inference methods.

GGML supports unquantised inference, but it's almost always used with quantised models, in 2, 3, 4, 5, 6, or 8-bit, with 4-bit being the most common.
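To give an intuition for what "4-bit" means here: each weight is rounded to one of 16 integer levels, stored alongside a per-block scale factor used to recover an approximate float at inference time. Below is a minimal illustrative sketch of symmetric block-wise 4-bit quantisation in Python; it is not GGML's exact scheme (GGML's Q4 formats pack two 4-bit values per byte and use their own block layouts), just the basic idea:

```python
import numpy as np

def quantize_4bit(block):
    """Quantise a block of floats to 4-bit integers plus one scale.

    Illustrative only -- GGML's real Q4 formats differ in layout.
    """
    scale = np.abs(block).max() / 7.0  # map values into [-7, 7]
    if scale == 0.0:
        return np.zeros(len(block), dtype=np.int8), 0.0
    # Round to nearest integer level; 4 bits signed covers [-8, 7]
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    """Recover approximate floats from 4-bit levels and the scale."""
    return q.astype(np.float32) * scale

# A block of 32 weights (GGML-style block size)
weights = np.random.randn(32).astype(np.float32)
q, scale = quantize_4bit(weights)
restored = dequantize_4bit(q, scale)
# Rounding error is bounded by half the scale step
max_err = np.abs(weights - restored).max()
```

The storage saving is what makes consumer hardware viable: 32 weights shrink from 128 bytes of float32 to 16 bytes of packed 4-bit values plus one scale, at the cost of the small rounding error above.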

GGML has always been able to run on more modest hardware than other formats because it performs far better on CPU. But recently it has also gained decent GPU acceleration, meaning it's now starting to be competitive on performance as well.
