
X-bit Question

#6
by shadowthecat1918 - opened

If I understand correctly, 4-bit, 8-bit, and X-bit systems are developed to run on hardware more accessible to the general public. Is that correct? What is GGML and is it also meant to boost performance for the general public?

Yes, that's correct.

GGML is a model format developed by Georgi Gerganov. It's built on C/C++ code, rather than the Python code that powers Hugging Face Transformers, GPTQ, and most other inference methods.

GGML supports unquantised inference, but it's almost always used with quantised models, in 2, 3, 4, 5, 6, or 8-bit, with 4-bit being the most common.
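To give an intuition for what "4-bit" means here: each weight is rounded to one of 16 integer levels, stored alongside a per-block scale factor used to recover an approximate float at inference time. Below is a minimal illustrative sketch of symmetric block-wise 4-bit quantisation in Python; it is not GGML's exact scheme (GGML's Q4 formats pack two 4-bit values per byte and use their own block layouts), just the basic idea:

```python
import numpy as np

def quantize_4bit(block):
    """Quantise a block of floats to 4-bit integers plus one scale.

    Illustrative only -- GGML's real Q4 formats differ in layout.
    """
    scale = np.abs(block).max() / 7.0  # map values into [-7, 7]
    if scale == 0.0:
        return np.zeros(len(block), dtype=np.int8), 0.0
    # Round to nearest integer level; 4 bits signed covers [-8, 7]
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    """Recover approximate floats from 4-bit levels and the scale."""
    return q.astype(np.float32) * scale

# A block of 32 weights (GGML-style block size)
weights = np.random.randn(32).astype(np.float32)
q, scale = quantize_4bit(weights)
restored = dequantize_4bit(q, scale)
# Rounding error is bounded by half the scale step
max_err = np.abs(weights - restored).max()
```

The storage saving is what makes consumer hardware viable: 32 weights shrink from 128 bytes of float32 to 16 bytes of packed 4-bit values plus one scale, at the cost of the small rounding error above.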

GGML has always been able to run on more modest hardware than other formats because it performs far better on CPU. But recently it has also gained decent GPU acceleration, meaning it's now starting to be competitive on performance as well.
