Custom Quantization Types

#2
by christopherthompson81 - opened

I would like to make my own quants, but the vocab is reported as incomplete when converting with llama.cpp. This likely means the tokenizer implementation is unsupported on llama.cpp:main. Is there a llama.cpp PR or a fork I could look at that does support it, like the nous-llama.cpp repo?

What quant format? I was able to make AWQ without any issue.

Use the --pad-vocab option when converting to GGUF; that should resolve the issue:

python convert.py $model --pad-vocab --outtype f16
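From there, quantizing the resulting f16 GGUF is the usual llama.cpp step. For example (the quantize binary's name and the quant type below are just illustrative; adjust to whatever quant you want):

./quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M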

Padding the vocab results in dummy tokens being generated fairly frequently, so it isn't really a useful workaround.

You can use convert-hf-to-gguf.py as well.
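
For example, something along these lines should work (exact flag names may differ between llama.cpp versions, so treat this as a sketch):

python convert-hf-to-gguf.py $model --outtype f16 --outfile model-f16.gguf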

NousResearch org

Is this all good now? We resized the vocab to a multiple of 32 even though we only added 2 tokens, because it causes fewer issues with tensor parallelism and should make model inference faster.
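
For reference, the resize can be done in transformers along these lines; the model path and the two example tokens below are placeholders, not necessarily the exact ones used here:

# Sketch: add 2 tokens, then pad the embedding matrix up to the next
# multiple of 32 (fewer tensor-parallelism headaches, slightly faster inference).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/base-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

tokenizer.add_tokens(["<|im_start|>", "<|im_end|>"], special_tokens=True)  # example tokens
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=32)

model.save_pretrained("resized-model")
tokenizer.save_pretrained("resized-model")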

It was fine to begin with. All anyone had to do to convert the model to GGUF successfully was use --pad-vocab. That resolves the mismatch in vocab sizes and lets the conversion complete. 😁

teknium changed discussion status to closed
