GGUF files have tokenizer issues

#1
by JohannesGaessler - opened

The models in this repository appear to have tokenizer issues (see https://github.com/ggerganov/llama.cpp/pull/6936#issuecomment-2107368738), which degrade generation quality. The problem is indicated by the following warning when running the models:

llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
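The warning fires because the GGUF file is missing the `tokenizer.ggml.pre` metadata key, which tells llama.cpp which pre-tokenizer regex to apply; without it, the loader falls back to `'default'`. As a rough illustration of what the check looks for, here is a minimal sketch that builds an in-memory GGUF-style header and scans its metadata for that key. It is simplified (it handles only string-valued metadata, whereas real GGUF files use many value types); for a real file you would use the `gguf` Python package from the llama.cpp repo, or simply reconvert the model with an up-to-date conversion script.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"
GGUF_TYPE_STRING = 8  # string value type per the GGUF spec

def write_string(buf, s):
    data = s.encode("utf-8")
    buf.write(struct.pack("<Q", len(data)))  # uint64 length prefix
    buf.write(data)

def build_minimal_gguf(metadata):
    # Build just the header + metadata KV section, no tensors.
    # All values are written as GGUF strings (sketch-only limitation).
    buf = io.BytesIO()
    buf.write(GGUF_MAGIC)
    buf.write(struct.pack("<I", 3))              # GGUF version 3
    buf.write(struct.pack("<Q", 0))              # tensor count
    buf.write(struct.pack("<Q", len(metadata)))  # metadata KV count
    for key, value in metadata.items():
        write_string(buf, key)
        buf.write(struct.pack("<I", GGUF_TYPE_STRING))
        write_string(buf, value)
    return buf.getvalue()

def read_string(buf):
    (n,) = struct.unpack("<Q", buf.read(8))
    return buf.read(n).decode("utf-8")

def find_pre_tokenizer(blob):
    # Scan the metadata KV pairs for 'tokenizer.ggml.pre'.
    buf = io.BytesIO(blob)
    assert buf.read(4) == GGUF_MAGIC, "not a GGUF blob"
    struct.unpack("<I", buf.read(4))   # version (unused here)
    struct.unpack("<Q", buf.read(8))   # tensor count (unused here)
    (n_kv,) = struct.unpack("<Q", buf.read(8))
    for _ in range(n_kv):
        key = read_string(buf)
        (vtype,) = struct.unpack("<I", buf.read(4))
        assert vtype == GGUF_TYPE_STRING, "sketch handles strings only"
        value = read_string(buf)
        if key == "tokenizer.ggml.pre":
            return value
    return None  # missing: llama.cpp would fall back to 'default'

blob = build_minimal_gguf({"tokenizer.ggml.pre": "llama-bpe"})
print(find_pre_tokenizer(blob))  # llama-bpe
```

A file converted before the BPE fix would simply lack the key, and `find_pre_tokenizer` would return `None`, matching the "missing pre-tokenizer type" warning above.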
Quant Factory org

@JohannesGaessler I'll look into this and get back to you. V1 was converted before the BPE fix; this version was converted after it.