Incomplete tokenizer conversion?

#3
by anisse

This GGUF conversion does not seem to have the same properties as other Llama-based BPE tokenizers. In particular, many ASCII and valid Unicode characters are impossible to tokenize. I created a simple program to illustrate the issue:
https://github.com/ggerganov/llama.cpp/pull/6988
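
For reference, here is a minimal sketch of the same character-sweep idea in Python (not the program from the PR above, which targets llama.cpp directly), using llama-cpp-python; the local model path is an assumption:

```python
from llama_cpp import Llama

# vocab_only=True loads just the tokenizer data, not the model weights.
llm = Llama(model_path="croissantllmchat-v0.1.Q8_0.gguf", vocab_only=True)

for codepoint in range(0x20, 0x250):  # printable ASCII plus Latin supplements
    ch = chr(codepoint)
    # As noted below, llama.cpp may abort here instead of returning an
    # <unk> token when a character cannot be tokenized.
    ids = llm.tokenize(ch.encode("utf-8"), add_bos=False)
    if not ids:
        print(f"U+{codepoint:04X} {ch!r} produced no tokens")
```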

This also exposes a limitation of llama.cpp: when it cannot tokenize something, it does not fall back to the <unk> token; it crashes.
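
For contrast, here is what a Llama-style SentencePiece tokenizer normally does with input outside its vocabulary: byte fallback decomposes any UTF-8 sequence into <0xNN> byte tokens, so neither a crash nor <unk> should be needed. The repo below is an ungated copy of the Llama tokenizer, chosen only for illustration:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
# Expected: something like ['▁', '<0xF0>', '<0x9F>', '<0xA6>', '<0x9C>']
print(tok.tokenize("🦜"))
```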

I haven't verified whether this croissantllm-chat tokenizer limitation is specific to this GGUF conversion or whether it is also present in the original model.
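
One way to check would be to probe both tokenizers over the same characters. A hedged sketch, assuming llama-cpp-python and the HF repo name croissantllm/CroissantLLMChat-v0.1; each GGUF probe runs in a subprocess so that a llama.cpp assertion failure cannot kill the whole script:

```python
import subprocess
import sys

from transformers import AutoTokenizer

GGUF_PATH = "croissantllmchat-v0.1.Q8_0.gguf"   # assumed local path
HF_REPO = "croissantllm/CroissantLLMChat-v0.1"  # assumed original repo

hf_tok = AutoTokenizer.from_pretrained(HF_REPO)

# Child process: tokenize one character with the GGUF model and exit.
probe = (
    "from llama_cpp import Llama; import sys; "
    f"llm = Llama(model_path={GGUF_PATH!r}, vocab_only=True, verbose=False); "
    "llm.tokenize(sys.argv[1].encode('utf-8'), add_bos=False)"
)

for ch in ["a", "é", "€", "中"]:  # arbitrary sample characters
    hf_ids = hf_tok.encode(ch, add_special_tokens=False)
    res = subprocess.run([sys.executable, "-c", probe, ch], capture_output=True)
    print(f"{ch!r}: original -> {hf_ids}, gguf survives: {res.returncode == 0}")
```

If the original tokenizer returns real token ids for a character that makes the GGUF probe die, the regression would be in the conversion rather than upstream.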

Any updates on this? I can't even load the Croissant GGUF models with llama.cpp right now. I'm trying to load croissantllmchat-v0.1.Q8_0.gguf, but with no success.
