Incomplete tokenizer conversion?

#3
by anisse

This GGUF conversion does not seem to have the same properties as other Llama-based BPE tokenizers. In particular, many ASCII and valid Unicode characters are impossible to tokenize. I created a simple program to illustrate the issue:
https://github.com/ggerganov/llama.cpp/pull/6988
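
For reference, here is a minimal sketch of the same character-sweep idea in Python (not the program from the PR above, which targets llama.cpp directly), using llama-cpp-python; the local model path is an assumption:

```python
from llama_cpp import Llama

# vocab_only=True loads just the tokenizer data, not the model weights.
llm = Llama(model_path="croissantllmchat-v0.1.Q8_0.gguf", vocab_only=True)

for codepoint in range(0x20, 0x250):  # printable ASCII plus Latin supplements
    ch = chr(codepoint)
    # As noted below, llama.cpp may abort here instead of returning an
    # <unk> token when a character cannot be tokenized.
    ids = llm.tokenize(ch.encode("utf-8"), add_bos=False)
    if not ids:
        print(f"U+{codepoint:04X} {ch!r} produced no tokens")
```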

This also exposes a limitation of llama.cpp: when it cannot tokenize something, it does not fall back to the <unk> token; it crashes.
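
For contrast, here is what a Llama-style SentencePiece tokenizer normally does with input outside its vocabulary: byte fallback decomposes any UTF-8 sequence into <0xNN> byte tokens, so neither a crash nor <unk> should be needed. The repo below is an ungated copy of the Llama tokenizer, chosen only for illustration:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
# Expected: something like ['▁', '<0xF0>', '<0x9F>', '<0xA6>', '<0x9C>']
print(tok.tokenize("🦜"))
```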

I haven't verified whether this croissantllm-chat tokenizer limitation is specific to this GGUF conversion or whether it is also present in the original model.
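
One way to check would be to probe both tokenizers over the same characters. A hedged sketch, assuming llama-cpp-python and the HF repo name croissantllm/CroissantLLMChat-v0.1; each GGUF probe runs in a subprocess so that a llama.cpp assertion failure cannot kill the whole script:

```python
import subprocess
import sys

from transformers import AutoTokenizer

GGUF_PATH = "croissantllmchat-v0.1.Q8_0.gguf"   # assumed local path
HF_REPO = "croissantllm/CroissantLLMChat-v0.1"  # assumed original repo

hf_tok = AutoTokenizer.from_pretrained(HF_REPO)

# Child process: tokenize one character with the GGUF model and exit.
probe = (
    "from llama_cpp import Llama; import sys; "
    f"llm = Llama(model_path={GGUF_PATH!r}, vocab_only=True, verbose=False); "
    "llm.tokenize(sys.argv[1].encode('utf-8'), add_bos=False)"
)

for ch in ["a", "é", "€", "中"]:  # arbitrary sample characters
    hf_ids = hf_tok.encode(ch, add_special_tokens=False)
    res = subprocess.run([sys.executable, "-c", probe, ch], capture_output=True)
    print(f"{ch!r}: original -> {hf_ids}, gguf survives: {res.returncode == 0}")
```

If the original tokenizer returns real token ids for a character that makes the GGUF probe die, the regression would be in the conversion rather than upstream.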

Any updates on this? I can't even load the Croissant GGUF models with llama.cpp right now. I'm trying to load croissantllmchat-v0.1.Q8_0.gguf, but with no success.
