pre_tokenizer identifier mismatch with llama.cpp: "gpt2" not recognized, "gpt-2" expected

#2
by deuxenun - opened

llama.cpp recognizes only "gpt-2" as the value of the pre-tokenizer field (tokenizer.ggml.pre), but the GGUF file has "gpt2". Adding "gpt2" as an accepted value here https://github.com/ggml-org/llama.cpp/blob/0d18aaa9d1a8af3df9abccd828e22eeaac7f840b/src/llama-vocab.cpp#L2082 seems to work with basic prompts ("What is the capital of France", etc.).

I suggest updating the GGUF file to set tokenizer.ggml.pre = "gpt-2" so that it is compatible with llama.cpp.
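
For anyone who wants to check which value their copy of the file carries, here is a minimal sketch that reads tokenizer.ggml.pre straight from the GGUF header using only the Python standard library. It assumes a little-endian GGUF v2 or v3 file laid out as in the public GGUF spec; the function name read_pre_tokenizer and the path model.gguf are made up for illustration and are not part of llama.cpp or the gguf package.

```python
import struct

# GGUF scalar value type ids mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size value from the binary stream."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """Read a GGUF string: uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    data = f.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of file")
    return data.decode("utf-8", errors="replace")


def _read_value(f, vtype):
    """Read one metadata value of the given GGUF type id."""
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_pre_tokenizer(f):
    """Return tokenizer.ggml.pre from an open binary stream, or None if absent.

    Hypothetical helper: assumes little-endian GGUF version 2 or later.
    """
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    if _read(f, "<I") < 2:
        raise ValueError("GGUF v1 is not handled by this sketch")
    _read(f, "<Q")  # tensor count, not needed here
    kv_count = _read(f, "<Q")
    for _ in range(kv_count):
        key = _read_string(f)
        value = _read_value(f, _read(f, "<I"))
        if key == "tokenizer.ggml.pre":
            return value
    return None


# Usage (model.gguf is a placeholder path):
# with open("model.gguf", "rb") as f:
#     print(read_pre_tokenizer(f))
```

The sketch only reads the metadata and stops at the first match, so it never touches the tensor data. Per the report above, a file that prints gpt2 here would need the value changed to gpt-2 before an unpatched llama.cpp accepts it.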

AxionLab-official changed discussion status to closed
