Possible issue with tokenizer
I've been using llama.cpp to quantize these models (the 2b variants so far) with a robust dataset (OSCAR) for the imatrix, and now I am on the 7b. For the first time, I am getting these errors (specifically, I searched for the last one):
```
llm_load_vocab: control token: 99 '<|reserved_token_94|>' is not marked as EOG
...
llm_load_vocab: control token: 34 '<|reserved_token_29|>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
```
These are essentially warnings, and looking around the site I see that this last error sometimes appears for other public models, so maybe this is all okay. But I wanted to raise the issue in case it matters.
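In case it is useful for comparison, here is a rough way to dump what the Hugging Face-side configs actually declare as EOS / end-of-generation, so it can be matched against what llama.cpp warns about. The filenames are the usual ones shipped with HF checkpoints and may need adjusting:

```python
import json

# Assumed paths: the standard files shipped with a Hugging Face checkpoint.
with open("tokenizer_config.json") as f:
    tok_cfg = json.load(f)
with open("generation_config.json") as f:
    gen_cfg = json.load(f)

print("tokenizer eos_token:", tok_cfg.get("eos_token"))
# generation_config.json may carry one or several eos_token_id values;
# this is the kind of metadata the GGUF conversion uses to build its EOG set.
print("generation eos_token_id(s):", gen_cfg.get("eos_token_id"))

# List the control/added tokens so the <|reserved_token_NN|> entries from the
# warnings can be checked against what the config actually declares as special.
added = tok_cfg.get("added_tokens_decoder", {})
for tok_id, info in sorted(added.items(), key=lambda kv: int(kv[0])):
    print(tok_id, repr(info.get("content")), "special" if info.get("special") else "")
```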
I am having issues generating the imatrix. I am fairly convinced at this point that these warnings were introduced by a recent change to llama.cpp: https://github.com/ggerganov/llama.cpp/issues/9899
I am still having issues; it keeps crashing while generating the importance matrix.
I noticed that the default handling for SentencePiece-BPE tokenizers in llama.cpp was not using tokenizer.json, only tokenizer.model. I modified the convert_hf_to_gguf.py script to let it extend the vocab from tokenizer.json, but I still get the same "nan detected in blk.21.attn_output.weight" error when generating the imatrix.
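For context, the modification was along these lines -- not the actual convert_hf_to_gguf.py diff, just a standalone sketch of the idea, with the file paths assumed:

```python
import json
import sentencepiece as spm

# Load the base vocab the converter normally relies on (tokenizer.model only).
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
base_vocab = {sp.id_to_piece(i) for i in range(sp.vocab_size())}

# tokenizer.json carries an "added_tokens" list that tokenizer.model alone
# doesn't know about; anything missing gets appended to the exported vocab.
with open("tokenizer.json") as f:
    tok_json = json.load(f)

for tok in tok_json.get("added_tokens", []):
    if tok["content"] not in base_vocab:
        print(f"extending vocab: id={tok['id']} piece={tok['content']!r} special={tok.get('special')}")
```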
I see that two weeks ago in the 7b-instruct discussion someone had issues (presumably without generating an imatrix), and it was mentioned that GGUF models were incoming. Are you working with llama.cpp to do the conversion? Can you shed any light on my issue?
Presumably that's also why the GGUF quants behave incorrectly with the llama.cpp server in all my tests...
Maybe you could go into more detail about that?
What I can say for sure is that the published tokenizer is the same as the one on the 2b models, and I had no issues with those. Here I went further and verified that every specified token was correctly picked up by llama.cpp, editing the code to check it against the tokenizer config and the underlying vocabulary they published on GitHub. I wonder whether the published tokenizer is not the tokenizer they used to train this particular model. The other possibility is that the grouped-query attention is not handled well by llama.cpp; some of the team were told and briefly participated with me, but they did not indicate it was a problem.
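The cross-check I did looked roughly like this (a standalone sketch, not the llama.cpp code I actually edited; the format of the vocabulary file from GitHub is an assumption and may need a different loader):

```python
import json

with open("tokenizer.json") as f:
    vocab_field = json.load(f)["model"]["vocab"]

# BPE-style tokenizer.json stores the vocab as {piece: id}; Unigram stores
# [[piece, score], ...] -- handle both shapes.
if isinstance(vocab_field, dict):
    hf_vocab = set(vocab_field)
else:
    hf_vocab = {piece for piece, _score in vocab_field}

# Hypothetical filename and format: one piece per line.
with open("vocab_from_github.txt") as f:
    ref_vocab = {line.rstrip("\n") for line in f if line.strip()}

print("only in tokenizer.json:", len(hf_vocab - ref_vocab))
print("only in reference vocab:", len(ref_vocab - hf_vocab))
```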
My issue is that the model (at least when quantized) doesn't follow the instructions. For example, if I ask the model to translate from Dutch into Danish, it returns the same text in the source language. I have now discovered that this happens when the text is longer than roughly 710 tokens (I haven't figured out the exact size)...
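In case anyone wants to reproduce it, this is the kind of probe I'd run against a local llama.cpp server to narrow down the threshold. It assumes llama-server's OpenAI-compatible endpoint on the default port, and the echo check is only a crude heuristic:

```python
import requests

URL = "http://localhost:8080/v1/chat/completions"  # default llama-server port; adjust as needed

def ask_translation(source_text: str) -> str:
    payload = {
        "model": "local",  # placeholder; llama-server generally ignores this field
        "messages": [{
            "role": "user",
            "content": "Translate the following Dutch text into Danish:\n\n" + source_text,
        }],
        "temperature": 0,
    }
    r = requests.post(URL, json=payload, timeout=300)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

# Grow the prompt step by step and watch where the model stops translating.
sentence = "Dit is een korte Nederlandse zin die steeds opnieuw wordt herhaald. "
for n in range(10, 60, 10):
    text = sentence * n
    reply = ask_translation(text)
    # Crude heuristic: if most source words come back verbatim, the model
    # probably echoed the Dutch input instead of translating it.
    src_words = set(text.lower().split())
    out_words = set(reply.lower().split())
    overlap = len(src_words & out_words) / max(len(src_words), 1)
    print(f"{n:3d} repeats -> word overlap with source: {overlap:.0%}")
```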