This model outputs the literal token <0x0A> instead of a newline

#1 by YearZero - opened

I think this was happening to another model in the last few days; I forget which. Instead of a newline it just outputs the literal text <0x0A>

@YearZero are you using GPT4All? My guess is that the Bloke's GGUF files are fine; I'm getting the same from all recent models. I checked GPT4All's GitHub page, and there's a token issue with GGUF that is being fixed in the next update. That update likely includes a fix for this.

The latest llama.cpp does output <0x0A> instead of a regular LF.

It looks like it could be a tokenizer problem, like this model:
https://huggingface.co/TheBloke/Starling-LM-7B-alpha-GGUF/discussions/1#6566495193951c950b3b8c10

Yeah, it's also impacting Notus and is summed up succinctly by @alvarobartt:

https://huggingface.co/argilla/notus-7b-v1/discussions/3#656dc9d802a56b531ade7f73

Yes, I always get this <0x0A> token, and sometimes a series of tokens like this: <0x0A><0x0A><0xF0><0x9F><0x9A><0xB8>
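For anyone curious what that series actually encodes: each <0xNN> is a SentencePiece byte-fallback token, and the bytes decode as ordinary UTF-8. A minimal Python sketch, using the example string from the comment above:

```python
import re

# Literal byte-fallback tokens as they appear in the broken GGUF output
raw = "<0x0A><0x0A><0xF0><0x9F><0x9A><0xB8>"

# Each <0xNN> token stands for a single byte; collect them and decode as UTF-8
data = bytes(int(h, 16) for h in re.findall(r"<0x([0-9A-Fa-f]{2})>", raw))
print(repr(data.decode("utf-8")))  # '\n\n🚸' -> two newlines plus U+1F6B8
```

So the model is generating the right bytes; they're just being rendered as token text instead of being decoded.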

Any solution to this <0x0A> issue?

@zhiboz Yes, the Bloke just fixed this for Notus. Someone needs to bring this to @TheBloke's attention.

Indeed, it was due to the addition of the tokenizer.json file: GGUF expects the SentencePiece (slow) tokenizer rather than the fast (Rust-based) one, which is the transformers default. That's why, when running python convert.py ... from llama.cpp to convert the weights into GGUF, the tokenizer.model is not available, so the tokenization has to be inferred from the vocab file and its format (e.g. BPE for the Mistral-based models). Anyway, it seems that the tokenizer.model was uploaded 9 days ago to berkeley-nest/Starling-LM-7B-alpha (see https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha/blob/main/tokenizer.model), so re-running the GGUF conversion script would do the work 🤗
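If it helps, a minimal sketch of that re-conversion. The local directory name and the exact convert.py invocation are assumptions based on a standard llama.cpp checkout, not something stated in this thread:

```python
from huggingface_hub import hf_hub_download

# Fetch the SentencePiece tokenizer.model that was added to the repo
# (local_dir is a hypothetical path; point it at your model directory)
hf_hub_download(
    repo_id="berkeley-nest/Starling-LM-7B-alpha",
    filename="tokenizer.model",
    local_dir="models/starling-lm-7b-alpha",
)

# Then re-run llama.cpp's conversion script against that directory, e.g.:
#   python convert.py models/starling-lm-7b-alpha --outtype f16
```

With tokenizer.model present, convert.py picks up the SentencePiece vocab directly instead of inferring it, and the resulting GGUF should emit real newlines again.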
