Non-instruct models are missing the tokenizer.json files

#1
by elsatch - opened

Hi!

I am working on adding support for Salamandra in Llama.cpp. For some reason, only the instruct models have a tokenizer.json file. Both non-instruct models lack it, so GGUF conversion fails.
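For reference, here is a quick check of which repos ship a tokenizer.json (the BSC-LT repo ids below are my guess from the model names, so adjust if they're wrong):

```python
from huggingface_hub import list_repo_files

# Assumed repo ids; swap in the actual ones if these are wrong.
for repo in (
    "BSC-LT/salamandra-2b",
    "BSC-LT/salamandra-7b",
    "BSC-LT/salamandra-2b-instruct",
    "BSC-LT/salamandra-7b-instruct",
):
    files = list_repo_files(repo)
    print(repo, "->", "tokenizer.json" in files)
```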

Would you be so kind as to upload those files to salamandra-7b and salamandra-2b?

Thanks!

Language Technologies Unit @ Barcelona Supercomputing Center org

Hi!

Since this is a SentencePiece model and we weren't aware that GGUF conversion requires a different format, we didn't upload a tokenizer.json initially. Apologies for that! We've now uploaded the tokenizer.json.
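In case it helps anyone with other SentencePiece checkpoints, here is a rough sketch of regenerating tokenizer.json locally (assuming transformers can auto-convert the slow tokenizer, which requires the sentencepiece package):

```python
from transformers import AutoTokenizer

# Loading with the default use_fast=True converts the SentencePiece model
# (tokenizer.model) to the tokenizers-library format on the fly.
tok = AutoTokenizer.from_pretrained("BSC-LT/salamandra-7b")  # assumed repo id

# save_pretrained then writes tokenizer.json alongside the config files.
tok.save_pretrained("salamandra-7b-tokenizer")
```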
If you encounter any other issues, please feel free to let us know.

Thanks for the heads-up!

joanllop changed discussion status to closed

Hi Joan,

I wasn't aware either, but convert_hf_to_gguf_update.py tries to download all the files from the repo and fails if no tokenizer.json is present: https://github.com/ggerganov/llama.cpp/blob/c81f3bbb051f8b736e117dfc78c99d7c4e0450f6/convert_hf_to_gguf_update.py#L117

In theory, saving the tokenizer should produce all of these files, but in the GitHub repo there is only the vocab.json file.

Best,
César

I believe I downloaded the tokenizer manually when I saw it was missing, but I still got the error I reported.

By the way (I realize this is now closed and César has his quant up already), the link above about convert_hf_to_gguf.py lacking the slow tokenizer path is no longer valid. I just submitted a PR to llama.cpp changing that file for LlamaForCausalLM models to ensure it captures all added_tokens (which may or may not actually be wanted, but in v3927 specifically there were warnings, which I found interesting), so I'm familiar with the current code. I can attest that the function has changed to support the "slow" method :)
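For the curious, the added_tokens my PR deals with live in a top-level section of tokenizer.json; a minimal sketch of inspecting them (the local file path is assumed):

```python
import json

# Assumes a tokenizer.json in the current directory.
with open("tokenizer.json") as f:
    data = json.load(f)

# Each entry carries the token id, its text, and whether it is special.
for entry in data.get("added_tokens", []):
    print(entry["id"], repr(entry["content"]),
          "special" if entry.get("special") else "")
```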
