Prompt format <bos> not needed?

#3 · opened by eamag

I noticed llama.cpp already adds <bos> at the beginning, and if I use the suggested prompt format from the readme here it warns me about adding it twice. Should the model card be updated? What is the actual prompt format for these weights? Thank you!

I just spit out what's in the tokenizer_config.json chat template. If llama.cpp is adding one on its own, then you shouldn't include it yourself, that's correct. Teeeechnically what I have in the model card is the most accurate representation of what the prompt looks like, but I could remove the bos token to avoid confusion.
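For illustration, this is roughly what a prompt looks like when you let llama.cpp insert the bos token itself (the model filename here is just a placeholder):

```sh
# llama-cli adds <bos> on its own based on the GGUF metadata, so the prompt string
# starts directly at the first user turn. -e turns the \n escapes into real newlines.
./llama-cli -m gemma-2-27b-it-Q4_K_M.gguf -e \
  -p "<start_of_turn>user\nWhy is the sky blue?<end_of_turn>\n<start_of_turn>model\n"
```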

Just a heads up, there seem to be some serious issues with this model regardless of whether you use the template correctly or not. In my testing it performs significantly worse than the 9b version, so much worse that there's clearly something fundamentally wrong, and I've seen many others report the same experience. An issue has been created on the official repo, and Google states they are currently investigating it.

@Mikael110 did you also test my quants uploaded here ~10 hours ago? Just curious, since there was an update that fixed some issues, so I'm wondering if the problem is still there or if it was only in the earlier upload.

[edit: I didn't get the warning because I was using --in-suffix to add the extra bos token]
llama.cpp is supposed to warn you now if you've accidentally doubled up the bos token for a particular model (I think the gguf metadata can be set to add one?), so for this model it seems you need to add the bos token manually (because I don't get a warning when I do).
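(If you want to see exactly what's being fed to the model, --verbose-prompt prints the tokenized prompt, so you can check whether the bos token shows up once, twice, or not at all. The model filename below is just a placeholder.)

```sh
# Print the tokenized prompt before generating, to spot a missing or doubled <bos>.
./llama-cli -m gemma-2-27b-it-Q4_K_M.gguf --verbose-prompt -e -n 1 \
  -p "<start_of_turn>user\nHello<end_of_turn>\n<start_of_turn>model\n"
```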

But yeah, it also seems to be broken in a different way.

> @Mikael110 did you also test my quants uploaded here ~10 hours ago? Just curious, since there was an update that fixed some issues, so I'm wondering if the problem is still there or if it was only in the earlier upload.

Yes, I did. And I verified that the tokenizer was correct by using the llama-tokenizer utility. So I know for certain it's not the tokenizer at fault.

Edit: A likely cause has been identified. Gemma 2 was trained with logit soft-capping (squashing the attention and final logits through a tanh so they can't exceed a fixed cap), but because that is incompatible with Flash Attention, Google decided to disable it for libraries like Transformers. They thought it would have a relatively minor performance impact, but it turns out that for the larger models it actually matters a lot.

Transformers has just merged a fix, and llama.cpp will likely follow soon.

Using an up-to-date llama.cpp pull, running
```sh
convert-hf-to-gguf.py --outtype bf16
```
from an ooba venv gave me a usable bf16 GGUF, and
```sh
llama-quantize gemma2-27b-bf16.gguf gemma2-27b-q8.gguf Q8_0
```
gave me another usable GGUF, now at Q8.
Inference with that same PR branch produces coherent output in line with the model size.
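For completeness, the convert script also wants the path to the HF checkpoint, so the full commands look roughly like this (the directory and file names are placeholders for wherever you keep the model):

```sh
# ./gemma-2-27b-it is assumed to be the HF-format model directory.
python convert-hf-to-gguf.py ./gemma-2-27b-it --outtype bf16 --outfile gemma2-27b-bf16.gguf
./llama-quantize gemma2-27b-bf16.gguf gemma2-27b-q8.gguf Q8_0
```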

Thank you @cpumaxx .

@bartowski , I think this model repo needs another update with the latest llama.cpp quantization. This might be your third upload for Gemma 2 27B, but hopefully this will be the final version that works accurately. Thank you!

@cpumaxx which llama.cpp pull are you using? Based on this discussion, llama.cpp may not have that patch yet, right?

I was using https://github.com/pculliton/llama.cpp/ which was merged 14 hours ago

These are also based on the PR that was merged 14 hours ago, so that can't be it. My tests also produce "coherent" results, but they tend to generate endlessly, and I wouldn't be surprised if that's because of the missing soft-cap.

Does this model contain censorship?

There is a PR for llama.cpp that fixes the soft-cap, and it requires regenerating the GGUFs: https://github.com/ggerganov/llama.cpp/pull/8197
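Once that lands, one way to check whether a given file was converted with the new metadata is to dump its keys. The exact script path can differ between llama.cpp versions and the model filename is a placeholder, but something along these lines should work:

```sh
# If the file was made with the fixed converter, this should show the new
# gemma2 soft-capping keys alongside the rest of the metadata.
python gguf-py/scripts/gguf-dump.py gemma2-27b-q8.gguf | grep -i softcap
```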

I'll remake when it's merged 😅

> I'll remake when it's merged 😅

Just wanted to thank you for keeping up on this. Most GGUF repos are never updated, despite breakage being common for new model archs like this. Without you keeping the repo up to date, a lot of people would never even get a chance to play around with the fixed models, which would be a great shame, especially in this case since the models actually seem quite promising.

Interesting note: I didn't get the warning because I was using --in-suffix, etc. I got the warning when I used -p.

Looks like with the latest changes you won't necessarily need a remade GGUF, since they're setting default values. But I assume my imatrix quality will improve: if there are issues with inference, there are issues measuring the imatrix. Plus it would be nice to have the proper metadata.

As for that --in-suffix fix, that's pretty interesting
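Once a fixed build is out, regenerating the importance matrix and the quants made from it is just a matter of rerunning the usual steps against the bf16 GGUF. A rough sketch, with placeholder file names:

```sh
# Measure the importance matrix on some calibration text with the fixed build,
# then feed it into quantization.
./llama-imatrix -m gemma2-27b-bf16.gguf -f calibration.txt -o gemma2-27b.imatrix
./llama-quantize --imatrix gemma2-27b.imatrix gemma2-27b-bf16.gguf gemma2-27b-IQ4_XS.gguf IQ4_XS
```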
