Hallucinations, misspellings, etc. Something seems broken?

#10
by sam-paech - opened

I'm trying to run a few benchmarks with this model and it's not behaving: I'm seeing hallucinations, misspellings, poor instruction following, and bad benchmark scores. I've tried gemma-2-9b-it and it works fine.

So far I've tried running inference with:

  • Full 16-bit precision (transformers)
  • Bitsandbytes 8-bit
  • All the GGUFs available (via llama.cpp)
  • The Hugging Face Pro API

I'm using the tokenizer's chat template.
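For reference, the transformers setup is roughly the sketch below (a minimal sketch, not my exact harness; the prompt is a placeholder, and bfloat16 is shown here since float16 is discussed further down):

```python
# Minimal sketch: load gemma-2-27b-it in transformers and prompt it via the chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-27b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bfloat16 here; float16 is discussed below
    device_map="auto",
)

# Build the prompt with the tokenizer's chat template (placeholder message).
messages = [{"role": "user", "content": "Write a limerick about benchmarks."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```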

----eq-bench Benchmark Complete----
2024-06-28 10:46:02
Time taken: 10.1 mins
Prompt Format:
Model: google/gemma-2-27b-it
Score (v2): 49.16
Parseable: 129.0
! eq-bench Benchmark Failed
Google org

Yes, we are investigating what went wrong! Note that float16 should not be used for this model.

> Yes, we are investigating what went wrong! Note that float16 should not be used for this model.

Is it okay to use int4 and bfloat16 for inference?

Google org

bfloat16 should be fine for inference, but I haven't tested int4, so I'm not sure what the quality will be like.
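For reference, 4-bit loading with bitsandbytes while keeping the compute dtype in bfloat16 looks roughly like this (a sketch only; as noted above, int4 quality is untested):

```python
# Sketch: 4-bit quantized loading with bitsandbytes, compute kept in bfloat16.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # avoid float16 compute
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    quantization_config=bnb_config,
    device_map="auto",
)
```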

> Yes, we are investigating what went wrong! Note that float16 should not be used for this model.

I tried with float16 and it only output pad tokens. I changed it to bfloat16 and it works fine!

> bfloat16 should be fine for inference, but I haven't tested int4, so I'm not sure what the quality will be like.

Thank you.
I'm trying it locally with int4 and the output quality takes quite a hit, at least in Korean. (My experience with it in AI Studio was pretty good.)
It can't follow instructions very well and the tokens come out squashed together.
Still, the experience of running the 27B model locally is great!
A single-turn conversation uses about 18GB of VRAM.
Longer context lengths can exceed the RTX 4090's 24GB VRAM.

@sam-paech does 'all the GGUFs available' include the ones I posted this morning? Just want to double check.

> @sam-paech does 'all the GGUFs available' include the ones I posted this morning? Just want to double check.

Yep. It went:
transformers 16-bit > bitsandbytes 8-bit > GGUF Q8_0 (yours and others')

in rough order of quality. They all seem broken to varying degrees, though.

Interesting, I wonder if it's the lack of logit soft-capping or if something else is playing a role.

Is the transformers bf16 totally fine, or does it also experience unexpected degradation?

> Interesting, I wonder if it's the lack of logit soft-capping or if something else is playing a role.
>
> Is the transformers bf16 totally fine, or does it also experience unexpected degradation?

It wasn't fine. I'm about to test the latest patch though, will let you know.

> Interesting, I wonder if it's the lack of logit soft-capping or if something else is playing a role.
>
> Is the transformers bf16 totally fine, or does it also experience unexpected degradation?

OK, it seems fixed as of the latest transformers patch:

https://github.com/huggingface/transformers/pull/31698

We will have to wait for a llama.cpp patch.

So that must be it then, it's the soft-cap.

I wonder how easy that is to implement.
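For anyone wondering what the soft-cap being discussed actually does: Gemma 2 applies a tanh-based cap to its attention and final logits, roughly like the sketch below (illustrative only, not the library implementation):

```python
# Illustrative tanh soft-capping as used by Gemma 2 (not the actual library code).
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly squashes logits into (-cap, cap) instead of letting them grow unbounded.
    return cap * torch.tanh(logits / cap)

# Gemma 2's config uses a final-logit cap of 30.0 and an attention-logit cap of 50.0.
capped = soft_cap(torch.randn(2, 256_000) * 100, cap=30.0)
print(capped.abs().max())  # always < 30.0
```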

I'm getting [end of text] after a reasonable length most of the time.
I have had some generations that just seem to keep going, but those are outliers.

sam-paech changed discussion status to closed

There is a PR for llama.cpp to fix the soft-cap, and it requires regenerating the GGUFs: https://github.com/ggerganov/llama.cpp/pull/8197
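Once the fix lands and the GGUFs are regenerated, loading them should look the same as before; a minimal llama-cpp-python sketch (the model path and settings are placeholders):

```python
# Sketch: run a regenerated GGUF via llama-cpp-python (path and settings are placeholders).
from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-2-27b-it-Q8_0.gguf",  # placeholder path to a regenerated GGUF
    n_ctx=8192,
    n_gpu_layers=-1,  # offload all layers to the GPU if it fits
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise the soft-cap fix in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```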

Run-on generations happen in HuggingChat too.

sam-paech changed discussion status to open

Reopening, since it's not yet fixed in llama.cpp and people may be wondering.

The instruction-following capability of the bfloat16 version on vLLM seems poor. gemma-2-9b-it doesn't have such issues. @suryabhupa

Update maybe? It's been 25 days and the model is still broken...

Looks like there were three PRs with Gemma fixes merged into llama.cpp three weeks ago. vLLM v0.5.1 was also released around the same time with Gemma 2 support.
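For anyone retrying with the newer vLLM release, a minimal sketch of the setup being discussed (dtype and sampling values are examples; the prompt is written in Gemma's turn format by hand):

```python
# Sketch: offline generation with vLLM in bfloat16 (values are examples, not recommendations).
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-27b-it", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=256)

# Gemma chat turn format, written out manually.
prompt = "<start_of_turn>user\nWrite a haiku about GPUs.<end_of_turn>\n<start_of_turn>model\n"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```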

I'm using this with TGI, bfloat16, and the chat template from AutoTokenizer in transformers. The output quality is very poor compared to the same settings with gemma-2-9b-it.
What's wrong with this version?

Is this issue resolved?
Use eager mode for attention: attn_implementation="eager".
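In transformers, that suggestion looks roughly like this when loading the model (a sketch; the eager path applies the logit soft-capping that some attention backends skip):

```python
# Sketch: load the model with eager attention, as suggested above.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # eager attention applies Gemma 2's logit soft-capping
    device_map="auto",
)
```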

FYI

Closing as the above issues seem to be resolved.

sam-paech changed discussion status to closed
