Unable to reproduce the pass@1 score of gemma_2b on HumanEval.
I want to reproduce the HumanEval pass@1 score of gemma_2b. When I tested with the generation parameters below, I got a score of 0.11.
```python
completion = model.generate(input_ids=inputs["input_ids"],
                            max_length=512,
                            num_return_sequences=1,
                            eos_token_id=tokenizer.eos_token_id)
```
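(For completeness: before executing the tests I truncate each completion at the first top-level stop sequence, as in the sketch below. The exact stop list is my assumption and follows the human-eval harness convention.)

```python
# Truncate a raw completion at the first top-level stop sequence before
# running the unit tests. The stop list follows the human-eval harness
# convention and is an assumption about the intended setup.
STOP_SEQUENCES = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]

def truncate_completion(text: str) -> str:
    cut = len(text)
    for stop in STOP_SEQUENCES:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]
```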
After modifying the parameters as shown below, I got 0.14.
```python
completion = model.generate(input_ids=inputs["input_ids"],
                            max_length=512,
                            num_return_sequences=1,
                            eos_token_id=tokenizer.eos_token_id,
                            do_sample=True,
                            temperature=0.7,
                            top_k=50,
                            top_p=0.95)  # increase the value of max_length
```
I would like to know how to modify the parameters to achieve 0.22.
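For reference, the greedy-decoding variant I would try next looks like this (a sketch; switching from max_length to max_new_tokens is my own guess, since max_length also counts the prompt tokens and can silently truncate long completions):

```python
# Greedy decoding with a completion-only token budget (sketch; the switch
# to max_new_tokens is an assumption, since max_length includes the prompt).
completion = model.generate(input_ids=inputs["input_ids"],
                            max_new_tokens=512,
                            do_sample=False,
                            num_return_sequences=1,
                            eos_token_id=tokenizer.eos_token_id)
```

With greedy decoding, pass@1 reduces to the fraction of problems whose truncated completion passes all tests.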
Hi, Surya from the Gemma team here. Sorry for the late response -- we haven't fully open-sourced our internal evaluation harness, and it's interesting that the numbers you're finding are lower... we'll look into it!
Hey Surya, I've also been unable to get 0.22; I get 0.11 with greedy decoding. Could you let us know at least the sampling parameters and/or prompts used? Thanks!
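(In case the gap is in scoring rather than decoding: I compute pass@1 with the unbiased estimator from the Codex paper, which to my understanding mirrors the human-eval reference implementation; sketch below.)

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper:
    n sampled completions per problem, c of them correct."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```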
hey @suryabhupa, any chance you have an update on this? We are also unable to replicate the scores for HumanEval and other benchmarks.