HumanEval test data might have leaked into training data

#3
by matorus - opened

It seems like the canonical solutions from the HumanEval test dataset have leaked into the training data.
In my case, 57% of the canonical solutions are contained in the code completions generated by the model.
For other models this value is well below 10%.

replit_glaive: 56.71%
replit: 7.32%
wizard: 4.88%

Here is the validation code: https://github.com/torusresearch/code-eval/blob/main/validate.py
(You need to run python eval_[model_name].py before running python validate.py. See repository README.)
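
For reference, the core of the check is just a substring test of each (whitespace-stripped) canonical solution against the completions generated for that task. Below is a minimal sketch of that idea, not the exact validate.py code; the sample-file layout (a JSONL of `{"task_id", "completion"}` records) and the file path are assumptions and may differ from the actual code-eval output.

```python
# Sketch of the containment check, assuming the generated samples are stored as
# a JSONL file of {"task_id": ..., "completion": ...} records (hypothetical path).
import json
from human_eval.data import read_problems  # OpenAI's human-eval package

def containment_rate(samples_path: str) -> float:
    problems = read_problems()  # task_id -> problem dict, incl. "canonical_solution"

    # Collect all completions per task_id.
    completions = {}
    with open(samples_path) as f:
        for line in f:
            r = json.loads(line)
            completions.setdefault(r["task_id"], []).append(r["completion"])

    # A task counts as "leaked" if its stripped canonical solution appears
    # verbatim in any completion generated for that task.
    leaked = 0
    for task_id, problem in problems.items():
        solution = problem["canonical_solution"].strip()
        if any(solution in c for c in completions.get(task_id, [])):
            leaked += 1

    return 100.0 * leaked / len(problems)

print(f"{containment_rate('results/replit_glaive/eval.jsonl'):.2f}%")
```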

Interesting. I'm working on publishing the dataset soon as well, so it would be good to analyse it for the canonical solutions, because I couldn't find any exact matches.

Note that I stripped leading and trailing whitespace from the canonical solutions before matching:
if problem["canonical_solution"].strip() in r["completion"]:
Otherwise nothing would match. (I think this is because the canonical solutions contain additional newlines at the end which are not present in the generated code.)
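
To illustrate why the stripping matters (with made-up strings, not actual HumanEval data):

```python
# Hypothetical example: canonical solutions keep their indentation and a
# trailing newline, which a model completion usually does not reproduce exactly.
canonical = "    return a + b\n\n"
completion = "def add(a, b):\n    return a + b"

print(canonical in completion)          # False: indentation/newlines don't match
print(canonical.strip() in completion)  # True: matches after stripping whitespace
```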

(image attachment: image.png)

It's almost the same. Can we verify the dataset integrity independently?

Any update on this?
