HumanEval test data might have leaked into training data

#3
by matorus - opened

It seems like the canonical solutions from the HumanEval test dataset have leaked into the training data.
In my case, 57% of the canonical solutions are contained in the code completions generated by the model.
For other models this value is well below 10%.

replit_glaive: 56.71%
replit: 7.32%
wizard: 4.88%

Here is the validation code: https://github.com/torusresearch/code-eval/blob/main/validate.py
(You need to run python eval_[model_name].py before running python validate.py. See repository README.)
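
For reference, the core of the check is just a substring test of each (whitespace-stripped) canonical solution against the completions generated for that task. Below is a minimal sketch of that idea, not the exact validate.py code; the sample-file layout (a JSONL of `{"task_id", "completion"}` records) and the file path are assumptions and may differ from the actual code-eval output.

```python
# Sketch of the containment check, assuming the generated samples are stored as
# a JSONL file of {"task_id": ..., "completion": ...} records (hypothetical path).
import json
from human_eval.data import read_problems  # OpenAI's human-eval package

def containment_rate(samples_path: str) -> float:
    problems = read_problems()  # task_id -> problem dict, incl. "canonical_solution"

    # Collect all completions per task_id.
    completions = {}
    with open(samples_path) as f:
        for line in f:
            r = json.loads(line)
            completions.setdefault(r["task_id"], []).append(r["completion"])

    # A task counts as "leaked" if its stripped canonical solution appears
    # verbatim in any completion generated for that task.
    leaked = 0
    for task_id, problem in problems.items():
        solution = problem["canonical_solution"].strip()
        if any(solution in c for c in completions.get(task_id, [])):
            leaked += 1

    return 100.0 * leaked / len(problems)

print(f"{containment_rate('results/replit_glaive/eval.jsonl'):.2f}%")
```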

Interesting. I'm working on publishing the dataset soon as well, so it would be good to analyse it for the canonical solutions, because I couldn't find any exact matches.

Note that I stripped leading and trailing whitespace from the canonical solutions before matching:
if problem["canonical_solution"].strip() in r["completion"]:
Otherwise nothing would match. (I think this is because the canonical solutions contain additional newlines at the end which are not present in the generated code.)
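
To illustrate why the stripping matters (with made-up strings, not actual HumanEval data):

```python
# Hypothetical example: canonical solutions keep their indentation and a
# trailing newline, which a model completion usually does not reproduce exactly.
canonical = "    return a + b\n\n"
completion = "def add(a, b):\n    return a + b"

print(canonical in completion)          # False: indentation/newlines don't match
print(canonical.strip() in completion)  # True: matches after stripping whitespace
```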

(image attachment: image.png)

It's almost the same. Can we verify the dataset integrity independently?

Any update on this?
