HumanEval test data might have leaked into training data
It seems like the canonical solutions from the HumanEval test dataset have leaked into the training data.
In my case, 57% of the canonical solutions are contained in the code completions generated by the model.
For other models this value is well below 10%.
- `replit_glaive`: 56.71%
- `replit`: 7.32%
- `wizard`: 4.88%
Here is the validation code: https://github.com/torusresearch/code-eval/blob/main/validate.py
(You need to run `python eval_[model_name].py` before running `python validate.py`. See the repository README.)
Interesting. I'm working on publishing the dataset soon as well, so it would be good to analyse it for the canonical solutions, because I couldn't find any exact matches.
Note that I trimmed leading and trailing whitespace from the canonical solutions:

```python
if problem["canonical_solution"].strip() in r["completion"]:
```
Otherwise it would not match anything. (I think this is because the canonical solutions contain some additional newlines at the end which are not present in the generated code.)
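For reference, here is a minimal sketch of that containment check. The completions file name (`completions.json`) and its structure (a JSON list of records with a `"completion"` field, in the same order as the HumanEval problems) are assumptions for illustration, not the repo's actual layout; adapt it to however `eval_[model_name].py` stores its output.

```python
# Minimal sketch of the verbatim-containment check, assuming completions
# are stored as a JSON list of records with a "completion" field, ordered
# the same way as the HumanEval problems. "completions.json" is a
# placeholder file name, not the actual output of the eval scripts.
import json

from human_eval.data import read_problems  # openai/human-eval package

problems = list(read_problems().values())

with open("completions.json") as f:
    results = json.load(f)

# A problem counts as leaked if its stripped canonical solution
# appears verbatim inside the model's generated completion.
leaked = sum(
    problem["canonical_solution"].strip() in r["completion"]
    for problem, r in zip(problems, results)
)
print(f"{leaked / len(problems):.2%} of canonical solutions found in completions")
```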
Any update on it?