Model surprisingly good at EvalPlus
#17
by
cristian-rodriguez
- opened
Hi,
I am comparing the following benchmarks:
- EvalPlus
- LiveCodeBench
CodeQwen ranks in the top 5 on the first but only in the top 25 on the second. LiveCodeBench is designed to avoid data contamination.
I wonder whether we can then (probably) assume some data contamination in the training data? Or does anybody have a reasonable explanation for this gap?
Thanks
We strictly enforced data decontamination during training.