Model surprisingly good at EvalPlus

by cristian-rodriguez - opened

I am checking the following benchmarks:

  • EvalPlus
  • LiveCodeBench

CodeQwen ranks in the top 5 on the first but only in the top 25 on the second. LiveCodeBench is supposed to avoid data contamination.

I wonder whether we can then (probably) assume some data contamination in the training data. Or does anybody have a reasonable explanation for this gap?


Qwen org

We strictly enforced data decontamination during training.
