
Reproducibility of SantaCoder

#22
by mh - opened

I tried to evaluate SantaCoder on the HumanEval set, but the performance I measured differs from the results reported in the paper.
Looking through the examples, I see it sometimes generates Java code even though the HumanEval task asks for Python.
I loaded the model and ran inference following the model card directions.
Is there anything I need to check in order to reproduce the HumanEval evaluation score?
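
For reference, my generation setup is essentially the snippet from the model card, plus sampling for completions. A minimal sketch (the sampling parameters below are my own choices, not necessarily what the paper used):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"
device = "cuda"  # or "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)

prompt = "def print_hello_world():"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# For HumanEval I sample completions; temperature/max_new_tokens are my own guesses.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    max_new_tokens=128,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0]))
```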

BigCode org

Thanks for your interest in SantaCoder. Could you say more about how you're running evaluation and the results you get?

Note that for the SantaCoder paper, the evaluation results were produced with MultiPL-E directly (github.com/nuprl/MultiPL-E). We did not use the BigCode evaluation harness: MultiPL-E integration into the evaluation harness is a WIP.

Thank you for the quick reply.
I just used the data from https://github.com/openai/human-eval and wrote the evaluation code myself, which may be why the numbers are different. I will test with MultiPL-E too. Thanks again.
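
For context, my homemade pipeline roughly follows the pattern from the openai/human-eval README: read the problems, generate completions, dump them to a JSONL file, and score with the provided tool. A sketch, where `generate_completion` is a placeholder for my model call:

```python
from human_eval.data import read_problems, write_jsonl

problems = read_problems()

# generate_completion(prompt) is a placeholder for the actual model call above.
samples = [
    dict(task_id=task_id, completion=generate_completion(problems[task_id]["prompt"]))
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)
# Then score with the CLI shipped in the repo:
#   evaluate_functional_correctness samples.jsonl
```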

BigCode org

Can you also report your HumanEval numbers? We also evaluated on the original HumanEval with the evaluation harness, and the numbers aren't very far from the MultiPL-E HumanEval version.

I also encountered this problem.

There are three evaluation modules for the HumanEval dataset:
(1) Official: the original module provided by OpenAI (Codex), https://github.com/openai/human-eval;
(2) bigcode_evaluation_harness: https://github.com/bigcode-project/bigcode-evaluation-harness;
(3) MultiPL-E: https://github.com/nuprl/MultiPL-E.
I find that bigcode_evaluation_harness uses the same 164 programming questions as the Official module, but MultiPL-E differs slightly from the other two, with 161 questions.

With the MultiPL-E module, I can reproduce the reported HumanEval-Python result of 0.49 pass@100, but pass@100 from bigcode_evaluation_harness and the Official module is 0.46, a bit lower than MultiPL-E.
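
As far as I can tell, all three modules report pass@k with the unbiased estimator from the Codex paper (Chen et al., 2021), so the gap should come from the prompts/problem set rather than the metric. A minimal sketch of that estimator, per problem with n samples and c correct:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples for a task, 80 of them pass the tests, k=100
print(pass_at_k(200, 80, 100))
```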

BigCode org

The MultiPL-E HumanEval version is slightly different from the original implementation, which could explain the difference you are observing: the authors add type annotations and doctests to prompts where they are missing. You can refer to the paper for more details.
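
To illustrate the kind of change meant here (a made-up example, not an actual MultiPL-E prompt):

```python
# Original HumanEval-style prompt: untyped signature, no doctest.
def add(x, y):
    """Add two numbers."""

# MultiPL-E-style prompt: type annotations and a doctest filled in.
def add(x: int, y: int) -> int:
    """Add two numbers.

    >>> add(2, 3)
    5
    """
```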

I made a mistake in mentioning the HumanEval data; the problem was actually with the MBPP dataset. After I switched to the MultiPL-E version of MBPP, I get results close to the reported score. I guess the numbers can vary slightly due to the randomness of generation. Thanks
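
As a side note, the run-to-run variance from sampling can be reduced (though not eliminated, since pass@k still depends on how many samples are drawn per task) by pinning the RNGs before generation, e.g. with transformers' `set_seed` helper:

```python
from transformers import set_seed

set_seed(42)  # seeds Python's random, NumPy, and PyTorch RNGs before sampling completions
```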

christopher changed discussion status to closed
