
Reproducibility of SantaCoder

#22
by mh - opened

I tried to evaluate SantaCoder on the HumanEval set, but the performance I measured differs from the results reported in the paper.
Looking through the examples, I see it sometimes generates Java code even though the HumanEval task asks for Python.
I loaded the model and ran inference following the model card directions.
Is there anything I need to check in order to reproduce the HumanEval evaluation score?
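
For reference, my generation setup is essentially the snippet from the model card, plus sampling for completions. A minimal sketch (the sampling parameters below are my own choices, not necessarily what the paper used):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"
device = "cuda"  # or "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)

prompt = "def print_hello_world():"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# For HumanEval I sample completions; temperature/max_new_tokens are my own guesses.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    max_new_tokens=128,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0]))
```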

BigCode org

Thanks for your interest in SantaCoder. Could you say more about how you're running evaluation and the results you get?

Note that for the SantaCoder paper, the evaluation results were produced with MultiPL-E directly (github.com/nuprl/MultiPL-E). We did not use the BigCode evaluation harness: MultiPL-E integration into the evaluation harness is a WIP.

Thank you for the quick reply.
I just used the data from https://github.com/openai/human-eval and wrote the evaluation code myself, which may be why the numbers are different. I will test with MultiPL-E too. Thanks again.
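
For context, my homemade pipeline roughly follows the pattern from the openai/human-eval README: read the problems, generate completions, dump them to a JSONL file, and score with the provided tool. A sketch, where `generate_completion` is a placeholder for my model call:

```python
from human_eval.data import read_problems, write_jsonl

problems = read_problems()

# generate_completion(prompt) is a placeholder for the actual model call above.
samples = [
    dict(task_id=task_id, completion=generate_completion(problems[task_id]["prompt"]))
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)
# Then score with the CLI shipped in the repo:
#   evaluate_functional_correctness samples.jsonl
```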

BigCode org

Can you also report your HumanEval numbers? We also evaluated on the original HumanEval with the evaluation harness, and the numbers aren't very far from the MultiPL-E HumanEval version.

I also encountered this problem.

There are three evaluation modules for the HumanEval dataset:
(1) Official: the original module provided by OpenAI (Codex), https://github.com/openai/human-eval;
(2) bigcode_evaluation_harness: https://github.com/bigcode-project/bigcode-evaluation-harness;
(3) MultiPL-E: https://github.com/nuprl/MultiPL-E.
I find that bigcode_evaluation_harness uses the same 164 programming questions as the Official module, but MultiPL-E differs slightly from the other two, with 161 questions.

With the MultiPL-E module, I can reproduce the reported HumanEval-Python result of 0.49 pass@100, but pass@100 from bigcode_evaluation_harness and the Official module is 0.46, a bit lower than MultiPL-E.
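
As far as I can tell, all three modules report pass@k with the unbiased estimator from the Codex paper (Chen et al., 2021), so the gap should come from the prompts/problem set rather than the metric. A minimal sketch of that estimator, per problem with n samples and c correct:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples for a task, 80 of them pass the tests, k=100
print(pass_at_k(200, 80, 100))
```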

BigCode org

The MultiPL-E HumanEval version is slightly different from the original implementation, which could explain the difference you are observing: the authors add type annotations and doctests to prompts where they are missing. You can refer to the paper for more details.
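
To illustrate the kind of change meant here (a made-up example, not an actual MultiPL-E prompt):

```python
# Original HumanEval-style prompt: untyped signature, no doctest.
def add(x, y):
    """Add two numbers."""

# MultiPL-E-style prompt: type annotations and a doctest filled in.
def add(x: int, y: int) -> int:
    """Add two numbers.

    >>> add(2, 3)
    5
    """
```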

I made a mistake in mentioning the HumanEval data; the problem was actually with the MBPP dataset. After I switched to the MultiPL-E version of MBPP, I get results close to the reported score. I guess the numbers can vary slightly due to the randomness of generation. Thanks
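
As a side note, the run-to-run variance from sampling can be reduced (though not eliminated, since pass@k still depends on how many samples are drawn per task) by pinning the RNGs before generation, e.g. with transformers' `set_seed` helper:

```python
from transformers import set_seed

set_seed(42)  # seeds Python's random, NumPy, and PyTorch RNGs before sampling completions
```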

christopher changed discussion status to closed
