How to get results like the Code Completion Playground above

#25
by rookielyb - opened

While using it, I found that the Code Completion Playground produces much better results than offline prediction with the StarCoder weights I downloaded. Why is this, and is there any difference between them? Also, the Hosted Inference API does not perform as well as the Code Completion Playground. Does the Playground do anything special?

BigCode org

@rookielyb I'm not sure, but you may want to double-check the hyper-parameters (e.g., temperature).

I have the same problem, but I am running in 8-bit mode; I don't know if that is the reason.

This is my code:

(screenshot of the inference code)
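For context, here is a rough, hypothetical reconstruction of the kind of setup described in this thread (8-bit loading plus top_k sampling); the checkpoint name, prompt, and all parameter values are assumptions, not the poster's actual script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical reconstruction, not the poster's actual code:
# load StarCoder in 8-bit and sample with top_k, as discussed in this thread.
checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    load_in_8bit=True,   # requires the bitsandbytes package
    device_map="auto",
)

prompt = "def fibonacci(n):"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.2,  # illustrative value
    top_k=50,         # the constraint questioned later in this thread
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```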
I ran prediction 10 times and didn't get a single correct result.

(screenshot of the offline prediction results)
But when I try your API, I get the correct result.

(screenshots of the API results)
Why is this? I'm having a hard time reproducing your HumanEval results.
Hope to get your reply!

BigCode org

@rookielyb I notice that you set top_k=50. Can you try again after removing that constraint? I don't think we have ever set top_k.

Thank you for your reply!
Here are the latest parameters:
(screenshot of the latest generation parameters)
With the same parameters, starcoderbase performs better than starcoder.

(screenshot comparing starcoderbase and starcoder outputs)
The left side of the figure uses the starcoderbase weights, and the right side uses starcoder.
I found that the generated results are very sensitive to the parameters.
Could you share the generation parameters used for the HumanEval pass@1 results in the paper?

BigCode org

@rookielyb sure, you can already find them in the paper:

Like Chen et al. (2021), we use sampling temperature 0.2 for pass@1, and temperature 0.8 for k > 1. We generate n = 200 samples for all experiments with open-access models.
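For reference, the unbiased pass@k estimator from Chen et al. (2021) that these settings feed into can be written as a short Python function (the example numbers below are made up):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).
    n: total samples generated per problem, c: samples that pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Made-up example: 200 samples per problem, 70 of them pass -> pass@1 = 0.35
print(pass_at_k(n=200, c=70, k=1))
```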

Hi, as answered in this issue, the playground doesn't do anything special: it calls the inference endpoint to generate code, which is equivalent to calling model.generate with the same parameters (check the Playground's public code).

The HumanEval score is 33%-40%, so it's normal that the model gets some solutions wrong. If you want to reproduce the HumanEval score, run the evaluation-harness on the full benchmark instead of comparing a few problems; as specified in the issue, both the paper settings provided by @SivilTaram and greedy decoding give a pass@1 of 33%-34%. (By the way, it helps to strip the prompt before generation if you're not already doing so.)
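As an illustration of that last point, here is a minimal single-problem sketch (not the evaluation-harness itself) using greedy decoding with the prompt stripped before generation; the checkpoint and prompt are just examples:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch only; for real pass@1 numbers, run the full benchmark with
# the evaluation-harness as suggested above.
checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# Example HumanEval-style prompt; strip trailing whitespace/newlines
# before generation, as suggested above.
prompt = '''
def add(a, b):
    """Return the sum of a and b."""
'''.strip()

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)  # greedy decoding (no sampling)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```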

loubnabnl changed discussion status to closed
