True HumanEval score or NewHope repeat?
How can we be sure that this model is actually beating GPT-4 because it was trained well, and not because HumanEval data leaked into the model's training data? Did you make sure to remove any HumanEval data from the training set before training the model?
Yes, we checked for contamination and found none. Our training set is quite different from HumanEval and mostly serves to align the model, which shows that CodeLlama-34B is already quite strong.
We use the same decontamination process as OpenAI: https://www.phind.com/blog/code-llama-beats-gpt4.
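For anyone unfamiliar with what decontamination means in practice: the approach OpenAI described is roughly to flag any training example that shares a sufficiently long n-gram with an evaluation prompt and drop it. Here is a minimal sketch of that idea; it is not Phind's actual pipeline, and the function names, sample data, and n=10 threshold are all illustrative.

```python
# Minimal sketch of n-gram-overlap decontamination, in the spirit of the
# process OpenAI described (flag training examples that share long
# word-level n-grams with evaluation prompts). NOT Phind's actual
# pipeline; names and the n=10 threshold are illustrative.

def ngrams(text: str, n: int = 10) -> set:
    """Return the set of word-level n-grams in `text`."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_example: str, eval_prompts: list, n: int = 10) -> bool:
    """True if the training example shares any n-gram with an eval prompt."""
    grams = ngrams(train_example, n)
    return any(grams & ngrams(p, n) for p in eval_prompts)

# Hypothetical usage: filter a training set against HumanEval prompts.
humaneval_prompts = ["def has_close_elements(numbers, threshold): ..."]
training_set = ["some unrelated instruction-tuning example about sorting lists in place"]
clean_set = [ex for ex in training_set if not is_contaminated(ex, humaneval_prompts)]
```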
@michaelroyzen One more question: do you plan on releasing the dataset, or will it remain closed source?
I am also wondering whether the dataset will be released.
Not at this time. It's part of our secret sauce. But we plan to continue releasing models -- stay tuned for v2 in a few days.