Is there anything about the training data that makes this specifically better at Java and C++?

#1
by jukofyork - opened

Hi, just converting this model to GGUF format now and have a couple of questions.

From: https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard

humaneval-python = 76.83
java = 60.76
javascript = 66.46
cpp = 65.22

Is there anything about the training data that makes this specifically better at Java and C++? This seems to be the first recently fine-tuned coding model I've seen that isn't massively biased towards Python (presumably to game the humaneval-python benchmark, etc.). The recent WizardCoder-33B-V1.1, which is also fine-tuned from DeepSeek-Coder-33B, is so over-trained on Python that it tries to convert any C++ or Java it's given into Python, and is basically unusable for anything else!
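For anyone wondering what the non-Python numbers above actually measure: as far as I know, the leaderboard's java/javascript/cpp scores come from MultiPL-E translations of the HumanEval tasks. Here's a rough Java sketch of one such task (illustrative only, not the exact benchmark text; the `Problem` class and `getPositive` names are just for the example). The model is shown the signature and comment and must fill in the body, and a Python-over-trained model will often answer in Python anyway. A reference completion is included so the sketch compiles:

```java
// MultiPL-E-style Java rendering of a HumanEval-like task (illustrative
// sketch, not the exact benchmark text). The model sees the signature and
// the comment; everything inside the braces is what it must generate.
import java.util.ArrayList;
import java.util.List;

class Problem {
    // Return only the strictly positive numbers from the input list.
    static List<Integer> getPositive(List<Integer> numbers) {
        // Reference completion so this example compiles.
        List<Integer> positives = new ArrayList<>();
        for (Integer n : numbers) {
            if (n > 0) {
                positives.add(n);
            }
        }
        return positives;
    }
}
```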

I will give it a try and report back on how I get on.

Sadly I don't have enough upload bandwidth to upload the GGUF(s), but hopefully @TheBloke or @LoneStriker will convert it soon, as a non-Python-targeted fine-tune could be very useful to a lot of people.

CodeFuse AI org

I'm sorry for not responding sooner.
For this model's training, we used unit-test generation data covering Java/C++, plus code practice exercises (also covering Java/C++) that we constructed ourselves (referencing the Phi textbook work). We have published an article with more details on our WeChat official account; apologies that it is written in Chinese: https://mp.weixin.qq.com/s/2Ddm7-aUJuEnsESSxkmkGg
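In case it helps, here is a purely hypothetical sketch of what one such Java unit-test-generation pair could look like (not our actual data format; the `MathUtils`/`clamp` names are invented for the example, and JUnit 5 is assumed to be on the classpath). The prompt side is the function under test and the target side is the test the model learns to generate:

```java
// Hypothetical illustration of a unit-test-generation training pair
// (not the actual CodeFuse data format; JUnit 5 assumed on the classpath).
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

// Prompt side: the function under test.
class MathUtils {
    static int clamp(int value, int lo, int hi) {
        return Math.max(lo, Math.min(hi, value));
    }
}

// Target side: the unit test the model is trained to generate.
class MathUtilsTest {
    @Test
    void clampKeepsInRangeValues() {
        assertEquals(5, MathUtils.clamp(5, 0, 10));
    }

    @Test
    void clampSnapsToBounds() {
        assertEquals(0, MathUtils.clamp(-3, 0, 10));
        assertEquals(10, MathUtils.clamp(42, 0, 10));
    }
}
```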

CodeFuse AI org

I have translated the introduction of the data used for fine-tuning into English:

[image: English translation of the fine-tuning data introduction]

twelveand0 changed discussion status to closed
