lewtun posted an update (Mar 12)
Can we align code generation models to be good at chat without compromising their base capabilities 🤔?

This was the question the H4 team asked itself when BigCode released StarCoder2 a bit over a week ago. We knew that code models like deepseek-ai/deepseek-coder-6.7b-instruct and m-a-p/OpenCodeInterpreter-DS-33B get impressive scores on code benchmarks like HumanEval, but they tend to score poorly on chat benchmarks like MT Bench and IFEval. We also knew that the Zephyr recipe we applied to Mistral 7B produced a strong chat model, so we wondered -- could it be tweaked to produce a strong coding assistant?

It turns out the answer is yes and I'm happy to share StarChat2, a DPO fine-tune of StarCoder2 15B that scores highly on both HumanEval and MT Bench / IFEval 🌟!

The most interesting lesson for me was that you get better models by blending in more code/math data than chat data during the SFT step -- in terms of tokens, we found a 3:1 ratio of code/math to chat worked best.
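
For illustration, here's roughly how you could approximate that mix with the `datasets` library. The dataset names are placeholders (not the ones we used), and `interleave_datasets` samples by example rather than by token, so the probabilities only approximate a token-level 3:1 ratio:

```python
# A rough sketch of a 3:1 code/math-to-chat mix with the `datasets` library.
# Dataset names are placeholders, and interleave_datasets samples by example
# rather than by token, so this only approximates a token-level ratio.
from datasets import load_dataset, interleave_datasets

code_math = load_dataset("your-org/code-math-sft", split="train")  # placeholder
chat = load_dataset("your-org/chat-sft", split="train")            # placeholder

mixed = interleave_datasets(
    [code_math, chat],
    probabilities=[0.75, 0.25],  # ~3 parts code/math to 1 part chat
    seed=42,
    stopping_strategy="all_exhausted",
)
```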

Anyway, here's a demo of the model, along with all the code and datasets we used to train it:

* Demo: HuggingFaceH4/starchat2-playground
* Collection: HuggingFaceH4/starchat2-15b-65f068417b330fafad751fce
* Recipe: https://github.com/huggingface/alignment-handbook
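
If you're curious what the DPO step looks like in code, here's a minimal sketch with `trl`. The model ID, preference dataset, and hyperparameters below are placeholders rather than the actual StarChat2 config (that lives in the alignment-handbook recipe), and the exact keyword arguments vary across trl versions:

```python
# A minimal sketch of the DPO step with trl -- not the exact StarChat2 config.
# Everything marked "placeholder" is an assumption; the real recipe is in the
# alignment-handbook repo, and exact kwargs vary across trl versions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholder: in practice you would start from your SFT checkpoint.
model_id = "your-org/starchat2-15b-sft"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder: any preference dataset with "prompt", "chosen" and "rejected" columns.
prefs = load_dataset("your-org/preference-pairs", split="train")

args = DPOConfig(
    output_dir="starchat2-15b-dpo",
    beta=0.05,                       # illustrative value, not the published one
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=prefs,
    processing_class=tokenizer,      # `tokenizer=` in older trl releases
)
trainer.train()
```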

Hope it's useful to others!

This is great! But how is this different from OpenChat-3.5-0106, a chat model that scores 7.8 on MT Bench but also scores 71 on HumanEval?

Great question - as far as HumanEval is concerned, I think the main advantage of using StarCoder2 as a base model over Mistral 7B is that the base performance is much stronger, so you need fewer code samples on the SFT side to see an improvement. Another thing to note is that some of the public SFT datasets bear a strong similarity to HumanEval prompts, so you can get a surprisingly large score from just a few samples in the training set.

AFAIK the datasets used to train OpenChat aren't public, but my hunch is that if you take our recipe and apply it to Mistral, you'll end up with a worse-performing model on HumanEval (needs evidence though!)