finetuning for autocompletion?

#63
by rachelshalom - opened

Hey, I took a look at the StarCoder finetuning code for instructions. I would like to fine-tune on private code for autocompletion. Right now I have private repos, and I think autocompletion is the only task I can do with that. I am having a bit of a hard time understanding what I need to modify in the StarCoder finetune code. Has anyone tried to modify the finetuning code for the autocompletion task?
This is the code I'm referring to: https://github.com/bigcode-project/starcoder/blob/main/finetune/finetune.py

BigCode org

Hi, you just need to change the input text and use the content of your code files as is, instead of the instruction format here.
Your buffer should get:

buffer.append(next(iterator)["content"])

If "content" is the name of the column that has the code you want to train on in your dataset.

Thank you, I will try that. I think I was confused by the fact that instruction tuning is supervised and autocompletion is not, so I thought the training would be done differently. But as I understand it, causal training is done in both cases (it's just that the instruction-response data is concatenated, and that is how it is fed to the model).
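
To make that concrete, a small illustration (column names are placeholders, not an actual dataset schema); both variants produce a single text string that is trained with the same causal language-modeling loss:

def prepare_instruction_sample(example):
    # instruction tuning: prompt and response are concatenated into one sequence
    return f"Question: {example['question']}\n\nAnswer: {example['response']}"

def prepare_autocomplete_sample(example):
    # autocompletion: the raw file content itself is the training sequence
    return example["content"]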

Hi @loubnabnl, so you're saying to just feed the code straight into the model, without matching the input/output format? Will this break FIM, or does the model somehow know how to support FIM for code files it has been trained on?

In the finetuning code you pointed to, the data preparation is similar to the pre-training procedure, but it doesn't use FIM. The model might still be able to perform FIM after that fine-tuning. However, if you want to preserve the same infilling capabilities, you might want to include FIM in the training; you can check this code which uses FIM. It should be easy to adapt to the StarCoder repo finetuning with PEFT since both use a similar data class.
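
Not the exact code from that link, but a minimal sketch of how a FIM transform typically works, assuming StarCoder's FIM special tokens (<fim_prefix>, <fim_suffix>, <fim_middle>) and a fim_rate that controls how often the transform is applied:

import numpy as np

def maybe_apply_fim(text, fim_rate=0.5, rng=np.random.default_rng()):
    # Leave a fraction of samples as plain left-to-right text.
    if rng.random() > fim_rate or len(text) < 2:
        return text
    # Split the document into prefix / middle / suffix at two random points.
    lo, hi = sorted(rng.integers(0, len(text), size=2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
    # PSM format: the model sees prefix and suffix, and learns to generate the middle.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"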

Hi @loubnabnl, for this "training on my private repo" scenario, should the validation dataset be the same as the training dataset?

BigCode org

We usually select a small percentage of the training data as a validation dataset (and we remove it from the training split).
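
For example, with the datasets library the split can be as simple as the following (the file name and 5% test_size are just placeholders):

from datasets import load_dataset

dataset = load_dataset("json", data_files="my_private_code.jsonl", split="train")
# Hold out 5% of the data for validation so the two splits never overlap.
splits = dataset.train_test_split(test_size=0.05, seed=42)
train_data, valid_data = splits["train"], splits["test"]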

I am following a similar path where I need to train on a private codebase, but I need some pointers on how to convert private repos into datasets that can be used in fine-tuning. Any help will be greatly appreciated.

Hi @loubnabnl, I am also trying to use StarCoder as a starting point: I would like to train the model on a private codebase and deploy the new model somewhere (HF perhaps) so that it can be called from any IDE. Is there an article that details the entire process end-to-end? I am finding it difficult to gather the necessary information. Any help would be greatly appreciated.

Hi, has anyone tried this and can give an estimate of how long it takes?
I have just started training on a dataset of 20k code samples; how long should it take?

Hey @loubnabnl,
Regarding the comment "this code which uses fim":

Did anyone open a PR to add FIM support to the PEFT fine-tuning script?

One more question: can one use this same script with CodeLlama?