Finetuning StarCoder with languages that are not present in The Stack

#98
by lazarantal - opened

Hi,

I would like to finetune StarCoder for languages that are not present in The Stack dataset. How can I prepare a custom dataset and use it with the finetuning process? The different languages of the-stack dataset are stored as parquet files. How can I generate such files for additional languages?

Thanks,
Toni

I'd encourage you to take a look at StarCoder2, which was trained on over 600 languages. In any case, the fine-tuning process should be the same whether or not the language is included in the pretraining dataset.
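
On the parquet question, here is a minimal sketch of one way to turn a folder of source files into a parquet file. It is not the pipeline used to build The Stack. The function names, the `.mylang` extension, and the column names (`content`, `path`, `lang`) are assumptions for illustration; use whatever text column your fine-tuning script expects. Writing the file assumes pandas and a parquet engine such as pyarrow are installed.

```python
from pathlib import Path
import tempfile


def collect_source_files(root, extensions, lang):
    """Return one record per file under `root` whose suffix is in `extensions`."""
    records = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in extensions:
            records.append({
                # Column names are illustrative assumptions, not a required schema.
                "content": path.read_text(encoding="utf-8", errors="ignore"),
                "path": str(path.relative_to(root)),
                "lang": lang,
            })
    return records


def write_parquet(records, out_path):
    """Write the records to a parquet file, one row per source file."""
    # Imported lazily so the collection step works without pandas installed.
    import pandas as pd
    pd.DataFrame(records).to_parquet(out_path, index=False)


# Small self-contained demo using a temporary directory and a made-up extension.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "hello.mylang").write_text("print 'hello'\n", encoding="utf-8")
    (Path(tmp) / "README.md").write_text("not code\n", encoding="utf-8")
    demo_records = collect_source_files(tmp, {".mylang"}, "mylang")
    # write_parquet(demo_records, "mylang.parquet")  # uncomment to write the file
```

The resulting file can then be loaded with the `datasets` library via `load_dataset("parquet", data_files="mylang.parquet")` and passed to the fine-tuning script like any other dataset.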
