What coding datasets were used?

#3
by rombodawg - opened

Which coding datasets exactly were used to train this model? Could you provide a list? Are they all open source and accessible here on Hugging Face, or are they private?

NousResearch org

everything is in the model card

all the data is open source and accessible on HF except for about 50k instructions from GPT-4, of which ~45k are general Alpaca-style instructions from basic seed tasks, and ~4,500 are specialized.

not really insane at code, yet. we will get better :)

Gotcha, well feel free to use my dataset to train it if you want. I plan on expanding the dataset in the future as well.
Link:
https://huggingface.co/datasets/rombodawg/MegaCodeTraining112k

NousResearch org

Thank you, will check it out!

Does it support languages other than English? Does the training dataset include linguistic data?
