
Datasets question

#2
by Yuxin715d - opened

Hi, I see in the model configuration that you used evol_instruct_code (nickrosh/Evol-Instruct-Code-80k-v1) and megacode (rombodawg/MegaCodeTraining112k) as training datasets. But it seems that the MegaCodeTraining dataset already contains the first one.

Here is what the MegaCodeTraining dataset repo says: "This is a mega combined dataset using both razent/wizardlm-code-evol-32k and nickrosh/Evol-Instruct-Code-80k-v1."

Do you think these duplicated samples will affect training performance?
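
For anyone wanting to quantify the overlap before training, here is a minimal sketch using the `datasets` library. The split and column names (`"instruction"`, `"prompt"`) are assumptions and would need to be matched to the actual schemas of the two repos.

```python
from datasets import load_dataset

# Load both datasets mentioned above. Split and column names are assumptions;
# adjust them to whatever the repos actually expose.
evol = load_dataset("nickrosh/Evol-Instruct-Code-80k-v1", split="train")
mega = load_dataset("rombodawg/MegaCodeTraining112k", split="train")

def prompt_set(ds, column):
    """Collect a set of normalised prompt strings from one column."""
    return {text.strip() for text in ds[column] if isinstance(text, str)}

evol_prompts = prompt_set(evol, "instruction")  # assumed column name
mega_prompts = prompt_set(mega, "prompt")       # assumed column name

overlap = evol_prompts & mega_prompts
print(f"{len(overlap)} of {len(evol_prompts)} Evol-Instruct prompts also appear in MegaCode")
```

Note that an exact-match check like this only catches verbatim duplicates; near-duplicates (e.g. reformatted prompts) would need fuzzy matching or MinHash-style deduplication to detect.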

That would be a question to direct to the model trainers. I just provided the quantisations.

However, I've just noticed that the source model has been deleted from Hugging Face. I'm not sure why; maybe it was deemed not very good.

I might delete my quantisations as well, now that I've realised the source model is gone.

@rombodawg, were you involved in making this source model, which used your training dataset? Do you happen to know why it was deleted?

No, I did not assist in making it, nor do I know why it was deleted; that would be a question for the Open Assistant team. I just made the dataset, and they used it.
