
Datasets question

#2
by Yuxin715d - opened

Hi, I see in the model configuration that you used evol_instruct_code (nickrosh/Evol-Instruct-Code-80k-v1) and megacode (rombodawg/MegaCodeTraining112k) as training datasets. But it seems that the MegaCodeTraining dataset already contains the first one.

Here is what the MegaCodeTraining dataset repo says: "This is a mega combined dataset using both razent/wizardlm-code-evol-32k and nickrosh/Evol-Instruct-Code-80k-v1."

Do you think these duplicated samples will affect training performance?
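
For anyone wanting to quantify the overlap before training, here is a minimal sketch using the `datasets` library. The split and column names (`"instruction"`, `"prompt"`) are assumptions and would need to be matched to the actual schemas of the two repos.

```python
from datasets import load_dataset

# Load both datasets mentioned above. Split and column names are assumptions;
# adjust them to whatever the repos actually expose.
evol = load_dataset("nickrosh/Evol-Instruct-Code-80k-v1", split="train")
mega = load_dataset("rombodawg/MegaCodeTraining112k", split="train")

def prompt_set(ds, column):
    """Collect a set of normalised prompt strings from one column."""
    return {text.strip() for text in ds[column] if isinstance(text, str)}

evol_prompts = prompt_set(evol, "instruction")  # assumed column name
mega_prompts = prompt_set(mega, "prompt")       # assumed column name

overlap = evol_prompts & mega_prompts
print(f"{len(overlap)} of {len(evol_prompts)} Evol-Instruct prompts also appear in MegaCode")
```

Note that an exact-match check like this only catches verbatim duplicates; near-duplicates (e.g. reformatted prompts) would need fuzzy matching or MinHash-style deduplication to detect.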

That would be a question to direct to the model trainers. I just provided the quantisations.

However, I've just noticed that the source model has been deleted from Hugging Face. I'm not sure why; maybe it was deemed not very good.

I might delete my quantisations as well, now that I've realised the source model is gone.

@rombodawg, were you involved in making this source model, which used your training dataset? Do you happen to know why it was deleted?

No, I did not assist in making it, nor do I know why it was deleted; that would be a question for the Open Assistant team. I just made the dataset, and they used it.
