Please consider my painstakingly built coding datasets for fine-tuning a 1.1 version of the WizardCoder series.

#22
by rombodawg - opened

@WizardLM
Hello, you can call me rombodawg. For the past few months I have worked on the LosslessMegaCodeTraining datasets, which you can find on my Hugging Face page, and I have refined them to the best state they have ever been in. I ask that you seriously consider using one of my datasets (or experiment with both and compare the results), which I will link below with brief descriptions, either to further fine-tune your WizardCoder-Python series of models (7B, 13B, 34B) or to combine with your existing dataset and re-finetune the CodeLlama models that you originally used to create the WizardCoder-Python series.

Let me give you some background on why my datasets are worth your while. I created the LosslessMegaCoder-llama2-13b-mini model using Version 2 of my LosslessMegaCodeTraining dataset. If the "Can AI Code" leaderboard is to be trusted, this model actually beats the WizardCoder-Python model at Python code generation.

"Can ai code" for reference in above statement.
https://huggingface.co/spaces/mike-ravkine/can-ai-code-results

However, I believe Version 2 of my dataset is weak compared to Version 3. The reason? BigCode's CommitPackFT! I have converted BigCode's CommitPackFT to Alpaca format and spent a lot of time removing the errors, caused by the more than 250 programming languages in the dataset, that show up when training models on it in the new format. Along with this, I have also added some of the more refined non-coding datasets such as Open-Platypus and (in the mini version, only) the Airoboros 2.1 dataset (I will explain this in more detail below).
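To give a rough idea of what that conversion looks like, here is a minimal sketch in Python. The field names ("message", "old_contents", "new_contents") are illustrative assumptions, not necessarily the exact CommitPackFT schema or my actual conversion script:

```python
# Minimal sketch of a CommitPackFT -> Alpaca conversion, assuming each record
# carries a commit message plus the file contents before and after the change.
# Field names here are assumptions, not a guarantee of the exact schema.
import json

def commit_to_alpaca(record):
    return {
        "instruction": record["message"],        # commit message acts as the task description
        "input": record.get("old_contents", ""), # code before the commit
        "output": record["new_contents"],        # code after the commit
    }

def convert(in_path, out_path):
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:                         # assuming JSON Lines shards
            record = json.loads(line)
            dst.write(json.dumps(commit_to_alpaca(record)) + "\n")

if __name__ == "__main__":
    convert("commitpackft.jsonl", "commitpackft_alpaca.jsonl")
```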

  • My philosophy (LosslessCoding):
    The idea behind lossless coding is simple: train AI on coding and non-coding knowledge at the same time so it does not lose its reasoning and logic abilities. That was an issue with the old WizardCoder model (15B) as well as the NewHope model, which promised high levels of coding performance. I don't know whether you follow this kind of approach when training your models, but I have provided multiple datasets (both lossless and purely coding) for your convenience at the bottom.

That's enough rambling for now; let me get to the meat and potatoes. Which two datasets make up LosslessMegaCodeTrainingVersion3? Below you will find brief descriptions of how they differ. I highly recommend using one of these two datasets. However, if you do not follow the "LosslessCoding" philosophy I mentioned above, I will also link two more datasets that are meant for code only; you can combine those two to get only coding data without any non-coding data (see the sketch after the dataset list below). Note that only the CommitPackFT conversion has been filtered, so that dataset is the best one to use if you want pure, filtered coding data.

LosslessMegaCodeTrainingV3_1.6m_Evol
This dataset comprises the entire Version 2 dataset mentioned above, in addition to the CommitPackFT conversion that I made, as well as Open-Platypus.

link:

LosslessMegaCodeTrainingV3_MINI
This dataset doesn't contain any data from Version 2; it comprises only the CommitPackFT, Open-Platypus, and Airoboros 2.1 datasets. This leaves out the lower-quality data from Version 2 (which has been neither manually filtered nor filtered by AI) and keeps only data that has been filtered.

link:


  • Pure coding datasets:

2XUNCENSORED_MegaCodeTraining188k
The unfiltered coding data present in Version 2 of LosslessMegaCodeTraining.

link:

Rombodawgs_commitpackft_Evolinstruct_Converted
The pure, filtered data from BigCode's CommitPackFT, converted to Alpaca format.

link:
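As promised above, here is a rough sketch of how the two pure coding datasets could be combined with the Hugging Face `datasets` library. The repository IDs, split, and column names below are illustrative placeholders, not the exact paths or schemas of my uploads:

```python
# Minimal sketch of combining the two pure-coding datasets with the Hugging Face
# "datasets" library. Repo IDs and column names are placeholders/assumptions.
from datasets import load_dataset, concatenate_datasets

code_188k = load_dataset("rombodawg/2XUNCENSORED_MegaCodeTraining188k", split="train")       # placeholder ID
commitpack = load_dataset("rombodawg/commitpackft_Evolinstruct_Converted", split="train")    # placeholder ID

# Keep only the shared Alpaca-style columns before concatenating, assuming both
# datasets expose "instruction", "input", and "output".
columns = ["instruction", "input", "output"]
code_188k = code_188k.select_columns(columns)
commitpack = commitpack.select_columns(columns)

combined = concatenate_datasets([code_188k, commitpack]).shuffle(seed=42)
combined.to_json("pure_coding_training_data.jsonl")
print(len(combined), "training examples")
```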

Hello once again. I am pleased to inform you that I have released a new dataset, the successor to MegaCodeTraining. It is called LimitlessCodeTraining. It is, to the best of my knowledge, the purest, most refined and filtered coding dataset on Hugging Face. Feel free to use it to further fine-tune your WizardCoder models, or to train a new model.

link:

WizardLM changed discussion status to closed

Can you guys reply and give me your thoughts before closing the discussion? I'd like to know if you plan on using my datasets at all.

rombodawg changed discussion status to open
WizardLM Team org

Thanks, we will not use them.

WizardLM changed discussion status to closed
