Please wait to make the next version based on my lossless dataset

#2
by rombodawg - opened

If you planned on using my version 2 dataset (losslessmegacodetraining) for your next model, I would highly recommend waiting a little while, because version 3 is coming out very soon. I already have all the data necessary to make it; I just have to do a little editing to compile it when I have time. It will be closer to 2m lines of data in a 50%/50% coding/non-coding split, as opposed to the current losslessmegacodetraining dataset, which is at 1m lines with only a 25%/75% coding/non-coding split.

@juyongjiang I don't know if you are active anymore, but I made a few new datasets, and I would recommend fine-tuning either the codellama-python-13b or the wizardcoder-python-13b model on them to create your next model. I would personally start with the wizardcoder model, since it already has great coding performance, and build up from there with my dataset.

For code only:

For code + non-code instructions in an 80%/20% split:
