Please wait to make the next version based on my losslessdataset

by rombodawg - opened Aug 14, 2023

Aug 14, 2023 (edited Aug 14, 2023)

If you planned on using my version 2 dataset (losslessmegacodetraining) for your next model, I would highly recommend waiting a little while, because version 3 is coming out very soon. I already have all the data needed to make it; I just have to do a little editing to compile it when I have time. It will be closer to 2M lines of data in a 50%/50% coding/non-coding split, as opposed to the current losslessmegacodetraining dataset, which is at 1M lines with only a 25%/75% coding/non-coding split.

rombodawg

Sep 10, 2023

@juyongjiang I don't know if you are still active, but I made a few new datasets. I would recommend finetuning either the codellama-python-13b or the wizardcoder-python-13b model on them to create your next model. I personally would start with the wizardcoder model, since it already has great coding performance, and build up from there with my dataset.

For code only:

https://huggingface.co/datasets/rombodawg/LimitlessCodeTraining_Guanaco_Format

For code + non-code instructions in an 80%/20% split:

https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV3_MINI_Guanaco_Format
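If it helps, here is a rough sketch of how one of these datasets could be pulled down and split into turns before finetuning. It is only an illustration under assumptions: it assumes each record is a single string in a column named `text`, that the split is called `train`, and that turns are marked with `### Human:` and `### Assistant:` (the usual Guanaco convention). Check the dataset card before relying on any of that. The helper names `split_guanaco` and `load_training_texts` are made up for this example.

```python
import re

# Assumption: turns are introduced by "### Human:" / "### Assistant:" markers.
_TURN = re.compile(r"###\s*(Human|Assistant):\s*")


def split_guanaco(text):
    """Split one Guanaco-style string into (role, content) pairs."""
    # re.split with a capturing group yields:
    # [prefix, role1, content1, role2, content2, ...]
    parts = _TURN.split(text)
    roles = parts[1::2]
    contents = parts[2::2]
    return [(role.lower(), content.strip()) for role, content in zip(roles, contents)]


def load_training_texts(
    repo_id="rombodawg/LimitlessCodeTraining_Guanaco_Format",
    column="text",  # assumed column name, verify on the dataset card
):
    """Download the dataset from the Hub and return its raw text records."""
    # Imported here so the parsing helper above works without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(repo_id, split="train")  # "train" split is an assumption
    return ds[column]


if __name__ == "__main__":
    sample = "### Human: Write hi### Assistant: print('hi')"
    for role, content in split_guanaco(sample):
        print(role, "->", content)
```

Swapping `repo_id` for the LosslessMegaCodeTrainingV3_MINI_Guanaco_Format repository should work the same way if it uses the same layout.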
