Can you create a version with a bigger dataset with more coding languages and capabilities? I can provide the data

#1
by rombodawg - opened

As the title says, I have created a dataset that would be perfect for training V2 of your model.
I would have made the model myself, but I simply don't have the resources to create it, sadly.
Here is the dataset I am talking about below; feel free to use it.
https://huggingface.co/datasets/rombodawg/MegaCodeTraining112k

Awesome! Thanks for your kindness.
Okay, let me retrain a better version with your dataset as soon as possible.
When it is available, I will let you know. : )

Any progress? I have a new dataset now too:

https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored

Cool! Almost done! 1638/1840, I guess deepse/CodeUp-Llama-2-13b-chat-hf equipped with 190K code instruction data will be available by the day after tomorrow. : )

@rombodawg How can I contact you privately? My email is csjuyongjiang@gmail.com . I have a plan for the data and want to discuss it with you more. : )

Message me on Discord.
My Discord name is rombodawg.

This seems like a great model in the works and I'm very interested in taking it for a test drive. Are you going to update these model files or release a v2 model when this round of training is done?

If you are interested, I have some ideas, scripts, and the beginnings of some datasets I've made from scratch to give LLMs broader knowledge on several subjects that every coding model seems to be weak on. There are several things I am building datasets around, such as more current knowledge on code for LM development, training efficiency improvements, and groundbreaking projects that add new technologies. I am also focusing on stretching coding context with larger scripting examples, rather than the small and narrow scripting examples that are common in most datasets.

I only have CPU rigs with large RAM to train with, and that's too slow to be practical, so I've been looking for someone with access to GPU rigs. If you'd like to collaborate, let me know.

Is this a question for me or juyongjiang? Because I don't have powerful PC hardware for training. I made my datasets with Notepad++.

Yeah, I was aiming that at juyongjiang. Sorry I didn't call that out.

Notepad++ has some nice column editing features that have come in handy in the past. Unfortunately, some of my datasets have outgrown Notepad++'s ability to edit them, so I have to edit the big datasets with scripts. Speaking of scripts, how have you been uncensoring your datasets? Are you using scripts like the ones the users ehartford, ewof, or anon8231489123 use on their datasets?

I've been creating my own custom scripts using ChatGPT, borrowing some code from sources like the ones you mentioned.
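
For readers wondering what an "uncensoring" script can look like, here is a minimal sketch. It is not the script used by anyone in this thread. It assumes each record is a dict with a hypothetical `output` field, and the refusal phrases are a short illustrative list, not the lists used in the scripts mentioned above. It drops records whose response looks like a refusal and works as a generator, so it can stream over datasets too large for an editor.

```python
import json

# Illustrative refusal markers (an assumption); real cleaning scripts
# typically use much longer phrase lists.
REFUSAL_MARKERS = (
    "as an ai language model",
    "i cannot fulfill",
    "i'm sorry, but",
)


def is_refusal(text: str) -> bool:
    """Return True if the text contains any refusal marker (case-insensitive)."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def filter_refusals(records, field="output"):
    """Yield only the records whose `field` does not look like a refusal.

    `field` is a hypothetical column name; adjust it to the dataset's schema.
    """
    for record in records:
        if not is_refusal(record.get(field, "")):
            yield record


# Tiny made-up sample, only to show the behaviour.
sample = [
    {"instruction": "Reverse a string in Python.",
     "output": "Use slicing: s[::-1]"},
    {"instruction": "Write a port scanner.",
     "output": "As an AI language model, I cannot help with that."},
]

cleaned = list(filter_refusals(sample))
print(json.dumps(cleaned, indent=2))
```

For a large JSON Lines file, the same generator could be fed one parsed line at a time instead of an in-memory list.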

@JuLuComputing

Awesome!!

"There are several things I am building datasets around, such as more current knowledge on code for LM development, training efficiency improvements, and groundbreaking projects that add new technologies. I am also focusing on stretching coding context with larger scripting examples, rather than the small and narrow scripting examples that are common in most datasets."

I am very interested in it. How can I contact you privately? Let us discuss more details. : )

@JuLuComputing Sure, I will release it as CodeUp-alpha, almost done! 1790/1840. : )

Fantastic! I'm interested in trying it out.

Also, I just now sent you an email about collaboration. :-)

@JuLuComputing Great! I have released CodeUp-alpha-13b-hf. Please give it a try: https://huggingface.co/deepse/CodeUp-alpha-13b-hf : )

@juyongjiang
Awesome! Can you link my dataset in your README, right after the license, like this:

datasets:
- rombodawg/Legacy_MegaCodeTraining200k

I'm assuming you used that version. If not, you can link the updated uncensored version like this:

datasets:
- rombodawg/2XUNCENSORED_MegaCodeTraining188k
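
As a rough illustration of where that `datasets:` entry ends up, the sketch below builds the YAML metadata block that sits at the top of a Hugging Face model card README, between `---` lines. The helper name `model_card_front_matter` is hypothetical, and the license value `other` is only a placeholder, since the thread does not say which license the model uses.

```python
def model_card_front_matter(license_id, datasets):
    """Build the YAML metadata block for the top of a model card README.

    Hypothetical helper for illustration; it only formats the `license`
    and `datasets` keys discussed in this thread.
    """
    lines = ["---", f"license: {license_id}", "datasets:"]
    lines += [f"- {name}" for name in datasets]
    lines.append("---")
    return "\n".join(lines) + "\n"


front_matter = model_card_front_matter(
    "other",  # placeholder; the actual license is not stated in the thread
    ["rombodawg/Legacy_MegaCodeTraining200k"],
)
print(front_matter)
```

Swapping the dataset ID for `rombodawg/2XUNCENSORED_MegaCodeTraining188k` gives the second variant.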

@rombodawg Cool! No problem. Done!

@juyongjiang I've made a much bigger and better coding dataset, if you are interested in making a version 2:
https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV3_2.2m_Evol
