New Dataset

#11
by Raspbfox - opened

It seems to be a high-quality open dataset, with a size comparable to the numbers in the LLaMA paper! Sounds exciting!

I already downloaded the dataset and I am in the process of cleaning it up :-) After cleaning all the data up and having nearly 100% good sentences, I plan to run every sentence from the dataset through NLLB and create the same dataset for 200 languages. Then we would probably have a very good base for multilingual LLMs. But it will take a huge amount of time, and I am already thinking about how I will store the 200+ GB of data :-( Any idea on that, @Raspbfox?
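For the NLLB pass, roughly this is what I have in mind (just a sketch: the distilled 600M checkpoint, the batching, and the target-language codes here are placeholder choices, not final ones):

```python
# Sketch: per-sentence translation with NLLB-200 via transformers.
# The checkpoint and language codes are the published NLLB-200 ones;
# batching and the example target language are illustrative assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"  # smallest NLLB-200 checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def translate(sentences, target_lang="deu_Latn"):
    """Translate a batch of English sentences into one target language."""
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        # NLLB expects the target language code as the forced first token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
        max_new_tokens=256,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate(["The quick brown fox jumps over the lazy dog."]))
```

Looping that over all 200 target codes is what makes the runtime (and the 200+ GB) scary, hence the storage question.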

Hah, I know it's obvious, but, compression? Text compresses really, really well!
Some compression algorithms store metadata so the archive stays navigable!
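For example, something like zstd with streaming reads keeps a JSONL corpus usable without ever unpacking it to disk (a minimal sketch with the `zstandard` package; the file names are placeholders):

```python
# Sketch: stream a JSONL corpus through zstd so it never has to fit in RAM.
# Uses the `zstandard` package; file names here are placeholders.
import io
import json
import zstandard as zstd

# Compress the whole file as one stream.
with open("sentences.jsonl", "rb") as src, open("sentences.jsonl.zst", "wb") as dst:
    zstd.ZstdCompressor(level=19).copy_stream(src, dst)

# Decompress as a stream and iterate records one at a time.
with open("sentences.jsonl.zst", "rb") as fh:
    reader = zstd.ZstdDecompressor().stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        record = json.loads(line)
        # ...process one record at a time...
```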

How would you rate the quality of the dataset, @snapo?

From a first look, the data does not look that great... that's why I have to filter it and keep only English sentences. After that I have to filter out all code, so the model would learn only language. To learn code I would create a separate dataset from GitHub data under Apache- or GPL-licensed code, because that is the only code that corporations are free to use. I will let you know about my filtering progress :-) (I have never worked with so much data before.) For the code to also have translations, I have to extract the comments and translate them without the corresponding variable and function names (that will be kind of difficult). When all that is done, the instructions can be learned with fine-tuning and either manual or automatic translation with NLLB (translation with the bigger NLLB models is very good!). I need at least one month to even be able to estimate how big the resulting dataset with 200 languages will be :-)
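Roughly, the two filtering steps could look like this (a sketch only: langdetect for language ID and Python's `tokenize` module for comment extraction are illustrative picks here, not settled tooling):

```python
# Sketch of two cleaning steps: keep only English sentences, and pull
# comments out of (Python) source files so only comment text gets
# translated, never identifiers. The library choices are assumptions.
import io
import tokenize
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect deterministic

def is_english(sentence: str) -> bool:
    try:
        return detect(sentence) == "en"
    except LangDetectException:  # too short / no detectable features
        return False

def extract_comments(python_source: str) -> list[str]:
    """Return comment strings only; variable/function names stay untouched."""
    tokens = tokenize.generate_tokens(io.StringIO(python_source).readline)
    return [tok.string.lstrip("# ").rstrip() for tok in tokens
            if tok.type == tokenize.COMMENT]

print(is_english("The cat sat on the mat."))           # True
print(extract_comments("x = 1  # initial counter\n"))  # ['initial counter']
```

Stitching translated comments back next to the right identifiers is the hard part this sketch skips.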

[screenshot of a raw data sample]
A small example: a mix of Unicode and JSON. Also, some downloads had multiple restarts because of hash mismatches... Each dataset type (location) has a different JSON layout, and around 800 GB is already ZST-compressed :-) so more compression might not work, hehehe. Let's see how it plays out.
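Since the hash mismatches keep coming up, a chunked checksum pass is a cheap guard before cleaning starts (plain hashlib; the shard names and digests below are hypothetical):

```python
# Sketch: verify downloaded shards against expected SHA-256 digests in
# constant memory. File names and digests here are hypothetical.
import hashlib
from pathlib import Path

EXPECTED = {
    "shard-000.jsonl.zst": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

for name, expected in EXPECTED.items():
    ok = sha256_of(Path(name)) == expected
    print(f"{name}: {'ok' if ok else 'MISMATCH - redownload'}")
```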

Oh, and some text files contain swear words...
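If those need to go, a dumb blocklist pass is often a good enough first cut (sketch only; the word list is a stand-in for a real list or a classifier):

```python
# Sketch: drop sentences containing blocklisted words.
# The terms below are placeholders for an actual profanity list.
import re

BLOCKLIST = {"swearword1", "swearword2"}  # placeholder terms
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, BLOCKLIST)) + r")\b", re.IGNORECASE
)

def is_clean(sentence: str) -> bool:
    return pattern.search(sentence) is None
```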

Just as a simplification, I am not sure translating code comments would be a good use of time, as it's considered a "bad smell" to not have your comments in English anyway :D
Unless, of course, it's beneficial for the model to later generalize knowledge about languages and their translations.

Do you know if there is a limit on the total repo size for datasets on Hugging Face? @Raspbfox

No idea, tbh, will need to Google that :D
