[CodeParrot](https://huggingface.co/lvwerra/codeparrot) was trained on **50GB** of Python data from GitHub repositories, obtained after preprocessing: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contained a lot of duplicated and noisy data, so it was cleaned with the following steps:
- Exact-match deduplication
- Filtering:
    - Average line length < 100 characters
    - Maximum line length < 1000 characters
    - Fraction of alphanumeric characters > 0.25
    - Removal of auto-generated files (keyword search)
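The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the actual preprocessing script: the function names, thresholds passed as defaults, and the specific auto-generation keywords are assumptions for the example.

```python
import hashlib

# Hypothetical keywords for detecting auto-generated files; the real
# script's keyword list may differ.
AUTO_GEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")

def is_clean(content: str,
             max_avg_line_len: int = 100,
             max_line_len: int = 1000,
             min_alnum_frac: float = 0.25) -> bool:
    """Apply the line-length, alphanumeric-fraction, and keyword filters."""
    lines = content.splitlines()
    if not lines:
        return False
    line_lens = [len(line) for line in lines]
    # Average line length filter
    if sum(line_lens) / len(line_lens) >= max_avg_line_len:
        return False
    # Maximum line length filter
    if max(line_lens) >= max_line_len:
        return False
    # Alphanumeric character fraction filter
    if sum(c.isalnum() for c in content) / len(content) <= min_alnum_frac:
        return False
    # Keyword search for auto-generated files
    lowered = content.lower()
    if any(kw in lowered for kw in AUTO_GEN_KEYWORDS):
        return False
    return True

def deduplicate(files: list[str]) -> list[str]:
    """Exact-match deduplication by hashing each file's full content."""
    seen, unique = set(), []
    for content in files:
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(content)
    return unique
```

A file would then be kept only if it survives `deduplicate` and passes `is_clean`; the production pipeline applies these filters in parallel over the whole dump.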
For more details, see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).