loubnabnl HF staff committed on
Commit
873252e
1 Parent(s): 2b3c79e

update datasets

Files changed (1)
  1. datasets/codeparrot.txt +4 -4
datasets/codeparrot.txt CHANGED
@@ -1,8 +1,8 @@
- [CodeParrot](https://huggingface.co/lvwerra/codeparrot) was trained on **50GB** of Python data from Github repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
+ [CodeParrot](https://huggingface.co/lvwerra/codeparrot) was trained on **50GB** of Python data, after preprocessing, from Github repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
  - Exact match deduplication
- - Filtering
- - Average line length < 100
- - Maximum line length < 1000
+ - Filtering:
+ - Average line length < 100 tokens
+ - Maximum line length < 1000 MB
  - Alpha numeric characters fraction > 0.25
  - Remove auto-generated files (keyword search)
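The cleaning steps listed in the changed file can be illustrated with a minimal sketch. It is not the actual CodeParrot preprocessing code. It assumes both line-length thresholds are measured in characters per line, which is an interpretation because the file gives the units as "tokens" and "MB". The keyword list `AUTOGEN_KEYWORDS`, the `scan_lines` parameter, and all function names are hypothetical, since the file only says "keyword search".

```python
import hashlib

# Hypothetical keyword list; the source only mentions a "keyword search".
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def line_stats(text):
    """Return (average line length, maximum line length) in characters."""
    lengths = [len(line) for line in text.splitlines()] or [0]
    return sum(lengths) / len(lengths), max(lengths)


def alnum_fraction(text):
    """Fraction of characters in the file that are alphanumeric."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def is_autogenerated(text, scan_lines=5):
    """Look for a marker keyword in the first few lines (assumed heuristic)."""
    head = "\n".join(text.splitlines()[:scan_lines]).lower()
    return any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def keep_file(text):
    """Apply the filtering thresholds listed in the file."""
    mean_len, max_len = line_stats(text)
    return (
        mean_len < 100                  # average line length
        and max_len < 1000              # maximum line length
        and alnum_fraction(text) > 0.25
        and not is_autogenerated(text)
    )


def clean(files):
    """Exact-match deduplication followed by filtering."""
    seen = set()
    kept = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a file already processed
        seen.add(digest)
        if keep_file(text):
            kept.append(text)
    return kept
```

Calling `clean` on a list of file contents returns the surviving files in their original order. Hashing the full content keeps the deduplication exact, so files that differ by even one character are treated as distinct.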