loubnabnl HF staff committed on
Commit
f1f9d12
1 Parent(s): 9acda8b
Files changed (1)
  1. datasets/codeparrot.txt +1 -1
datasets/codeparrot.txt CHANGED
@@ -1,4 +1,4 @@
- [CodeParrot](https://huggingface.co/lvwerra/codeparrot) was trained on **50GB** of Python data, after preprocessing, from Github repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
  - Exact match deduplication
  - Filtering:
  - Average line length < 100 tokens
 
+ [CodeParrot](https://huggingface.co/lvwerra/codeparrot) is a code generation model trained on **50GB** of Python data, after preprocessing, from Github repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
  - Exact match deduplication
  - Filtering:
  - Average line length < 100 tokens
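
The two cleaning steps listed in the changed file can be illustrated with a minimal sketch. This is not the actual CodeParrot preprocessing code: the function names `exact_dedup`, `avg_line_length`, and `clean` are hypothetical, and "tokens" is assumed here to mean whitespace-separated tokens per line, which the file does not specify.

```python
import hashlib


def exact_dedup(files):
    """Keep the first occurrence of each distinct file content (exact match)."""
    seen = set()
    unique = []
    for text in files:
        # Hash the full content so identical files collapse to one digest.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def avg_line_length(text):
    """Mean number of whitespace-separated tokens per line (assumed unit)."""
    lines = text.splitlines()
    if not lines:
        return 0.0
    return sum(len(line.split()) for line in lines) / len(lines)


def clean(files, max_avg_line_length=100):
    """Deduplicate, then drop files whose average line length is too high."""
    return [
        text
        for text in exact_dedup(files)
        if avg_line_length(text) < max_avg_line_length
    ]
```

Hashing each file keeps memory use proportional to the number of distinct files rather than their total size. If line length is meant in characters instead, only `avg_line_length` would need to change.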