loubnabnl (HF staff) committed
Commit bbe8882 (1 parent: 8a824dc)

Update datasets/polycoder.md

  1. datasets/polycoder.md +1 -1
datasets/polycoder.md CHANGED
@@ -1,4 +1,4 @@
- The [PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The authors also trained a code generation model on **254GB** of data, after preprocessing, consisting of popular repositories for 12 popular programming languages with at least 50 stars from GitHub in October 2021. The data used the following preprocessing:
+ The [PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The authors also trained a code generation model on **249GB** of data, after preprocessing, consisting of popular repositories for 12 popular programming languages with at least 50 stars from GitHub in October 2021. The data used the following preprocessing:
  - Exact match deduplication
  - Filtering:
  - Average line length < 100 tokens
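
The two preprocessing steps named in the changed file (exact match deduplication and the average-line-length filter) can be illustrated with a minimal sketch. This is not the PolyCoder authors' pipeline: the function names `avg_line_length_tokens` and `preprocess` are hypothetical, "tokens" is assumed here to mean whitespace-separated words, and SHA-256 hashing of file contents is just one possible way to detect exact duplicates.

```python
import hashlib


def avg_line_length_tokens(text: str) -> float:
    """Mean number of whitespace-separated tokens per non-empty line.

    Assumption: a "token" is a whitespace-delimited word; the source
    does not say which tokenizer defines the threshold.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(len(line.split()) for line in lines) / len(lines)


def preprocess(files, max_avg_tokens=100):
    """Drop exact duplicates, then keep files under the line-length threshold."""
    seen_digests = set()
    kept = []
    for content in files:
        # Exact match deduplication: identical contents hash to the same digest.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        # Filtering: keep only files whose average line length is below the limit.
        if avg_line_length_tokens(content) < max_avg_tokens:
            kept.append(content)
    return kept
```

For example, passing two identical short files and one file consisting of a single 150-word line keeps only one copy of the short file.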