The [PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The authors also trained a code generation model on **249GB** of preprocessed data, drawn from popular GitHub repositories (at least 50 stars) covering 12 programming languages and collected in October 2021. The data went through the following preprocessing:
- Exact match deduplication
- Filtering:
    - Files with fewer than 100 tokens are removed
    - Files larger than 1 MB are removed
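
To make the two steps concrete, here is a minimal sketch of exact match deduplication followed by threshold-based filtering. It is not the authors' pipeline: the function names, the SHA-256 content hash, whitespace tokenization, and the default thresholds are all illustrative assumptions, and the thresholds should be set to match the filters listed above.

```python
import hashlib
from typing import Iterable, Iterator, Tuple

# Illustrative defaults (assumptions); adjust to the filters you want to apply.
MAX_FILE_BYTES = 1024 * 1024  # one possible reading of "1 MB"
MIN_TOKENS = 100              # tokens are approximated by whitespace splitting


def content_hash(text: str) -> str:
    """Return a SHA-256 digest of the exact file contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def passes_filters(
    text: str,
    max_bytes: int = MAX_FILE_BYTES,
    min_tokens: int = MIN_TOKENS,
) -> bool:
    """Keep files that are small enough and contain enough tokens."""
    size_ok = len(text.encode("utf-8")) <= max_bytes
    tokens_ok = len(text.split()) >= min_tokens
    return size_ok and tokens_ok


def preprocess(
    files: Iterable[Tuple[str, str]],
    max_bytes: int = MAX_FILE_BYTES,
    min_tokens: int = MIN_TOKENS,
) -> Iterator[Tuple[str, str]]:
    """Yield (path, text) pairs after exact match deduplication and filtering."""
    seen = set()
    for path, text in files:
        digest = content_hash(text)
        if digest in seen:
            continue  # byte-identical to a file we already processed
        seen.add(digest)
        if passes_filters(text, max_bytes, min_tokens):
            yield path, text


if __name__ == "__main__":
    body = "token " * 150
    sample = [("a.py", body), ("copy_of_a.py", body), ("tiny.py", "x = 1")]
    for path, _ in preprocess(sample):
        print(path)
```

Because exact duplicates have identical contents, they pass or fail the filters together, so running deduplication before or after filtering keeps the same set of files. Hashing the contents rather than storing them keeps memory use proportional to the number of unique files.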