Update datasets/polycoder.md
bbe8882

The PolyCoder paper gives a nice comparison of existing code models. The authors also trained a code generation model on 249 GB of data (after preprocessing), consisting of popular GitHub repositories in 12 programming languages with at least 50 stars, collected in October 2021. The data was preprocessed as follows:

  • Exact match deduplication
  • Filtering:
    • Average line length < 100 tokens
    • Maximum line length < 1000 tokens
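
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the helper names (`passes_filters`, `dedup_and_filter`) are made up here, and line lengths are measured in characters as a simplification, whereas the paper's thresholds are in tokens.

```python
import hashlib

def passes_filters(content: str, max_avg_len: int = 100, max_line_len: int = 1000) -> bool:
    # Keep files whose average line length is under 100 and whose longest
    # line is under 1000 (thresholds from the PolyCoder setup; measured in
    # characters here as a simplification -- the paper counts tokens).
    lines = content.splitlines()
    if not lines:
        return False
    avg = sum(len(line) for line in lines) / len(lines)
    longest = max(len(line) for line in lines)
    return avg < max_avg_len and longest < max_line_len

def dedup_and_filter(files: dict) -> dict:
    # Exact-match deduplication: hash each file's full content and keep
    # only the first file seen per hash, then apply the line-length filters.
    seen, kept = set(), {}
    for path, content in files.items():
        digest = hashlib.sha256(content.encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier file
        seen.add(digest)
        if passes_filters(content):
            kept[path] = content
    return kept
```

For example, an exact copy of an already-seen file is dropped by the hash check, while a file containing a single 2000-character line fails the filters.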