The [PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The model was trained on **254GB** of data (measured after preprocessing), drawn from popular GitHub repositories with at least 50 stars, as of October 2021, across 12 programming languages. The following preprocessing was applied:
- Exact match deduplication 
- Filtering:
    - Very short files (fewer than 100 tokens) are removed
    - Very large files (larger than 1 MB) are removed
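
The exact-match deduplication step can be illustrated with a minimal sketch. It is not the PolyCoder authors' code: the function name `dedup_exact` and the `keep` hook are hypothetical, and hashing with SHA-256 is just one common way to detect identical files. The filtering thresholds from the list above are left to the caller-supplied predicate.

```python
import hashlib
from typing import Callable, Iterable, Iterator


def dedup_exact(
    files: Iterable[str],
    keep: Callable[[str], bool] = lambda text: True,
) -> Iterator[str]:
    """Yield each file's text once, skipping exact duplicates.

    `keep` is a hypothetical hook for the filtering step: files for which
    it returns False are dropped before deduplication.
    """
    seen = set()  # digests of file contents already emitted
    for text in files:
        if not keep(text):
            continue
        # Identical content always produces the same digest, so a repeated
        # digest marks an exact duplicate.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Storing digests rather than full file contents keeps memory use proportional to the number of unique files, not their total size. Note that this only catches byte-identical files; near-duplicates (for example, the same file with different whitespace) pass through.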