CodeParrot is a code generation model trained on 50GB of pre-processed Python data from GitHub repositories (the CodeParrot dataset). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
- Exact match deduplication
- Filtering:
  - Average line length < 100 characters
  - Maximum line length < 1000 characters
  - Alphanumeric characters fraction > 0.25
  - Remove auto-generated files (keyword search)
For more details, see the preprocessing script in the transformers repository.
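The cleaning steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the actual preprocessing script: the function names, the keyword list, the number of header lines scanned, and the choice of SHA-256 for exact-match hashing are all assumptions, and line lengths are assumed to be measured in characters.

```python
import hashlib

# Thresholds from the list above; line lengths are assumed to be in characters.
MAX_MEAN_LINE_LEN = 100
MAX_LINE_LEN = 1000
MIN_ALNUM_FRACTION = 0.25
# Hypothetical keyword list; the real script may search for different phrases.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def is_autogenerated(content, scan_lines=5):
    """Keyword search over the first few lines of a file (assumed heuristic)."""
    head = content.splitlines()[:scan_lines]
    return any(kw in line.lower() for line in head for kw in AUTOGEN_KEYWORDS)


def passes_filters(content):
    """Return True if a file satisfies every filtering rule."""
    lines = content.splitlines()
    if not lines:
        return False
    lengths = [len(line) for line in lines]
    if sum(lengths) / len(lengths) >= MAX_MEAN_LINE_LEN:
        return False
    if max(lengths) >= MAX_LINE_LEN:
        return False
    alnum = sum(ch.isalnum() for ch in content)
    if alnum / len(content) <= MIN_ALNUM_FRACTION:
        return False
    return not is_autogenerated(content)


def clean(files):
    """Drop exact duplicates (by content hash), then apply the filters."""
    seen = set()
    kept = []
    for content in files:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact match of a file already processed
        seen.add(digest)
        if passes_filters(content):
            kept.append(content)
    return kept
```

Calling `clean` on a list of file contents returns the files that survive deduplication and filtering, in their original order. Hashing the content keeps memory use proportional to the number of files rather than their total size.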