
CodeParrot is a code generation model trained on 50GB of preprocessed Python data from GitHub repositories (the CodeParrot dataset). The original dataset contains a lot of duplicated and noisy data, so it was cleaned with the following steps:

  • Exact match deduplication
  • Filtering:
    • Average line length < 100 tokens
    • Maximum line length < 1000 tokens
    • Alphanumeric characters fraction > 0.25
    • Remove auto-generated files (keyword search)
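
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the actual preprocessing script: the function names, the keyword list, and the choice of SHA-256 for exact-match hashing are all assumptions made for this example, and line length is measured in characters here for simplicity, whereas the list above states the thresholds in tokens.

```python
import hashlib

# Thresholds taken from the list above; lengths are counted in characters
# here, which is an assumption of this sketch.
MAX_MEAN_LINE_LENGTH = 100
MAX_LINE_LENGTH = 1000
MIN_ALNUM_FRACTION = 0.25

# Hypothetical keywords; the real script may search for different ones.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def content_hash(text):
    """Hash the full file content so exact duplicates share a key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def line_stats(text):
    """Return (mean line length, max line length) for a file."""
    lengths = [len(line) for line in text.splitlines()] or [0]
    return sum(lengths) / len(lengths), max(lengths)


def alnum_fraction(text):
    """Fraction of characters in the file that are alphanumeric."""
    if not text:
        return 0.0
    return sum(c.isalnum() for c in text) / len(text)


def is_autogenerated(text, scan_lines=5):
    """Keyword search over the first few lines of the file."""
    head = "\n".join(text.splitlines()[:scan_lines]).lower()
    return any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def keep_file(text):
    """Apply the four filters; True means the file passes all of them."""
    mean_len, max_len = line_stats(text)
    return (
        mean_len < MAX_MEAN_LINE_LENGTH
        and max_len < MAX_LINE_LENGTH
        and alnum_fraction(text) > MIN_ALNUM_FRACTION
        and not is_autogenerated(text)
    )


def clean(files):
    """Drop exact duplicates first, then apply the filters."""
    seen = set()
    kept = []
    for text in files:
        key = content_hash(text)
        if key in seen:
            continue
        seen.add(key)
        if keep_file(text):
            kept.append(text)
    return kept
```

Calling `clean` on a list of file contents returns the files that survive both deduplication and filtering, in their original order. On a dataset of this size, the same predicate would more plausibly be applied in a streaming or batched fashion rather than over an in-memory list.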

For more details, see the preprocessing script in the transformers repository here.