
CodeParrot is a code generation model trained on 50GB of pre-processed Python data from GitHub repositories (the CodeParrot dataset). The original dataset contained many duplicated and noisy files, so it was cleaned with the following steps:

  • Exact match deduplication
  • Filtering:
    • Average line length < 100 characters
    • Maximum line length < 1000 characters
    • Alphanumeric characters fraction > 0.25
    • Remove auto-generated files (keyword search)

For more details, see the preprocessing script in the transformers repository.
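
The cleaning steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the actual preprocessing script: the function names, the keyword list, the number of header lines scanned, and the choice of SHA-256 for exact-match hashing are all assumptions made for this example, and line lengths are measured in characters.

```python
import hashlib

# Hypothetical keyword list; the real script may search for different markers.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def line_stats(content):
    """Return (average line length, maximum line length) in characters."""
    lengths = [len(line) for line in content.splitlines()] or [0]
    return sum(lengths) / len(lengths), max(lengths)


def alnum_fraction(content):
    """Fraction of characters in the file that are alphanumeric."""
    if not content:
        return 0.0
    return sum(c.isalnum() for c in content) / len(content)


def is_autogenerated(content, scan_lines=5):
    """Keyword search over the first few lines (scan_lines is an assumption)."""
    head = content.splitlines()[:scan_lines]
    return any(kw in line.lower() for line in head for kw in AUTOGEN_KEYWORDS)


def keep_file(content):
    """Apply the four filters from the list above to one file."""
    mean_len, max_len = line_stats(content)
    return (
        mean_len < 100
        and max_len < 1000
        and alnum_fraction(content) > 0.25
        and not is_autogenerated(content)
    )


def clean(files):
    """Exact-match deduplication followed by filtering."""
    seen = set()
    kept = []
    for content in files:
        # Hash the full text so identical files collapse to one entry.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if keep_file(content):
            kept.append(content)
    return kept
```

Calling `clean` on a list of file contents returns the unique files that pass every filter, in their original order. Deduplicating before filtering means each distinct file is checked only once.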