
CodeParrot is a code generation model trained on 50 GB of preprocessed Python data from GitHub repositories (the CodeParrot dataset). The original dataset contained a lot of duplicated and noisy data, so it was cleaned with the following steps:

  • Exact match deduplication
  • Filtering:
    • Average line length < 100 characters
    • Maximum line length < 1000 characters
    • Fraction of alphanumeric characters > 0.25
    • Remove auto-generated files (keyword search)
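
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the actual preprocessing script: the function names, the keyword list, and the choice to scan only the first few lines for auto-generation markers are assumptions made for the example, and line lengths are measured in characters.

```python
import hashlib

# Thresholds from the list above; line lengths are counted in characters.
MAX_MEAN_LINE_LENGTH = 100
MAX_LINE_LENGTH = 1000
MIN_ALNUM_FRACTION = 0.25
# Hypothetical keyword list; the real script may use different keywords.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def content_hash(text):
    """Hash the full file content so exact duplicates map to the same key."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def line_stats(text):
    """Return (mean line length, max line length) in characters."""
    lengths = [len(line) for line in text.splitlines()] or [0]
    return sum(lengths) / len(lengths), max(lengths)


def alnum_fraction(text):
    """Fraction of characters in the file that are alphanumeric."""
    if not text:
        return 0.0
    return sum(c.isalnum() for c in text) / len(text)


def is_autogenerated(text, scan_lines=5):
    """Keyword search over the first few lines (scan_lines is an assumption)."""
    head = "\n".join(text.splitlines()[:scan_lines]).lower()
    return any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def keep_file(text):
    """Apply the four filters; a file must pass all of them to be kept."""
    mean_len, max_len = line_stats(text)
    return (
        mean_len < MAX_MEAN_LINE_LENGTH
        and max_len < MAX_LINE_LENGTH
        and alnum_fraction(text) > MIN_ALNUM_FRACTION
        and not is_autogenerated(text)
    )


def clean(files):
    """Exact-match deduplication followed by filtering."""
    seen = set()
    kept = []
    for text in files:
        key = content_hash(text)
        if key in seen:
            continue  # exact duplicate of a file already seen
        seen.add(key)
        if keep_file(text):
            kept.append(text)
    return kept
```

Calling `clean` on a list of file contents returns the files that survive both deduplication and all four filters, in their original order.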

For more details, see the preprocessing script in the transformers repository.