CodeGen is a model for conversational program synthesis, in which each problem is solved interactively over multiple steps. Each step consists of a natural language specification from the user and a synthesized subprogram from the system.

It was sequentially trained on three datasets:

  • The Pile
  • A 341GB subset of Google’s BigQuery dataset of code files, restricted to six programming languages: C, C++, Go, Java, JavaScript, and Python
  • 217GB of Python data from GitHub repositories

The second and third datasets used the following preprocessing:

  • Exact match deduplication
  • Filtering, keeping only files with:
    • Average line length < 100 characters
    • Maximum line length < 1000 characters
    • At most 90% of characters being decimal or hexadecimal digits
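
The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names `passes_filters` and `preprocess` are hypothetical, exact-match deduplication is assumed to be done by hashing file contents with SHA-256, line lengths are assumed to be measured in characters, and the digit fraction is assumed to be computed over all characters in the file.

```python
import hashlib

# Characters counted as decimal or hexadecimal digits (assumption: both cases).
HEX_DIGITS = set("0123456789abcdefABCDEF")


def passes_filters(text, max_avg_line_len=100, max_line_len=1000,
                   max_digit_fraction=0.9):
    """Return True if a file survives the line-length and digit filters."""
    if not text.strip():
        return False  # assumption: empty or whitespace-only files are dropped
    lengths = [len(line) for line in text.splitlines()]
    # Keep only files whose average line length is below the threshold.
    if sum(lengths) / len(lengths) >= max_avg_line_len:
        return False
    # Keep only files whose longest line is below the threshold.
    if max(lengths) >= max_line_len:
        return False
    # Drop files in which more than 90% of characters are hex/decimal digits.
    digit_fraction = sum(c in HEX_DIGITS for c in text) / len(text)
    return digit_fraction <= max_digit_fraction


def preprocess(files):
    """Deduplicate by exact content, then apply the filters, preserving order."""
    seen = set()
    kept = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a file already seen
        seen.add(digest)
        if passes_filters(text):
            kept.append(text)
    return kept
```

Calling `preprocess` on a list of file contents returns the files that are neither exact duplicates nor rejected by a filter. The thresholds are exposed as keyword arguments so they can be adjusted if the intended units differ from the character-based reading assumed here.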