
CodeGen is a model for conversational program synthesis, in which each problem is solved interactively over multiple steps. Each step consists of a natural language specification from the user and a synthesized subprogram from the system.

It was sequentially trained on three datasets:

- The Pile
- A 341GB subset of Google’s BigQuery dataset of code files from multiple programming languages, keeping only 6: C, C++, Go, Java, JavaScript, and Python
- 217GB of Python data from GitHub repositories

The second and third datasets used the following preprocessing:

- Exact match deduplication
- Filtering, keeping only files that satisfy:
  - Average line length < 100 tokens
  - Maximum line length < 1000 characters
  - At most 90% of characters being decimal or hexadecimal digits
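The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used for CodeGen: the function names `keep_file` and `preprocess` are hypothetical, line lengths are assumed to be measured in characters, whitespace is assumed to be excluded from the digit ratio, and SHA-256 hashing stands in for whatever exact-match mechanism the authors used.

```python
import hashlib

# Characters counted as decimal or hexadecimal digits.
HEX_DIGITS = set("0123456789abcdefABCDEF")


def keep_file(text, max_avg_len=100, max_line_len=1000, max_digit_frac=0.9):
    """Return True if a file passes all three filters (thresholds are assumptions)."""
    lines = text.splitlines()
    if not lines:
        return False
    lengths = [len(line) for line in lines]
    # Filter 1: average line length must stay below the threshold.
    if sum(lengths) / len(lengths) >= max_avg_len:
        return False
    # Filter 2: the longest line must stay below the threshold.
    if max(lengths) >= max_line_len:
        return False
    # Filter 3: drop files that are mostly decimal or hexadecimal digits.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    digit_frac = sum(c in HEX_DIGITS for c in chars) / len(chars)
    return digit_frac <= max_digit_frac


def preprocess(files):
    """Exact-match deduplicate a list of file contents, then apply the filters."""
    seen = set()
    kept = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # byte-for-byte duplicate of an earlier file
        seen.add(digest)
        if keep_file(text):
            kept.append(text)
    return kept


samples = [
    "def f(x):\n    return x + 1\n",
    "def f(x):\n    return x + 1\n",   # exact duplicate
    "x = '" + "a" * 2000 + "'\n",      # one very long line
    "0123456789abcdef\n" * 10,         # almost entirely hex digits
]
result = preprocess(samples)
```

Here `result` contains only the first sample: the second is an exact duplicate, the third exceeds the maximum line length, and the fourth is dominated by hexadecimal digits. Hashing the full file content only catches byte-identical copies, which matches "exact match" deduplication but not near-duplicate detection.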