[CodeGen](https://huggingface.co/Salesforce/codegen-16B-mono) is a model for conversational program synthesis, in which each problem is solved interactively over multiple steps. Each step consists of a natural language specification from the user and a synthesized subprogram from the system.
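
The multi-step interaction described above can be sketched as a prompt that grows with each turn. This is a minimal illustration, not the model's documented interface: the helper `build_prompt` is hypothetical, and writing each specification as a Python comment followed by its subprogram is an assumed prompt format.

```python
def build_prompt(history, next_spec):
    """Concatenate earlier turns and the new specification into one prompt.

    history: list of (specification, subprogram) pairs from earlier turns.
    next_spec: natural language specification for the current turn.
    The comment-based layout is an assumption made for illustration.
    """
    parts = []
    for spec, subprogram in history:
        # Each earlier turn: the user's specification, then the system's code.
        parts.append(f"# {spec}\n{subprogram.rstrip()}\n")
    # The model would be asked to continue the text after this comment.
    parts.append(f"# {next_spec}\n")
    return "\n".join(parts)


history = [
    ("Read a list of integers from a string.",
     "numbers = [int(x) for x in text.split()]"),
]
prompt = build_prompt(history, "Keep only the even numbers.")
print(prompt)
```

The returned string would then be passed to the model, and the generated subprogram appended to `history` before the next turn.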

It was sequentially trained on three datasets:
- [The Pile](https://huggingface.co/datasets/the_pile)
- A 341 GB subset of Google’s [BigQuery dataset](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code) of code files, restricted to six programming languages: C, C++, Go, Java, JavaScript, and Python
- 217 GB of Python data from GitHub repositories

The second and third datasets used the following preprocessing:
- Exact match deduplication
- Filtering, keeping only files with:
    - Average line length < 100 tokens
    - Maximum line length < 1000 tokens
    - At most 90% of characters being decimal or hexadecimal digits
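
The preprocessing steps above can be sketched as follows. This is a hedged illustration rather than the pipeline actually used: the function names are hypothetical, line length is measured in whitespace-separated tokens as a stand-in for whatever tokenizer was applied, whitespace is ignored when computing the digit fraction, and empty files are dropped by assumption.

```python
import hashlib

HEX_DIGITS = set("0123456789abcdefABCDEF")


def passes_filters(text, max_avg_line_len=100, max_line_len=1000,
                   max_digit_fraction=0.9):
    """Return True if a file satisfies the three filtering criteria."""
    lines = text.splitlines()
    if not lines:
        return False  # assumption: empty files are dropped
    # Assumption: a "token" is a whitespace-separated chunk of a line.
    lengths = [len(line.split()) for line in lines]
    if sum(lengths) / len(lengths) >= max_avg_line_len:
        return False
    if max(lengths) >= max_line_len:
        return False
    non_space = [c for c in text if not c.isspace()]
    if not non_space:
        return False  # assumption: whitespace-only files are dropped
    digit_fraction = sum(c in HEX_DIGITS for c in non_space) / len(non_space)
    return digit_fraction <= max_digit_fraction


def preprocess(files):
    """Exact-match deduplicate file contents, then apply the filters."""
    seen = set()
    kept = []
    for text in files:
        # Hashing the full content detects byte-identical duplicates.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if passes_filters(text):
            kept.append(text)
    return kept


good = "def f(x):\n    return x + 1\n"
hexdump = "deadbeef 00ff12ab\n" * 10
long_line = " ".join(["x"] * 1000)
kept = preprocess([good, good, hexdump, long_line])
```

Here only one copy of `good` survives: the duplicate is removed by the hash check, `hexdump` consists almost entirely of hexadecimal digits, and `long_line` exceeds the maximum line length.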