loubnabnl HF staff committed on
Commit
1378d9b
1 Parent(s): 75fc24e

update datasets

Files changed (2)
  1. datasets/codegen.txt +17 -0
  2. datasets/polycoder.txt +5 -0
datasets/codegen.txt ADDED
@@ -0,0 +1,17 @@
+ CodeGen is a model for conversational program synthesis, where each problem is interactively solved in multiple steps, each consisting of a natural language specification from the user and a synthesized subprogram from the system.
+
+ It was sequentially trained on three datasets:
+ - The Pile
+ - A 341GB subset of Google’s [BigQuery dataset](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code) of code files from multiple programming languages, keeping only 6: C, C++, Go, Java, JavaScript, and Python
+ - 217GB of Python data from GitHub repositories
+
+ The second and third datasets used the following preprocessing:
+ - Exact match deduplication
+ - Filtering:
+   - Average line length < 100 characters
+   - Maximum line length < 1000 characters
+   - >90% of the characters being decimal or hexadecimal digits
+
+ **Remark**:
+ The reported data sizes are after preprocessing.
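The filtering heuristics listed in this file can be sketched roughly as follows. This is a minimal illustration, not the CodeGen authors' code: it assumes line length is measured in characters, that a file is kept only when its average line length is below 100, its longest line is below 1000, and at most 90% of its non-whitespace characters are decimal or hexadecimal digits. The function names and the keep/remove direction of each threshold are assumptions made for this example.

```python
import string

# Characters counted as decimal or hexadecimal digits: 0-9, a-f, A-F.
HEX_DIGITS = set(string.hexdigits)


def line_stats(text):
    """Return (average line length, maximum line length) in characters."""
    lines = text.splitlines() or [""]
    lengths = [len(line) for line in lines]
    return sum(lengths) / len(lengths), max(lengths)


def hex_digit_fraction(text):
    """Return the fraction of non-whitespace characters that are hex digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in HEX_DIGITS for c in chars) / len(chars)


def keep_file(text, max_avg=100, max_line=1000, max_hex=0.9):
    """Hypothetical keep/drop rule; the thresholds mirror the list above."""
    avg, longest = line_stats(text)
    return (
        avg < max_avg
        and longest < max_line
        and hex_digit_fraction(text) <= max_hex
    )
```

Under these assumptions, ordinary source files pass, while minified files with one very long line and hex dumps are dropped.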
datasets/polycoder.txt ADDED
@@ -0,0 +1,5 @@
+ The [PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The model was trained on **254GB** of data (after preprocessing) drawn from popular GitHub repositories, each with at least 50 stars, across 12 popular programming languages, collected in October 2021. The data used the following preprocessing:
+ - Exact match deduplication
+ - Filtering:
+   - Files larger than 1 MB are removed
+   - Files shorter than 100 tokens are removed
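Exact match deduplication, the first preprocessing step named in this file, can be sketched as below. This is only an illustration under stated assumptions: "exact match" is taken to mean identical file contents, and hashing with SHA-256 is a choice made for this example, not something the file or the paper specifies. The function name `deduplicate_exact` is hypothetical.

```python
import hashlib


def deduplicate_exact(files):
    """Keep the first file seen for each distinct content.

    `files` is an iterable of (path, text) pairs; order is preserved.
    """
    seen = set()
    unique = []
    for path, text in files:
        # Hash the content so that large files are not held in the set.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((path, text))
    return unique
```

Files that differ by even one character, such as a trailing newline, are treated as distinct. Near-duplicate detection would need a different technique.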