loubnabnl (HF staff) committed on
Commit
5bdf45d
1 Parent(s): 68d0491

update description

Files changed (1)
  1. datasets/incoder.txt +3 -3
datasets/incoder.txt CHANGED
@@ -1,4 +1,4 @@
- [InCoder](https://huggingface.co/facebook/incoder-6B) was trained on trained on 216 GB of data from Github and Stackoverflow from 28 programming languages. 52 GB rae in Python, 107GB in other programming languages and 57GB is content from stackoverflow that isn't code.
+ [InCoder](https://huggingface.co/facebook/incoder-6B) was trained on trained on **216 GB** of data from Github and Stackoverflow from 28 programming languages. 52 GB is in Python, 107GB in other programming languages and 57GB is content from Stackoverflow that isn't code.
 
  The Github data used the following filtering:
  - Average line length < 100
@@ -6,10 +6,10 @@ The Github data used the following filtering:
  - Alphanumeric characters fraction > 0.4
  - Remove auto-generated files (keyword search)
 
- The second componenet of the data consists of questions, answers, and comments from StackOverflow, it includes:
+ The second component of the data consists of questions, answers, and comments from StackOverflow, it includes:
  - all questions that have at least one answer
  - up to ten answers with a non-negative score (sorted by score) per question
  - up to five comments per question/answer
- Exact match deduplication was performed in code files.
+ Exact match deduplication was performed on code files.
 
  For more details please refer to this [paper](https://arxiv.org/pdf/2204.05999.pdf).
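
As a rough illustration of the rules listed in the description, here is a minimal Python sketch. It is not the InCoder authors' pipeline. It covers only the criteria visible in this diff (line 5 of the file is not shown, so at least one filter may be missing). The keyword list, the function names, the `score` dictionary key, and the use of SHA-256 for exact matching are all assumptions made for the example.

```python
import hashlib

# Hypothetical keyword list: the description only says "keyword search".
AUTO_GENERATED_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def passes_file_filters(text: str) -> bool:
    """Apply the three file-level filters visible in the description."""
    lines = text.splitlines()
    if not lines:
        return False
    # Average line length < 100
    avg_line_length = sum(len(line) for line in lines) / len(lines)
    if avg_line_length >= 100:
        return False
    # Alphanumeric characters fraction > 0.4
    alnum_fraction = sum(ch.isalnum() for ch in text) / len(text)
    if alnum_fraction <= 0.4:
        return False
    # Remove auto-generated files (keyword search)
    lowered = text.lower()
    return not any(keyword in lowered for keyword in AUTO_GENERATED_KEYWORDS)


def deduplicate_exact(files):
    """Keep the first occurrence of each exactly identical file (assumed: SHA-256 of the text)."""
    seen = set()
    unique = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def select_answers(answers, max_answers=10):
    """Keep up to ten answers with a non-negative score, highest score first."""
    kept = [answer for answer in answers if answer["score"] >= 0]
    kept.sort(key=lambda answer: answer["score"], reverse=True)
    return kept[:max_answers]
```

For example, `passes_file_filters("def add(a, b):\n    return a + b\n")` returns `True`, while a file made of a single 200-character line is rejected by the average line length check.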