[InCoder](https://huggingface.co/facebook/incoder-6B) was trained on **216 GB** of data from GitHub and StackOverflow, covering 28 programming languages: 52 GB of Python code, 107 GB of code in other programming languages, and 57 GB of StackOverflow content that isn't code.

The GitHub data was filtered with the following rules:
- Average line length < 100
- Maximum line length < 3000
- Alphanumeric characters fraction > 0.4
- Remove auto-generated files (keyword search)
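
The rules above can be sketched as a single predicate. This is only an illustration, not the code used to build the dataset: the function name `keep_file` and the keyword list are hypothetical, line lengths are assumed to be measured in characters (the list does not state the unit), and the alphanumeric fraction is assumed to be computed over every character in the file.

```python
# Hypothetical keyword list; the actual phrases used for the search are not given here.
AUTOGEN_KEYWORDS = (
    "auto-generated",
    "autogenerated",
    "automatically generated",
    "do not edit",
)


def keep_file(text: str) -> bool:
    """Return True if a source file passes all four filters."""
    lines = text.splitlines()
    if not lines:
        return False  # nothing to train on

    # Assumption: lengths are counted in characters.
    lengths = [len(line) for line in lines]
    if sum(lengths) / len(lengths) >= 100:  # average line length < 100
        return False
    if max(lengths) >= 3000:  # maximum line length < 3000
        return False

    # Assumption: the fraction is taken over all characters, whitespace included.
    alnum = sum(ch.isalnum() for ch in text)
    if alnum / len(text) <= 0.4:  # alphanumeric fraction > 0.4
        return False

    # Keyword search for auto-generated files (case-insensitive substring match).
    lowered = text.lower()
    if any(keyword in lowered for keyword in AUTOGEN_KEYWORDS):
        return False

    return True
```

A corpus would then be filtered with something like `[f for f in files if keep_file(f)]`.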

The second component of the data consists of questions, answers, and comments from StackOverflow. It includes:
- all questions that have at least one answer
- up to ten answers with a non-negative score (sorted by score) per question
- up to five comments per question/answer

Exact-match deduplication was performed on code files.
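
As a rough sketch of the StackOverflow selection rules and of exact-match deduplication, the two functions below show one way to read them. The record layout (dicts with `body`, `score`, `answers`, and `comments` keys) and the function names are hypothetical. Sorting answers from highest to lowest score, keeping the first five comments in their given order, and comparing files by SHA-256 digest are assumptions rather than details stated above.

```python
import hashlib


def select_thread(question: dict, max_answers: int = 10, max_comments: int = 5):
    """Apply the selection rules to one question; return None if it is dropped.

    `question` is a hypothetical record: {"body", "comments", "answers"},
    where each answer is {"body", "score", "comments"}.
    """
    if not question["answers"]:
        return None  # only questions with at least one answer are kept

    # Keep answers with a non-negative score, best first (assumed order).
    kept = sorted(
        (a for a in question["answers"] if a["score"] >= 0),
        key=lambda a: a["score"],
        reverse=True,
    )[:max_answers]

    return {
        "body": question["body"],
        "comments": question["comments"][:max_comments],
        "answers": [
            {
                "body": a["body"],
                "score": a["score"],
                "comments": a["comments"][:max_comments],
            }
            for a in kept
        ],
    }


def dedup_exact(files):
    """Drop files whose content is byte-for-byte identical to an earlier file."""
    seen = set()
    unique = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Hashing keeps memory use proportional to the number of distinct files rather than their total size. Files that differ by even one character, such as trailing whitespace, are treated as distinct.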

For more details, please refer to the [paper](https://arxiv.org/pdf/2204.05999.pdf).