[InCoder](https://huggingface.co/facebook/incoder-6B) is a code generation model that also supports code editing via [infilling](https://arxiv.org/pdf/2204.05999.pdf). It was trained on **216 GB** of preprocessed data from GitHub and Stack Overflow, covering 28 programming languages: 52 GB of Python code, 107 GB of code in other programming languages, and 57 GB of non-code content from Stack Overflow.

The GitHub data was filtered to keep only files that meet all of the following criteria:
- Average line length < 100 tokens
- Maximum line length < 3000 tokens
- Alphanumeric character fraction > 0.4
- Not auto-generated (detected by keyword search)
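
The filtering criteria above can be sketched as a single predicate over a file's text. This is a minimal illustration, not the authors' pipeline: the function name `keep_file`, the whitespace-based token count (a stand-in for the real tokenizer), and the `AUTOGEN_MARKERS` keyword list are all assumptions made for the example.

```python
# Hypothetical keyword list; the actual phrases used for InCoder are not given here.
AUTOGEN_MARKERS = ("auto-generated", "autogenerated", "automatically generated")


def keep_file(text, max_avg_len=100, max_line_len=3000, min_alnum_frac=0.4):
    """Return True if a source file passes all four filters."""
    lines = text.splitlines()
    if not lines:
        return False  # nothing to train on

    # Assumption: whitespace-separated words approximate tokens.
    lengths = [len(line.split()) for line in lines]
    if sum(lengths) / len(lengths) >= max_avg_len:
        return False
    if max(lengths) >= max_line_len:
        return False

    # Fraction of alphanumeric characters over the whole file.
    alnum = sum(ch.isalnum() for ch in text)
    if alnum / len(text) <= min_alnum_frac:
        return False

    # Keyword search for auto-generated files.
    lowered = text.lower()
    return not any(marker in lowered for marker in AUTOGEN_MARKERS)
```

A file is kept only when every check passes, so the order of the checks affects speed but not the result.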

The second component of the data consists of questions, answers, and comments from Stack Overflow. It includes:
- all questions that have at least one answer
- up to ten answers with a non-negative score (sorted by score) per question
- up to five comments per question/answer
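
As a rough sketch of these selection rules, the function below trims one Stack Overflow thread. The dictionary layout (`title`, `answers`, `score`, `body`, `comments`) is a hypothetical schema chosen for the example, and keeping the first five comments in their given order is an assumption, since the rule for choosing comments is not stated here.

```python
def select_thread(question, max_answers=10, max_comments=5):
    """Apply the question/answer/comment limits to one thread.

    Returns None when the question has no answers at all.
    """
    answers = question.get("answers", [])
    if not answers:
        return None  # only questions with at least one answer are kept

    # Keep answers with a non-negative score, best first.
    good = [a for a in answers if a["score"] >= 0]
    good.sort(key=lambda a: a["score"], reverse=True)

    return {
        "title": question["title"],
        # Assumption: keep the first comments in their given order.
        "comments": question.get("comments", [])[:max_comments],
        "answers": [
            {
                "body": a["body"],
                "score": a["score"],
                "comments": a.get("comments", [])[:max_comments],
            }
            for a in good[:max_answers]
        ],
    }
```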

Exact-match deduplication was performed on code files. For more details, please refer to the [paper](https://arxiv.org/pdf/2204.05999.pdf).
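
Exact-match deduplication can be illustrated by hashing each file's content and keeping only the first file seen for each hash. This is a minimal sketch: the use of SHA-256 and the `(path, content)` input format are assumptions for the example, not details taken from the paper.

```python
import hashlib


def dedup_exact(files):
    """Drop files whose content is byte-for-byte identical to an earlier one.

    `files` is an iterable of (path, content) pairs; the first occurrence wins.
    """
    seen = set()
    unique = []
    for path, content in files:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((path, content))
    return unique
```

Exact matching catches only identical files; two files that differ by a single character, even just whitespace, are both kept.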