code-generation-models / datasets /incoder.txt

Loubna ben allal

add files

c9e8e4a over 2 years ago

	[InCoder](https://huggingface.co/facebook/incoder-6B) was trained on 216 GB of data from GitHub and StackOverflow covering 28 programming languages: 52 GB of Python code, 107 GB of code in other programming languages, and 57 GB of StackOverflow content that isn't code.

	The GitHub data was filtered using the following criteria:
	- Average line length < 100
	- Maximum line length < 3000
	- Alphanumeric characters fraction > 0.4
	- Remove auto-generated files (keyword search)
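
As a rough illustration, the filters above could be sketched as follows. This is a minimal sketch, not the InCoder pipeline: the function names, the `AUTOGEN_KEYWORDS` list, and the choice to scan only the first 1000 characters for keywords are all assumptions, since the note only says "keyword search".

```python
# Hypothetical keyword list; the original filter's keywords are not given here.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def alphanumeric_fraction(text: str) -> float:
    """Fraction of characters in `text` that are letters or digits."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def keep_file(text: str) -> bool:
    """Return True if a source file passes all four filters."""
    lines = text.splitlines()
    if not lines:
        return False
    line_lengths = [len(line) for line in lines]
    # Average line length must be below 100 characters.
    if sum(line_lengths) / len(line_lengths) >= 100:
        return False
    # Maximum line length must be below 3000 characters.
    if max(line_lengths) >= 3000:
        return False
    # More than 40% of the characters must be alphanumeric.
    if alphanumeric_fraction(text) <= 0.4:
        return False
    # Assumption: auto-generated files announce themselves near the top.
    header = text[:1000].lower()
    if any(keyword in header for keyword in AUTOGEN_KEYWORDS):
        return False
    return True
```

For example, `keep_file("def add(a, b):\n    return a + b\n")` passes every check, while a file containing a single 3000-character line is rejected.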

	The second component of the data consists of questions, answers, and comments from StackOverflow. It includes:
	- all questions that have at least one answer
	- up to ten answers with a non-negative score (sorted by score) per question
	- up to five comments per question/answer
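
The selection rules above might look like this in code. This is only a sketch under stated assumptions: the `Post` and `Question` classes are hypothetical containers, and because the note does not say which five comments are kept, the sketch simply keeps the first five in stored order.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Post:
    body: str
    score: int = 0
    comments: List[str] = field(default_factory=list)


@dataclass
class Question(Post):
    answers: List[Post] = field(default_factory=list)


def select_question(question: Question,
                    max_answers: int = 10,
                    max_comments: int = 5) -> Optional[Question]:
    """Apply the StackOverflow selection rules to one question."""
    # Keep only questions that have at least one answer.
    if not question.answers:
        return None
    # Keep up to ten answers with a non-negative score, highest score first.
    ranked = sorted(
        (a for a in question.answers if a.score >= 0),
        key=lambda a: a.score,
        reverse=True,
    )[:max_answers]
    # Assumption: keep the first five comments on each question and answer.
    kept_answers = [
        Post(a.body, a.score, a.comments[:max_comments]) for a in ranked
    ]
    return Question(
        body=question.body,
        score=question.score,
        comments=question.comments[:max_comments],
        answers=kept_answers,
    )
```

A question with no answers yields `None`; otherwise the result holds at most ten answers and at most five comments per post.
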

	Exact-match deduplication was performed on code files.
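
One common way to implement exact-match deduplication is to hash each file's contents and keep the first file seen for each hash. The sketch below assumes that approach; the note does not say how the matching was actually done, and `deduplicate_exact` is a hypothetical helper name.

```python
import hashlib
from typing import Iterable, List


def deduplicate_exact(files: Iterable[str]) -> List[str]:
    """Drop files whose contents are byte-for-byte identical to an earlier file."""
    seen = set()
    unique = []
    for text in files:
        # Assumption: SHA-256 of the UTF-8 bytes serves as the identity key.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Because the comparison is exact, files that differ only in whitespace or comments are treated as distinct and are all kept.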

	For more details please refer to this [paper](https://arxiv.org/pdf/2204.05999.pdf).