Spaces: codeparrot/code-generation-models

loubnabnl (HF Staff) committed on May 25, 2022

Commit 44b6c59

1 Parent(s): a5b4c8d

add dataset intro

Files changed (1)

datasets/intro.txt +1 -0

datasets/intro.txt ADDED

@@ -0,0 +1 @@

+ Most code models are trained on data from public software repositories hosted on GitHub. Some also include code paired with natural-language text, for example from Stack Overflow. Additional datasets can be built for the model's target task. [AlphaCode](https://arxiv.org/pdf/2203.07814v1.pdf), for instance, was fine-tuned on [CodeContests](https://github.com/deepmind/code_contests), a competitive programming dataset for machine learning. Another popular dataset is [The Pile](https://huggingface.co/datasets/the_pile), which contains code along with a large proportion of natural-language text. This mix can be useful for models intended to translate natural text to code or the reverse; it was used in [CodeGen](https://arxiv.org/pdf/2203.13474.pdf), for instance. For model-specific information about the pretraining dataset, please select a model below: