Spaces: codeparrot/code-generation-models

loubnabnl (HF Staff) committed on May 27, 2022

Commit

75fc24e

1 Parent(s): c3ea8fa

update datasets

Files changed (1)

datasets/intro.txt +1 -1

@@ -1 +1 @@

- Most code models are trained on data from public software repositories hosted on GitHub. Some also include code coupled with natural text from Stackoverflow for example. Additional datasets can be crafted based on the target task of the model. [Alphacode](https://arxiv.org/pdf/2203.07814v1.pdf), for instance, was fine-tuned on [CodeContests](https://github.com/deepmind/code_contests), a competitive programming dataset for machine-learning. Another popular dataset is [The Pile](https://huggingface.co/datasets/the_pile), which contains code and an important proportion of natural text. It can be efficient for models intended to do translation from natural text to code or the opposite, it was used in [CodeGen](https://arxiv.org/pdf/2203.13474.pdf) for instance.

+ Most code models are trained on data from public software repositories hosted on GitHub. Some also include code coupled with natural text from Stackoverflow for example. Additional datasets can be crafted based on the target task of the model. [Alphacode](https://arxiv.org/pdf/2203.07814v1.pdf), for instance, was fine-tuned on [CodeContests](https://github.com/deepmind/code_contests), a competitive programming dataset for machine-learning. Another popular dataset is [The Pile](https://huggingface.co/datasets/the_pile), it is a large corpus containing both natural language texts and code from different sources such as StackExchange dumps and popular (>100 stars) GitHub repositories. It can be efficient for models intended to do translation from natural text to code or the opposite, it was used in [CodeGen](https://arxiv.org/pdf/2203.13474.pdf) for instance.