loubnabnl HF staff committed on
Commit
44b6c59
1 Parent(s): a5b4c8d

add dataset intro

  1. datasets/intro.txt +1 -0
datasets/intro.txt ADDED
@@ -0,0 +1 @@
+ Most code models are trained on data from public software repositories hosted on GitHub. Some also include code paired with natural-language text, for example from Stack Overflow. Additional datasets can be crafted for the model's target task. [AlphaCode](https://arxiv.org/pdf/2203.07814v1.pdf), for instance, was fine-tuned on [CodeContests](https://github.com/deepmind/code_contests), a competitive programming dataset for machine learning. Another popular dataset is [The Pile](https://huggingface.co/datasets/the_pile), which contains code alongside a large proportion of natural-language text. This mix can be useful for models intended to translate between natural language and code; it was used to train [CodeGen](https://arxiv.org/pdf/2203.13474.pdf), for instance. For model-specific information about the pretraining dataset, please select a model below: