CodeParrot
AI & ML interests
Language models for code.
Check the new instruction-tuning resources:
- InstructHumanEval: a variant of the HumanEval benchmark adapted for instruction-tuned models: InstructHumanEval
- Full Curated CoNaLa: we used UL2 to rewrite more than 590k uncurated intents in the CoNaLa dataset: conala-mined-curated
- Self-Instruct with StarCoder: we release a self-instruct dataset generated with StarCoder, as well as the code we used to build it: self-instruct-starcoder
- Models trained on CoNaLa and self-instruct StarCoder: we release the models we trained on the previous two datasets (a loading sketch for these datasets follows this list).
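These datasets can be pulled directly with the 🤗 `datasets` library. A minimal sketch, assuming the repos live under the `codeparrot` namespace on the Hub (check each dataset card for the exact repo id and split names):

```python
from datasets import load_dataset

# Repo ids below are inferred from the resource names above (assumptions);
# see the dataset cards on the Hub for the exact paths and splits.
instruct_humaneval = load_dataset("codeparrot/instructhumaneval")
conala_curated = load_dataset("codeparrot/conala-mined-curated")
self_instruct = load_dataset("codeparrot/self-instruct-starcoder")

# Inspect the available splits and columns before use
print(instruct_humaneval)
print(conala_curated)
print(self_instruct)
```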
This organization is dedicated to language models for code generation. In particular, CodeParrot is a GPT-2 model trained to generate Python code. For advanced Code Language Models and pre-training datasets we recommend checking our work in the BigCode organization.

Here you can find:
- Interactive blog: where we compare different code models and explain how they are trained and evaluated: Code generation with 🤗
- Spaces:
  - Code generation with: CodeParrot (1.5B), InCoder (6B) and CodeGen (6B)
  - Spaces for some code downstream tasks: algorithmic complexity prediction (BigO), code explanation, and code generation from English text.
- Models: CodeParrot (1.5B) and CodeParrot-small (110M); each repo has different ongoing experiments in its branches (a generation sketch follows the dataset list below).
- Metrics: APPS metric for the evaluation of code models on the APPS benchmark (a usage sketch follows the dataset list below).
- Datasets:
  1- codeparrot-clean, the dataset on which we trained and evaluated CodeParrot; the splits are available under codeparrot-clean-train and codeparrot-clean-valid.
  2- A more filtered version of codeparrot-clean, available under codeparrot-train-more-filtering and codeparrot-valid-more-filtering.
  3- The CodeParrot dataset after near deduplication (initially only exact-match deduplication was performed), available under codeparrot-train-near-deduplication and codeparrot-valid-near-deduplication.
  4- The CodeParrot dataset after both near deduplication and the additional filtering, available under codeparrot-train-v2-near-dedup and codeparrot-valid-v2-near-dedup.
  5- GitHub-Code, a 1TB dataset of 32 programming languages from GitHub files (a streaming sketch follows this list).
  6- GitHub-Code-Clean, a cleaner version of the GitHub-Code dataset.
  7- GitHub-Jupyter, a 16.3GB dataset of Jupyter Notebooks from BigQuery GitHub.
  8- github-jupyter-text-code-pairs, a dataset of text and code pairs extracted from Jupyter notebooks; it is a parsed version of the github-jupyter dataset.
  9- APPS, a benchmark for code generation with 10,000 problems.
  10- CodeComplex, an annotated dataset of 4,200 Java codes and their time complexity.
  11- XLCOST-text-to-code, a subset of the XLCoST benchmark for text-to-code generation at snippet level and program level in 7 programming languages: Python, C, C#, C++, Java, Javascript and PHP.
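To try the models, here is a minimal generation sketch with 🤗 `transformers`, assuming the checkpoints are published as `codeparrot/codeparrot` and `codeparrot/codeparrot-small`:

```python
from transformers import pipeline

# The small 110M checkpoint keeps the example light; swap in
# "codeparrot/codeparrot" for the 1.5B model (checkpoint ids assumed).
pipe = pipeline("text-generation", model="codeparrot/codeparrot-small")

prompt = "def fibonacci(n):"
outputs = pipe(prompt, max_new_tokens=64, do_sample=True, temperature=0.2)
print(outputs[0]["generated_text"])
```

The APPS metric can be loaded through the 🤗 `evaluate` library. The space id and the `compute` arguments below are assumptions; consult the metric card for the exact interface:

```python
import evaluate

# Assumed space id; check the Hub for the exact path.
apps_metric = evaluate.load("codeparrot/apps_metric")

# One list of candidate solutions per APPS problem (a single toy candidate here);
# the `level` argument is assumed to select the problem difficulty split.
generations = [["def solve():\n    pass"]]
results = apps_metric.compute(predictions=generations, level="introductory")
print(results)
```

Finally, since GitHub-Code weighs in at 1TB, streaming avoids downloading the whole dataset. The `languages` and `licenses` filter arguments follow the dataset card, but treat them as assumptions:

```python
from datasets import load_dataset

# Stream Python files under the MIT license instead of downloading 1TB
ds = load_dataset(
    "codeparrot/github-code",
    split="train",
    streaming=True,
    languages=["Python"],
    licenses=["mit"],
)

# Lazily pull a single example from the stream
sample = next(iter(ds))
print(sample["repo_name"], sample["path"], sample["license"])
print(sample["code"][:200])
```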