loubnabnl and osanseviero committed
Commit 1e77c56
1 Parent(s): 73353ee

High-level review (#2)


- Review blog post (c26d8af61097bd77ce5471ffaee24ccc762e9a70)


Co-authored-by: Omar Sanseviero <osanseviero@users.noreply.huggingface.co>

app.py CHANGED
@@ -62,40 +62,40 @@ def generate_code_threads(
62
 
63
 
64
  st.set_page_config(page_icon=":laptop:", layout="wide")
65
- with open("utils/table_contents.txt", "r") as f:
66
  contents = f.read()
67
  st.sidebar.markdown(contents)
68
 
69
  # Introduction
70
  st.title("Code generation with πŸ€—")
71
- read_markdown("utils/intro.txt")
72
 
73
  # Code datasets
74
  st.subheader("1 - Code datasets")
75
- read_markdown("datasets/intro.txt")
76
- read_markdown("datasets/github_code.txt")
77
  col1, col2 = st.columns([1, 2])
78
  with col1:
79
  selected_model = st.selectbox("", MODELS, key=1)
80
- read_markdown(f"datasets/{selected_model.lower()}.txt")
81
 
82
 
83
  # Model architecture
84
  st.subheader("2 - Model architecture")
85
- read_markdown("architectures/intro.txt")
86
  col1, col2 = st.columns([1, 2])
87
  with col1:
88
  selected_model = st.selectbox("", MODELS, key=2)
89
- read_markdown(f"architectures/{selected_model.lower()}.txt")
90
 
91
  # Model evaluation
92
  st.subheader("3 - Code models evaluation")
93
- read_markdown("evaluation/intro.txt")
94
- read_markdown("evaluation/demo_humaneval.txt")
95
 
96
  # Code generation
97
  st.subheader("4 - Code generation ✨")
98
- read_markdown("generation/intro.txt")
99
  col1, col2, col3 = st.columns([7, 1, 6])
100
  with col1:
101
  st.markdown("**Models**")
62
 
63
 
64
  st.set_page_config(page_icon=":laptop:", layout="wide")
65
+ with open("utils/table_contents.md", "r") as f:
66
  contents = f.read()
67
  st.sidebar.markdown(contents)
68
 
69
  # Introduction
70
  st.title("Code generation with πŸ€—")
71
+ read_markdown("utils/intro.md")
72
 
73
  # Code datasets
74
  st.subheader("1 - Code datasets")
75
+ read_markdown("datasets/intro.md")
76
+ read_markdown("datasets/github_code.md")
77
  col1, col2 = st.columns([1, 2])
78
  with col1:
79
  selected_model = st.selectbox("", MODELS, key=1)
80
+ read_markdown(f"datasets/{selected_model.lower()}.md")
81
 
82
 
83
  # Model architecture
84
  st.subheader("2 - Model architecture")
85
+ read_markdown("architectures/intro.md")
86
  col1, col2 = st.columns([1, 2])
87
  with col1:
88
  selected_model = st.selectbox("", MODELS, key=2)
89
+ read_markdown(f"architectures/{selected_model.lower()}.md")
90
 
91
  # Model evaluation
92
  st.subheader("3 - Code models evaluation")
93
+ read_markdown("evaluation/intro.md")
94
+ read_markdown("evaluation/demo_humaneval.md")
95
 
96
  # Code generation
97
  st.subheader("4 - Code generation ✨")
98
+ read_markdown("generation/intro.md")
99
  col1, col2, col3 = st.columns([7, 1, 6])
100
  with col1:
101
  st.markdown("**Models**")
architectures/{codegen.txt → codegen.md} RENAMED
@@ -1,18 +1,18 @@
1
- [CodeGen](https://huggingface.co/Salesforce/codegen-16B-mono) architecture follows a standard transformer decoder with left-to-right causal masking. With rotary position embedding for the positional encoding [(Su et al., 2021)](https://arxiv.org/abs/2104.09864), and a context length of 2048. CodeGen models are trained in various sizes.
2
 
3
  <div align="center">
4
 
5
  |Model | # parameters |
6
  | - | - |
7
- | Decoder | 350M |
8
- | Decoder | 2.7B |
9
- | Decoder | 6.1B |
10
- | Decoder | 16.1B |
11
 
12
  </div>
13
 
14
 
15
- You can load the model and tokenizer directly from [`transformers`](https://huggingface.co/docs/transformers/index):
16
 
17
  ```python
18
  from transformers import AutoTokenizer, AutoModelForCausalLM
1
+ The CodeGen architecture follows a standard transformer decoder with left-to-right causal masking. It uses rotary position embeddings [(Su et al., 2021)](https://arxiv.org/abs/2104.09864) for positional encoding and has a context length of 2048. CodeGen models are trained in various sizes.
2
 
3
  <div align="center">
4
 
5
  |Model | # parameters |
6
  | - | - |
7
+ | [Salesforce/codegen-350M-mono](https://huggingface.co/Salesforce/codegen-350M-mono) | 350M |
8
+ | [Salesforce/codegen-2B-mono](https://huggingface.co/Salesforce/codegen-2B-mono) | 2.7B |
9
+ | [Salesforce/codegen-6B-mono](https://huggingface.co/Salesforce/codegen-6B-mono) | 6.1B |
10
+ | [Salesforce/codegen-16B-mono](https://huggingface.co/Salesforce/codegen-16B-mono) | 16.1B |
11
 
12
  </div>
13
 
14
 
15
+ You can load the model and tokenizer directly from 🤗 [`transformers`](https://huggingface.co/docs/transformers/index):
16
 
17
  ```python
18
  from transformers import AutoTokenizer, AutoModelForCausalLM
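The hunk above stops at the import line; a hedged sketch of how such a loading snippet could continue (the checkpoint, prompt and generation settings below are illustrative, not the file's exact content):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative checkpoint: the smallest mono-lingual CodeGen model.
checkpoint = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```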
architectures/{codeparrot.txt → codeparrot.md} RENAMED
@@ -1,11 +1,11 @@
1
- [CodeParrot](https://huggingface.co/lvwerra/codeparrot) uses GPT-2 architecture with BPE tokenizer trained on Python code from the training split of the data, and a context length of 1024. We released this model as an educational tool for training large language models from scratch on code, with detailed tutorials and descriptions of the training process. It makes use of 🤗 [`accelerate`](https://huggingface.co/docs/accelerate/index) for distributed training and mixed precision. See this [blog](https://huggingface.co/blog/codeparrot) and [repo](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot) for more details.
2
 
3
  <div align="center">
4
 
5
  |Model | # parameters |
6
  | - | - |
7
- | GPT2 | 110M |
8
- | GPT2 | 1.5B |
9
 
10
  </div>
11
 
1
+ [CodeParrot](https://huggingface.co/lvwerra/codeparrot) uses the GPT-2 architecture with a BPE tokenizer trained on Python code from the training split of the data, and has a context length of 1024. This model was released as an educational tool for training large language models from scratch on code, with detailed tutorials and descriptions of the training process. It makes use of 🤗 [`accelerate`](https://huggingface.co/docs/accelerate/index) for distributed training and mixed precision. See this [blog](https://huggingface.co/blog/codeparrot) and [repo](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot) for more details.
2
 
3
  <div align="center">
4
 
5
  |Model | # parameters |
6
  | - | - |
7
+ | [codeparrot-small](https://huggingface.co/lvwerra/codeparrot-small) | 110M |
8
+ | [codeparrot](https://huggingface.co/lvwerra/codeparrot) | 1.5B |
9
 
10
  </div>
11
 
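Since CodeParrot is a GPT-2 checkpoint, it can also be tried through the `transformers` pipeline API; a hedged example (checkpoint and prompt are illustrative):

```python
from transformers import pipeline

# Illustrative: generate a completion with the small CodeParrot checkpoint.
pipe = pipeline("text-generation", model="lvwerra/codeparrot-small")
print(pipe("def add(a, b):", max_new_tokens=24)[0]["generated_text"])
```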
architectures/{incoder.txt → incoder.md} RENAMED
@@ -3,8 +3,8 @@
3
 
4
  |Model | # parameters |
5
  | - | - |
6
- | Decoder |1.3B |
7
- | Decoder |6.7B |
8
 
9
  </div>
10
 
@@ -17,7 +17,7 @@ During the training of InCoder, spans of code were randomly masked and moved to
17
 
18
  So in addition to program synthesis (via left-to-right generation), InCoder can also perform editing (via infilling). The model gives promising results in some zero-shot code infilling tasks such as type prediction, variable re-naming and comment generation.
19
 
20
- You can load the model and tokenizer directly from [`transformers`](https://huggingface.co/docs/transformers/index):
21
 
22
  ```python
23
  from transformers import AutoTokenizer, AutoModelWithLMHead
3
 
4
  |Model | # parameters |
5
  | - | - |
6
+ | [facebook/incoder-1B](https://huggingface.co/facebook/incoder-1B) |1.3B |
7
+ | [facebook/incoder-6B](https://huggingface.co/facebook/incoder-6B) |6.7B |
8
 
9
  </div>
10
 
17
 
18
  So in addition to program synthesis (via left-to-right generation), InCoder can also perform editing (via infilling). The model gives promising results in some zero-shot code infilling tasks such as type prediction, variable re-naming and comment generation.
19
 
20
+ You can load the model and tokenizer directly from 🤗 [`transformers`](https://huggingface.co/docs/transformers/index):
21
 
22
  ```python
23
  from transformers import AutoTokenizer, AutoModelWithLMHead
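As with CodeGen, the hunk stops at the import; a hedged sketch of plain left-to-right generation with InCoder, using `AutoModelForCausalLM` in place of the older `AutoModelWithLMHead` (checkpoint and prompt are illustrative, and infilling additionally requires the model's sentinel-token format described in the paper):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative: the 1.3B InCoder checkpoint, used here for left-to-right generation only.
checkpoint = "facebook/incoder-1B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("def count_lines(path):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```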
architectures/{intro.txt → intro.md} RENAMED
@@ -1,2 +1,2 @@
1
 Various architectures are used in code generation models, but most of them use the auto-regressive left-to-right setting, such as GPT. However, InCoder used a decoder-only Transformer with a Causal Masking objective,
2
- that combines both next token prediction and bidirectional context through masking. AlphaCode used an encoder-decoder architecture. For model-specific information about the architecture, please select a model below:
1
 Various architectures are used in code generation models, but most of them use the auto-regressive left-to-right setting, such as GPT. However, InCoder used a decoder-only Transformer with a Causal Masking objective,
2
+ that combines both next token prediction and bidirectional context through masking. AlphaCode used an encoder-decoder architecture. For model-specific information about each architecture, please select a model below:
architectures/{polycoder.txt → polycoder.md} RENAMED
@@ -11,4 +11,4 @@
11
  </div>
12
 
13
 
14
- PolyCoder is currently being integrated in `transformers`. Meanwhile it can be loaded following the instructions in the original Github [repo](https://github.com/vhellendoorn/code-lms#models).
11
  </div>
12
 
13
 
14
+ PolyCoder is currently being integrated into 🤗 `transformers`. Meanwhile, it can be loaded following the instructions in the original GitHub [repo](https://github.com/vhellendoorn/code-lms#models).
datasets/{codegen.txt → codegen.md} RENAMED
@@ -3,7 +3,7 @@
3
  It was sequentially trained on three datasets:
4
  - [The Pile](https://huggingface.co/datasets/the_pile)
5
 - A 341GB subset of Google's [BigQuery dataset](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code) of code files from multiple programming languages, keeping only 6: C, C++, Go, Java, JavaScript, and Python
6
- - 217GB of Python data from Github repositories
7
 
8
  The second and third datasets used the following preprocessing:
9
  - Exact match deduplication
@@ -12,6 +12,3 @@ The second and third datasets used the following preprocessing:
12
  - Average line length < 100 tokens
13
  - Maximum line length < 1000 MB
14
  - Characters being decimal or hexadecimal digits >90%
15
-
16
- **Remark**:
17
- The reported data sizes are after preprocessing.
3
  It was sequentially trained on three datasets:
4
  - [The Pile](https://huggingface.co/datasets/the_pile)
5
 - A 341GB subset of Google's [BigQuery dataset](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code) of code files from multiple programming languages, keeping only 6: C, C++, Go, Java, JavaScript, and Python
6
+ - 217GB of Python data from GitHub repositories
7
 
8
  The second and third datasets used the following preprocessing:
9
  - Exact match deduplication
12
  - Average line length < 100 tokens
13
  - Maximum line length < 1000 MB
14
  - Characters being decimal or hexadecimal digits >90%
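The filtering criteria listed in this file (and the similar lists in the other dataset files) translate naturally into per-file checks; a hedged sketch with thresholds taken from the list above, where the helper name and exact implementation are ours rather than the original preprocessing code:

```python
def keep_file(text: str) -> bool:
    # Illustrative filter mirroring the listed criteria, not the exact CodeGen pipeline.
    lines = text.splitlines() or [""]
    avg_line_length = sum(len(line) for line in lines) / len(lines)
    digit_fraction = sum(c in "0123456789abcdefABCDEF" for c in text) / max(len(text), 1)
    # Keep files whose lines are short on average and that are not mostly numeric dumps.
    return avg_line_length < 100 and digit_fraction <= 0.9
```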
 
 
 
datasets/{codeparrot.txt → codeparrot.md} RENAMED
@@ -1,9 +1,9 @@
1
- [CodeParrot](https://huggingface.co/lvwerra/codeparrot) is a code generation model trained on **50GB** of Python data, after preprocessing, from Github repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
2
  - Exact match deduplication
3
  - Filtering:
4
  - Average line length < 100 tokens
5
  - Maximum line length < 1000 MB
6
- - Alpha numeric characters fraction > 0.25
7
  - Remove auto-generated files (keyword search)
8
 
9
  For more details see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).
1
+ [CodeParrot](https://huggingface.co/lvwerra/codeparrot) is a code generation model trained on **50GB** of pre-processed Python data from GitHub repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
2
  - Exact match deduplication
3
  - Filtering:
4
  - Average line length < 100 tokens
5
  - Maximum line length < 1000 MB
6
+ - Alphanumeric characters fraction > 0.25
7
  - Remove auto-generated files (keyword search)
8
 
9
  For more details see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).
datasets/{github_code.txt → github_code.md} RENAMED
File without changes
datasets/{incoder.txt → incoder.md} RENAMED
@@ -1,13 +1,14 @@
1
- [InCoder](https://huggingface.co/facebook/incoder-6B) is a code generation model that also allows code editing via infilling. It was trained on **216 GB** of data, after preprocessing, from Github and Stackoverflow from 28 programming languages. 52 GB is in Python, 107GB in other programming languages and 57GB is content from Stackoverflow that isn't code.
2
 
3
- The Github data used the following filtering:
4
  - Average line length < 100 tokens
5
  - Maximum line length < 3000 MB
6
  - Alphanumeric characters fraction > 0.4
7
  - Remove auto-generated files (keyword search)
8
 
9
- The second component of the data consists of questions, answers, and comments from StackOverflow, it includes:
10
  - all questions that have at least one answer
11
  - up to ten answers with a non-negative score (sorted by score) per question
12
  - up to five comments per question/answer
 
13
  Exact match deduplication was performed on code files. For more details please refer to this [paper](https://arxiv.org/pdf/2204.05999.pdf).
1
+ [InCoder](https://huggingface.co/facebook/incoder-6B) is a code generation model that also allows code editing via [infilling](https://arxiv.org/pdf/2204.05999.pdf). It was trained on **216 GB** of preprocessed data from GitHub and Stack Overflow covering 28 programming languages: 52 GB is Python, 107 GB is other programming languages and 57 GB is Stack Overflow content that isn't code.
2
 
3
+ The GitHub data was cleaned with the following steps:
4
  - Average line length < 100 tokens
5
  - Maximum line length < 3000 MB
6
  - Alphanumeric characters fraction > 0.4
7
  - Remove auto-generated files (keyword search)
8
 
9
+ The second component of the data consists of questions, answers, and comments from Stack Overflow. It includes:
10
  - all questions that have at least one answer
11
  - up to ten answers with a non-negative score (sorted by score) per question
12
  - up to five comments per question/answer
13
+
14
  Exact match deduplication was performed on code files. For more details please refer to this [paper](https://arxiv.org/pdf/2204.05999.pdf).
datasets/intro.md ADDED
@@ -0,0 +1,3 @@
 
 
 
1
+ Most code models are trained on data from public software repositories hosted on GitHub. Some also include code coupled with natural text from platforms such as Stack Overflow. Additional datasets can be crafted based on the target task of the model. [AlphaCode](https://arxiv.org/pdf/2203.07814v1.pdf), for instance, was fine-tuned on [CodeContests](https://github.com/deepmind/code_contests), a competitive programming dataset for machine learning. Another popular dataset is [The Pile](https://huggingface.co/datasets/the_pile), a large corpus containing both natural language text and code from different sources such as StackExchange dumps and popular (>100 stars) GitHub repositories. It can be useful for models intended to translate between natural text and code in either direction; it was used in [CodeGen](https://arxiv.org/pdf/2203.13474.pdf), for instance.
2
+
3
+ Some other useful datasets available on the 🤗 Hub are [CodeSearchNet](https://huggingface.co/datasets/code_search_net), a corpus of 2 million (comment, code) pairs from open-source libraries hosted on GitHub for several programming languages, and [Mostly Basic Python Problems (mbpp)](https://huggingface.co/datasets/mbpp), a benchmark of around 1,000 crowd-sourced Python programming problems for entry-level programmers, where each problem consists of a task description, a code solution and 3 automated test cases. This dataset was used in the evaluation of [InCoder](https://huggingface.co/facebook/incoder-6B), in addition to [HumanEval](https://huggingface.co/datasets/openai_humaneval), which we will present later.
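Both datasets mentioned above can be pulled straight from the Hub with the 🤗 `datasets` library; a short hedged example (split and field names follow the public mbpp dataset card):

```python
from datasets import load_dataset

# Load the Mostly Basic Python Problems benchmark from the 🤗 Hub.
mbpp = load_dataset("mbpp")
example = mbpp["test"][0]
print(example["text"])       # natural-language task description
print(example["code"])       # reference solution
print(example["test_list"])  # automated test cases
```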
datasets/intro.txt DELETED
@@ -1,3 +0,0 @@
1
- Most code models are trained on data from public software repositories hosted on GitHub. Some also include code coupled with natural text from Stackoverflow for example. Additional datasets can be crafted based on the target task of the model. [Alphacode](https://arxiv.org/pdf/2203.07814v1.pdf), for instance, was fine-tuned on [CodeContests](https://github.com/deepmind/code_contests), a competitive programming dataset for machine-learning. Another popular dataset is [The Pile](https://huggingface.co/datasets/the_pile), it is a large corpus containing both natural language texts and code from different sources such as StackExchange dumps and popular (>100 stars) GitHub repositories. It can be efficient for models intended to do translation from natural text to code or the opposite, it was used in [CodeGen](https://arxiv.org/pdf/2203.13474.pdf) for instance.
2
-
3
- Some other useful datasets that are available on the 🤗 hub are [CodeSearchNet](https://huggingface.co/datasets/code_search_net), a corpus of 2 milllion (comment, code) pairs from opensource libraries hosted on GitHub for several programming languages, and [Mostly Basic Python Problems (mbpp)](https://huggingface.co/datasets/mbpp), a benchmark of around 1,000 crowd-sourced Python programming problems, for entry level programmers, where each problem consists of a task description, code solution and 3 automated test cases, this dataset was used in [InCoder](https://huggingface.co/facebook/incoder-6B) evaluation in addition to [HumanEval](https://huggingface.co/datasets/openai_humaneval) that we will present later.
 
 
 
datasets/polycoder.md ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
1
+ The [PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The authors also trained a code generation model on **254GB** of data, after preprocessing, consisting of popular repositories for 12 popular programming languages with at least 50 stars from GitHub in October 2021. The data used the following preprocessing:
2
+ - Exact match deduplication
3
+ - Filtering:
4
+ - Average line length < 100 tokens
5
+ - Maximum line length < 1000 MB
datasets/polycoder.txt DELETED
@@ -1,5 +0,0 @@
1
- [PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The authors also trained a code generation model on **254GB** of data, after preprocessing, consisting of popular repositories for 12 popular programming languages with at least 50 stars from GitHub in October 2021. The data used the following preprocessing:
2
- - Exact match deduplication
3
- - Filtering:
4
- - Average line length < 100 tokens
5
- - Maximum line length < 1000 MB
 
 
 
 
 
evaluation/{demo_humaneval.txt → demo_humaneval.md} RENAMED
File without changes
evaluation/{intro.txt → intro.md} RENAMED
File without changes
generation/{intro.txt → intro.md} RENAMED
@@ -1,4 +1,4 @@
1
  In this section you can prompt the following models to generate Python code: CodeParrot 1.5B, InCoder 6.7B and CodeGen 6.1B.
2
 
3
- * For CodeGen, there's a larger [model](https://huggingface.co/Salesforce/codegen-16B-mono) available on the 🤗 hub with 16.1B parameters, but we use the 6.1B version to have models of comparable size in this demo.
4
  * For InCoder, you can also try the original [demo](https://huggingface.co/spaces/facebook/incoder-demo), which has more tasks and examples.
1
  In this section you can prompt the following models to generate Python code: CodeParrot 1.5B, InCoder 6.7B and CodeGen 6.1B.
2
 
3
+ * For CodeGen, there's a larger [model](https://huggingface.co/Salesforce/codegen-16B-mono) available on the 🤗 Hub with 16.1B parameters, but we use the 6.1B version to have models of comparable size in this demo.
4
  * For InCoder, you can also try the original [demo](https://huggingface.co/spaces/facebook/incoder-demo), which has more tasks and examples.
utils/{intro.txt → intro.md} RENAMED
@@ -1,8 +1,8 @@
1
- This is an **interactive** blog, to give an overview of open-source language models for code generation. We present their code datasets, model architecture and model evaluation along with examples and tips to use the 🤗 hub for this task. At the end of this blog, you will find a **demo** to test and compare code generation across these models ✨.
2
 
3
 
4
  ## Introduction
5
 
6
  The application of language models to code generation has sparked great interest recently. You have probably heard of [Codex](https://arxiv.org/pdf/2107.03374v2.pdf), the model behind [Github Copilot](https://copilot.github.com/), or [AlphaCode](https://arxiv.org/pdf/2203.07814v1.pdf) for competition-level programming. These models aren't open-source, and it is hard to reproduce them with a limited budget and incomplete information about their training. The ML community has luckily contributed some code models to allow for further research.
7
 
8
- However, It can be easy to get lost between models, so at Hugging Face we aim to democratize ML and centralize all information in the 🤗 ecosystem to make the usage of open-source tools easier and more efficient. Code models aren't an exception, you can find all open-source models on the hub, with several code datasets and evaluation metrics. In this blog we will give an overview of these tools and how to use them.
1
+ This is an **interactive** blog that provides an overview of open-source language models for code generation. This post presents code datasets, model architectures and evaluations along with examples and tips to use the 🤗 Hub for this task. At the end of this blog, you will find a **demo** to test and compare code generation across these models directly in the browser! ✨
2
 
3
 
4
  ## Introduction
5
 
6
  The application of language models to code generation has sparked great interest recently. You have probably heard of [Codex](https://arxiv.org/pdf/2107.03374v2.pdf), the model behind [Github Copilot](https://copilot.github.com/), or [AlphaCode](https://arxiv.org/pdf/2203.07814v1.pdf) for competition-level programming. These models aren't open-source, and it is hard to reproduce them with a limited budget and incomplete information about their training. The ML community has luckily contributed some code models to allow for further research.
7
 
8
+ However, it can be easy to get lost between models. At Hugging Face we aim to democratize ML and centralize all information in the 🤗 ecosystem to make the usage of open-source tools easier and more efficient. Code models aren't an exception: you can find all open-source models on the Hub, along with several code datasets and evaluation metrics. In this blog we will give an overview of these tools and how to use them.
utils/{resources.txt → resources.md} RENAMED
File without changes
utils/{table_contents.txt → table_contents.md} RENAMED
File without changes