High-level review #2
by osanseviero - opened
- app.py +10 -10
- architectures/{codegen.txt → codegen.md} +6 -6
- architectures/{codeparrot.txt → codeparrot.md} +3 -3
- architectures/{incoder.txt → incoder.md} +3 -3
- architectures/{intro.txt → intro.md} +1 -1
- architectures/{polycoder.txt → polycoder.md} +1 -1
- datasets/{codegen.txt → codegen.md} +1 -4
- datasets/{codeparrot.txt → codeparrot.md} +2 -2
- datasets/{github_code.txt → github_code.md} +0 -0
- datasets/{incoder.txt → incoder.md} +4 -3
- datasets/intro.md +3 -0
- datasets/intro.txt +0 -3
- datasets/polycoder.md +5 -0
- datasets/polycoder.txt +0 -5
- evaluation/{demo_humaneval.txt → demo_humaneval.md} +0 -0
- evaluation/{intro.txt → intro.md} +0 -0
- generation/{intro.txt → intro.md} +1 -1
- utils/{intro.txt → intro.md} +2 -2
- utils/{resources.txt → resources.md} +0 -0
- utils/{table_contents.txt → table_contents.md} +0 -0
app.py
CHANGED
@@ -62,40 +62,40 @@ def generate_code_threads(


 st.set_page_config(page_icon=":laptop:", layout="wide")
-with open("utils/table_contents.
+with open("utils/table_contents.md", "r") as f:
     contents = f.read()
 st.sidebar.markdown(contents)

 # Introduction
 st.title("Code generation with 🤗")
-read_markdown("utils/intro.
+read_markdown("utils/intro.md")

 # Code datasets
 st.subheader("1 - Code datasets")
-read_markdown("datasets/intro.
-read_markdown("datasets/github_code.
+read_markdown("datasets/intro.md")
+read_markdown("datasets/github_code.md")
 col1, col2 = st.columns([1, 2])
 with col1:
     selected_model = st.selectbox("", MODELS, key=1)
-read_markdown(f"datasets/{selected_model.lower()}.
+read_markdown(f"datasets/{selected_model.lower()}.md")


 # Model architecture
 st.subheader("2 - Model architecture")
-read_markdown("architectures/intro.
+read_markdown("architectures/intro.md")
 col1, col2 = st.columns([1, 2])
 with col1:
     selected_model = st.selectbox("", MODELS, key=2)
-read_markdown(f"architectures/{selected_model.lower()}.
+read_markdown(f"architectures/{selected_model.lower()}.md")

 # Model evaluation
 st.subheader("3 - Code models evaluation")
-read_markdown("evaluation/intro.
-read_markdown("evaluation/demo_humaneval.
+read_markdown("evaluation/intro.md")
+read_markdown("evaluation/demo_humaneval.md")

 # Code generation
 st.subheader("4 - Code generation ✨")
-read_markdown("generation/intro.
+read_markdown("generation/intro.md")
 col1, col2, col3 = st.columns([7, 1, 6])
 with col1:
     st.markdown("**Models**")
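For reference, the renamed `.md` files above are rendered through a `read_markdown` helper defined elsewhere in `app.py` and not shown in this hunk. A minimal, hypothetical sketch of what such a helper might look like, assuming it simply reads the file and hands its contents to Streamlit:

```python
import streamlit as st


def read_markdown(path: str) -> None:
    # Hypothetical sketch: render the contents of a markdown file in the app.
    with open(path, "r") as f:
        st.markdown(f.read())
```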
architectures/{codegen.txt → codegen.md}
RENAMED
@@ -1,18 +1,18 @@
-
+The CodeGen architecture follows a standard transformer decoder with left-to-right causal masking. It uses rotary position embeddings for the positional encoding [(Su et al., 2021)](https://arxiv.org/abs/2104.09864) and a context length of 2048. CodeGen models are trained in various sizes.

 <div align="center">

 |Model | # parameters |
 | - | - |
-
-
-
-
+| [Salesforce/codegen-350m-mono](https://huggingface.co/Salesforce/codegen-16B-mono) | 350M |
+| [Salesforce/codegen-2B-mono](https://huggingface.co/Salesforce/codegen-16B-mono) | 2.7B |
+| [Salesforce/codegen-6B-mono](https://huggingface.co/Salesforce/codegen-16B-mono) | 6.1B |
+| [Salesforce/codegen-16B-mono](https://huggingface.co/Salesforce/codegen-16B-mono) | 16.1B |

 </div>


-You can load the model and tokenizer directly from [`transformers`](https://huggingface.co/docs/transformers/index):
+You can load the model and tokenizer directly from 🤗 [`transformers`](https://huggingface.co/docs/transformers/index):

 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
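The hunk above cuts off right after the `transformers` import. As a rough illustration of the load-and-generate pattern the file describes (the checkpoint name is taken from the table above; prompt and generation settings are arbitrary):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative only: any of the CodeGen checkpoints listed above would work.
checkpoint = "Salesforce/codegen-2B-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0]))
```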
architectures/{codeparrot.txt → codeparrot.md}
RENAMED
@@ -1,11 +1,11 @@
-[CodeParrot](https://huggingface.co/lvwerra/codeparrot) uses GPT-2 architecture with BPE tokenizer trained on Python code from the training split of the data, and a context length of 1024.
+[CodeParrot](https://huggingface.co/lvwerra/codeparrot) uses GPT-2 architecture with BPE tokenizer trained on Python code from the training split of the data, and a context length of 1024. This model was released as an educational tool for training large language models from scratch on code, with detailed tutorials and descriptions of the training process. It makes use of 🤗 [`accelerate`](https://huggingface.co/docs/accelerate/index) for distributed training and mixed precision. See this [blog](https://huggingface.co/blog/codeparrot) and [repo](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot) for more details.

 <div align="center">

 |Model | # parameters |
 | - | - |
-
-
+| [codeparrot-small](https://huggingface.co/lvwerra/codeparrot-small) | 110M |
+| [codeparrot](https://huggingface.co/lvwerra/codeparrot) | 1.5B |

 </div>

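As a hedged sketch of how the CodeParrot checkpoints in the table could be tried with the `transformers` `pipeline` API (the model choice and prompt are illustrative, not part of this PR):

```python
from transformers import pipeline

# Illustrative: the small checkpoint keeps the example lightweight.
generator = pipeline("text-generation", model="lvwerra/codeparrot-small")
print(generator("def add(a, b):", max_new_tokens=32)[0]["generated_text"])
```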
architectures/{incoder.txt → incoder.md}
RENAMED
@@ -3,8 +3,8 @@

 |Model | # parameters |
 | - | - |
-
-
+| [facebook/incoder-1B](https://huggingface.co/facebook/incoder-1B) | 1.3B |
+| [facebook/incoder-6B](https://huggingface.co/facebook/incoder-6B) | 6.7B |

 </div>

@@ -17,7 +17,7 @@ During the training of InCoder, spans of code were randomly masked and moved to

 So in addition to program synthesis (via left-to-right generation), InCoder can also perform editing (via infilling). The model gives promising results in some zero-shot code infilling tasks such as type prediction, variable re-naming and comment generation.

-You can load the model and tokenizer directly from [`transformers`](https://huggingface.co/docs/transformers/index):
+You can load the model and tokenizer directly from 🤗 [`transformers`](https://huggingface.co/docs/transformers/index):

 ```python
 from transformers import AutoTokenizer, AutoModelWithLMHead
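The file loads InCoder via `AutoModelWithLMHead`; below is a sketch of plain left-to-right generation with the smaller checkpoint from the table. Infilling relies on the model-specific sentinel tokens described in the paper and is not shown here.

```python
from transformers import AutoTokenizer, AutoModelWithLMHead

# Illustrative: left-to-right generation only; see the InCoder paper for infilling.
checkpoint = "facebook/incoder-1B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelWithLMHead.from_pretrained(checkpoint)

inputs = tokenizer("def count_lines(path):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0]))
```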
architectures/{intro.txt → intro.md}
RENAMED
@@ -1,2 +1,2 @@
 Various architectures are used in code generation models, but most of them use the auto-regressive left-to-right setting, such as GPT. However, InCoder used a decoder-only Transformer with a Causal Masking objective,
-that combines both next token prediction and bidirectional context through masking. AlphaCode used an encoder-decoder architecture. For model-specific information about
+that combines both next token prediction and bidirectional context through masking. AlphaCode used an encoder-decoder architecture. For model-specific information about each architecture, please select a model below:
architectures/{polycoder.txt → polycoder.md}
RENAMED
@@ -11,4 +11,4 @@
 </div>


-PolyCoder is currently being integrated in `transformers`. Meanwhile it can be loaded following the instructions in the original
+PolyCoder is currently being integrated in 🤗 `transformers`. Meanwhile, it can be loaded following the instructions in the original GitHub [repo](https://github.com/vhellendoorn/code-lms#models).
datasets/{codegen.txt → codegen.md}
RENAMED
@@ -3,7 +3,7 @@
 It was sequentially trained on three datasets:
 - [The Pile](https://huggingface.co/datasets/the_pile)
 - A 341GB subset of Google's [BigQuery dataset](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code) of code files from multiple programming languages, keeping only 6: C, C++, Go, Java, JavaScript, and Python
-- 217GB of Python data from
+- 217GB of Python data from GitHub repositories

 The second and third datasets used the following preprocessing:
 - Exact match deduplication
@@ -12,6 +12,3 @@ The second and third datasets used the following preprocessing:
 - Average line length < 100 tokens
 - Maximum line length < 1000 MB
 - Characters being decimal or hexadecimal digits >90%
-
-**Remark**:
-The reported data sizes are after preprocessing.
datasets/{codeparrot.txt → codeparrot.md}
RENAMED
@@ -1,9 +1,9 @@
-[CodeParrot](https://huggingface.co/lvwerra/codeparrot) is a code generation model trained on **50GB** of Python data
+[CodeParrot](https://huggingface.co/lvwerra/codeparrot) is a code generation model trained on **50GB** of pre-processed Python data from GitHub repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
 - Exact match deduplication
 - Filtering:
 - Average line length < 100 tokens
 - Maximum line length < 1000 MB
--
+- Alphanumeric characters fraction > 0.25
 - Remove auto-generated files (keyword search)

 For more details see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).
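The exact preprocessing lives in the script linked above; the following is only a rough sketch of the kind of per-file heuristics the list describes. The thresholds mirror the bullets (interpreted here in characters), and the keyword list is an assumption.

```python
AUTO_GENERATED_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def keep_file(content: str) -> bool:
    # Rough sketch of the filters listed above, not the actual CodeParrot code.
    lines = content.splitlines()
    if not content or not lines:
        return False
    mean_line_length = sum(len(line) for line in lines) / len(lines)
    max_line_length = max(len(line) for line in lines)
    alnum_fraction = sum(ch.isalnum() for ch in content) / len(content)
    auto_generated = any(kw in content.lower() for kw in AUTO_GENERATED_KEYWORDS)
    return (
        mean_line_length < 100
        and max_line_length < 1000
        and alnum_fraction > 0.25
        and not auto_generated
    )
```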
datasets/{github_code.txt → github_code.md}
RENAMED
File without changes
datasets/{incoder.txt → incoder.md}
RENAMED
@@ -1,13 +1,14 @@
-[InCoder](https://huggingface.co/facebook/incoder-6B) is a code generation model that also allows code editing via infilling. It was trained on **216 GB** of data
+[InCoder](https://huggingface.co/facebook/incoder-6B) is a code generation model that also allows code editing via [infilling](https://arxiv.org/pdf/2204.05999.pdf). It was trained on **216 GB** of preprocessed data from GitHub and Stack Overflow, covering 28 programming languages: 52 GB is Python, 107 GB is other programming languages, and 57 GB is non-code content from Stack Overflow.

-The
+The GitHub data was cleaned with the following steps:
 - Average line length < 100 tokens
 - Maximum line length < 3000 MB
 - Alphanumeric characters fraction > 0.4
 - Remove auto-generated files (keyword search)

-The second component of the data consists of questions, answers, and comments from
+The second component of the data consists of questions, answers, and comments from Stack Overflow. It includes:
 - all questions that have at least one answer
 - up to ten answers with a non-negative score (sorted by score) per question
 - up to five comments per question/answer
+
 Exact match deduplication was performed on code files. For more details please refer to this [paper](https://arxiv.org/pdf/2204.05999.pdf).
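The Stack Overflow selection rules above are easy to express programmatically. A hypothetical sketch, assuming a simple dict layout for questions, answers and comments (not the actual InCoder pipeline, which also keeps comments per answer):

```python
def select_stackoverflow_content(question):
    # Hypothetical data layout: {"body": str, "answers": [{"score": int, ...}], "comments": [...]}.
    answers = question.get("answers", [])
    if not answers:
        return None  # keep only questions with at least one answer
    kept_answers = sorted(
        (a for a in answers if a["score"] >= 0),  # non-negative score only
        key=lambda a: a["score"],
        reverse=True,
    )[:10]  # up to ten answers, sorted by score
    return {
        "question": question["body"],
        "answers": kept_answers,
        "comments": question.get("comments", [])[:5],  # up to five comments
    }
```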
datasets/intro.md
ADDED
@@ -0,0 +1,3 @@
+Most code models are trained on data from public software repositories hosted on GitHub. Some also include code coupled with natural text from platforms such as Stack Overflow. Additional datasets can be crafted based on the target task of the model. [Alphacode](https://arxiv.org/pdf/2203.07814v1.pdf), for instance, was fine-tuned on [CodeContests](https://github.com/deepmind/code_contests), a competitive programming dataset for machine learning. Another popular dataset is [The Pile](https://huggingface.co/datasets/the_pile), which is a large corpus containing both natural language texts and code from different sources such as StackExchange dumps and popular (>100 stars) GitHub repositories. It can be efficient for models intended to translate from natural text to code or the opposite; it was used in [CodeGen](https://arxiv.org/pdf/2203.13474.pdf), for instance.
+
+Some other useful datasets available on the 🤗 Hub are [CodeSearchNet](https://huggingface.co/datasets/code_search_net), a corpus of 2 million (comment, code) pairs from open-source libraries hosted on GitHub for several programming languages, and [Mostly Basic Python Problems (mbpp)](https://huggingface.co/datasets/mbpp), a benchmark of around 1,000 crowd-sourced Python programming problems for entry-level programmers, where each problem consists of a task description, a code solution and 3 automated test cases. This dataset was used in the [InCoder](https://huggingface.co/facebook/incoder-6B) evaluation, in addition to [HumanEval](https://huggingface.co/datasets/openai_humaneval), which we will present later.
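The datasets named above can be pulled directly with the 🤗 `datasets` library. A small illustrative example, using the dataset IDs as written in the text (splits and record contents may differ from what is printed here):

```python
from datasets import load_dataset

# Illustrative: load two of the datasets mentioned above from the Hub.
mbpp = load_dataset("mbpp")
print(mbpp["train"][0])  # a task description, code solution and test cases

code_search_net = load_dataset("code_search_net", "python")
print(code_search_net["train"][0])  # a (comment, code) pair from a GitHub repository
```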
datasets/intro.txt
DELETED
@@ -1,3 +0,0 @@
-Most code models are trained on data from public software repositories hosted on GitHub. Some also include code coupled with natural text from Stackoverflow for example. Additional datasets can be crafted based on the target task of the model. [Alphacode](https://arxiv.org/pdf/2203.07814v1.pdf), for instance, was fine-tuned on [CodeContests](https://github.com/deepmind/code_contests), a competitive programming dataset for machine-learning. Another popular dataset is [The Pile](https://huggingface.co/datasets/the_pile), it is a large corpus containing both natural language texts and code from different sources such as StackExchange dumps and popular (>100 stars) GitHub repositories. It can be efficient for models intended to do translation from natural text to code or the opposite, it was used in [CodeGen](https://arxiv.org/pdf/2203.13474.pdf) for instance.
-
-Some other useful datasets that are available on the 🤗 hub are [CodeSearchNet](https://huggingface.co/datasets/code_search_net), a corpus of 2 milllion (comment, code) pairs from opensource libraries hosted on GitHub for several programming languages, and [Mostly Basic Python Problems (mbpp)](https://huggingface.co/datasets/mbpp), a benchmark of around 1,000 crowd-sourced Python programming problems, for entry level programmers, where each problem consists of a task description, code solution and 3 automated test cases, this dataset was used in [InCoder](https://huggingface.co/facebook/incoder-6B) evaluation in addition to [HumanEval](https://huggingface.co/datasets/openai_humaneval) that we will present later.
datasets/polycoder.md
ADDED
@@ -0,0 +1,5 @@
+The [PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The authors also trained a code generation model on **254GB** of data, after preprocessing, consisting of popular repositories for 12 popular programming languages with at least 50 stars from GitHub in October 2021. The data used the following preprocessing:
+- Exact match deduplication
+- Filtering:
+- Average line length < 100 tokens
+- Maximum line length < 1000 MB
datasets/polycoder.txt
DELETED
@@ -1,5 +0,0 @@
-[PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The authors also trained a code generation model on **254GB** of data, after preprocessing, consisting of popular repositories for 12 popular programming languages with at least 50 stars from GitHub in October 2021. The data used the following preprocessing:
-- Exact match deduplication
-- Filtering:
-- Average line length < 100 tokens
-- Maximum line length < 1000 MB
evaluation/{demo_humaneval.txt → demo_humaneval.md}
RENAMED
File without changes

evaluation/{intro.txt → intro.md}
RENAMED
File without changes
generation/{intro.txt → intro.md}
RENAMED
@@ -1,4 +1,4 @@
 In this section you can prompt the following models to generate Python code: CodeParrot 1.5B, InCoder 6.7B and CodeGen 6.1B.

-* For CodeGen, there's a larger [model](https://huggingface.co/Salesforce/codegen-16B-mono) available on the 🤗
+* For CodeGen, there's a larger [model](https://huggingface.co/Salesforce/codegen-16B-mono) available on the 🤗 Hub with 16.1B parameters, but we use the 6.1B version to have models of comparable size in this demo.
 * For InCoder, you can also try the original [demo](https://huggingface.co/spaces/facebook/incoder-demo), which has more tasks and examples.
utils/{intro.txt → intro.md}
RENAMED
@@ -1,8 +1,8 @@
-This is an **interactive** blog
+This is an **interactive** blog that provides an overview of open-source language models for code generation. This post presents code datasets, model architectures and evaluations, along with examples and tips for using the 🤗 Hub for this task. At the end of this blog, you will find a **demo** to test and compare code generation across these models directly in the browser! ✨


 ## Introduction

 The application of language models to code generation has sparked great interest recently. You have probably heard of [Codex](https://arxiv.org/pdf/2107.03374v2.pdf), the model behind [Github Copilot](https://copilot.github.com/), or [AlphaCode](https://arxiv.org/pdf/2203.07814v1.pdf) for competition-level programming. These models aren't open-source, and it is hard to reproduce them with a limited budget and incomplete information about their training. The ML community has luckily contributed some code models to allow for further research.

-However,
+However, it can be easy to get lost between models. At Hugging Face, we aim to democratize ML and centralize all information in the 🤗 ecosystem to make the use of open-source tools easier and more efficient. Code models are no exception: you can find all open-source models on the Hub, along with several code datasets and evaluation metrics. In this blog we will give an overview of these tools and how to use them.
utils/{resources.txt → resources.md}
RENAMED
File without changes

utils/{table_contents.txt → table_contents.md}
RENAMED
File without changes