loubnabnl and osanseviero committed
Commit 1e77c56
1 Parent(s): 73353ee

High-level review (#2)


- Review blog post (c26d8af61097bd77ce5471ffaee24ccc762e9a70)


Co-authored-by: Omar Sanseviero <osanseviero@users.noreply.huggingface.co>

app.py CHANGED
@@ -62,40 +62,40 @@ def generate_code_threads(
62
 
63
 
64
  st.set_page_config(page_icon=":laptop:", layout="wide")
65
- with open("utils/table_contents.txt", "r") as f:
66
  contents = f.read()
67
  st.sidebar.markdown(contents)
68
 
69
  # Introduction
70
  st.title("Code generation with πŸ€—")
71
- read_markdown("utils/intro.txt")
72
 
73
  # Code datasets
74
  st.subheader("1 - Code datasets")
75
- read_markdown("datasets/intro.txt")
76
- read_markdown("datasets/github_code.txt")
77
  col1, col2 = st.columns([1, 2])
78
  with col1:
79
  selected_model = st.selectbox("", MODELS, key=1)
80
- read_markdown(f"datasets/{selected_model.lower()}.txt")
81
 
82
 
83
  # Model architecture
84
  st.subheader("2 - Model architecture")
85
- read_markdown("architectures/intro.txt")
86
  col1, col2 = st.columns([1, 2])
87
  with col1:
88
  selected_model = st.selectbox("", MODELS, key=2)
89
- read_markdown(f"architectures/{selected_model.lower()}.txt")
90
 
91
  # Model evaluation
92
  st.subheader("3 - Code models evaluation")
93
- read_markdown("evaluation/intro.txt")
94
- read_markdown("evaluation/demo_humaneval.txt")
95
 
96
  # Code generation
97
  st.subheader("4 - Code generation ✨")
98
- read_markdown("generation/intro.txt")
99
  col1, col2, col3 = st.columns([7, 1, 6])
100
  with col1:
101
  st.markdown("**Models**")
62
 
63
 
64
  st.set_page_config(page_icon=":laptop:", layout="wide")
65
+ with open("utils/table_contents.md", "r") as f:
66
  contents = f.read()
67
  st.sidebar.markdown(contents)
68
 
69
  # Introduction
70
  st.title("Code generation with πŸ€—")
71
+ read_markdown("utils/intro.md")
72
 
73
  # Code datasets
74
  st.subheader("1 - Code datasets")
75
+ read_markdown("datasets/intro.md")
76
+ read_markdown("datasets/github_code.md")
77
  col1, col2 = st.columns([1, 2])
78
  with col1:
79
  selected_model = st.selectbox("", MODELS, key=1)
80
+ read_markdown(f"datasets/{selected_model.lower()}.md")
81
 
82
 
83
  # Model architecture
84
  st.subheader("2 - Model architecture")
85
+ read_markdown("architectures/intro.md")
86
  col1, col2 = st.columns([1, 2])
87
  with col1:
88
  selected_model = st.selectbox("", MODELS, key=2)
89
+ read_markdown(f"architectures/{selected_model.lower()}.md")
90
 
91
  # Model evaluation
92
  st.subheader("3 - Code models evaluation")
93
+ read_markdown("evaluation/intro.md")
94
+ read_markdown("evaluation/demo_humaneval.md")
95
 
96
  # Code generation
97
  st.subheader("4 - Code generation ✨")
98
+ read_markdown("generation/intro.md")
99
  col1, col2, col3 = st.columns([7, 1, 6])
100
  with col1:
101
  st.markdown("**Models**")
architectures/{codegen.txt → codegen.md} RENAMED
@@ -1,18 +1,18 @@
1
- [CodeGen](https://huggingface.co/Salesforce/codegen-16B-mono) architecture follows a standard transformer decoder with left-to-right causal masking. With rotary position embedding for the positional encoding [(Su et al., 2021)](https://arxiv.org/abs/2104.09864), and a context length of 2048. CodeGen models are trained in various sizes.
2
 
3
  <div align="center">
4
 
5
  |Model | # parameters |
6
  | - | - |
7
- | Decoder | 350M |
8
- | Decoder | 2.7B |
9
- | Decoder | 6.1B |
10
- | Decoder | 16.1B |
11
 
12
  </div>
13
 
14
 
15
- You can load the model and tokenizer directly from [`transformers`](https://huggingface.co/docs/transformers/index):
16
 
17
  ```python
18
  from transformers import AutoTokenizer, AutoModelForCausalLM
1
+ The CodeGen architecture follows a standard transformer decoder with left-to-right causal masking. It uses rotary position embeddings [(Su et al., 2021)](https://arxiv.org/abs/2104.09864) for positional encoding and has a context length of 2048. CodeGen models are trained in various sizes.
2
 
3
  <div align="center">
4
 
5
  |Model | # parameters |
6
  | - | - |
7
+ | [Salesforce/codegen-350M-mono](https://huggingface.co/Salesforce/codegen-350M-mono) | 350M |
8
+ | [Salesforce/codegen-2B-mono](https://huggingface.co/Salesforce/codegen-2B-mono) | 2.7B |
9
+ | [Salesforce/codegen-6B-mono](https://huggingface.co/Salesforce/codegen-6B-mono) | 6.1B |
10
+ | [Salesforce/codegen-16B-mono](https://huggingface.co/Salesforce/codegen-16B-mono) | 16.1B |
11
 
12
  </div>
13
 
14
 
15
+ You can load the model and tokenizer directly from 🤗 [`transformers`](https://huggingface.co/docs/transformers/index):
16
 
17
  ```python
18
  from transformers import AutoTokenizer, AutoModelForCausalLM
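The hunk above stops at the import line; a hedged sketch of how such a loading snippet could continue (the checkpoint, prompt and generation settings below are illustrative, not the file's exact content):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative checkpoint: the smallest mono-lingual CodeGen model.
checkpoint = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```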
architectures/{codeparrot.txt → codeparrot.md} RENAMED
@@ -1,11 +1,11 @@
1
- [CodeParrot](https://huggingface.co/lvwerra/codeparrot) uses GPT-2 architecture with BPE tokenizer trained on Python code from the training split of the data, and a context length of 1024. We released this model as an educational tool for training large language models from scratch on code, with detailed tutorials and descriptions of the training process. It makes use of 🤗 [`accelerate`](https://huggingface.co/docs/accelerate/index) for distributed training and mixed precision. See this [blog](https://huggingface.co/blog/codeparrot) and [repo](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot) for more details.
2
 
3
  <div align="center">
4
 
5
  |Model | # parameters |
6
  | - | - |
7
- | GPT2 | 110M |
8
- | GPT2 | 1.5B |
9
 
10
  </div>
11
 
1
+ [CodeParrot](https://huggingface.co/lvwerra/codeparrot) uses the GPT-2 architecture with a BPE tokenizer trained on Python code from the training split of the data, and has a context length of 1024. This model was released as an educational tool for training large language models from scratch on code, with detailed tutorials and descriptions of the training process. It makes use of 🤗 [`accelerate`](https://huggingface.co/docs/accelerate/index) for distributed training and mixed precision. See this [blog](https://huggingface.co/blog/codeparrot) and [repo](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot) for more details.
2
 
3
  <div align="center">
4
 
5
  |Model | # parameters |
6
  | - | - |
7
+ | [codeparrot-small](https://huggingface.co/lvwerra/codeparrot-small) | 110M |
8
+ | [codeparrot](https://huggingface.co/lvwerra/codeparrot) | 1.5B |
9
 
10
  </div>
11
 
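Since CodeParrot is a GPT-2 checkpoint, it can also be tried through the `transformers` pipeline API; a hedged example (checkpoint and prompt are illustrative):

```python
from transformers import pipeline

# Illustrative: generate a completion with the small CodeParrot checkpoint.
pipe = pipeline("text-generation", model="lvwerra/codeparrot-small")
print(pipe("def add(a, b):", max_new_tokens=24)[0]["generated_text"])
```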
architectures/{incoder.txt → incoder.md} RENAMED
@@ -3,8 +3,8 @@
3
 
4
  |Model | # parameters |
5
  | - | - |
6
- | Decoder |1.3B |
7
- | Decoder |6.7B |
8
 
9
  </div>
10
 
@@ -17,7 +17,7 @@ During the training of InCoder, spans of code were randomly masked and moved to
17
 
18
  So in addition to program synthesis (via left-to-right generation), InCoder can also perform editing (via infilling). The model gives promising results in some zero-shot code infilling tasks such as type prediction, variable re-naming and comment generation.
19
 
20
- You can load the model and tokenizer directly from [`transformers`](https://huggingface.co/docs/transformers/index):
21
 
22
  ```python
23
  from transformers import AutoTokenizer, AutoModelWithLMHead
3
 
4
  |Model | # parameters |
5
  | - | - |
6
+ | [facebook/incoder-1B](https://huggingface.co/facebook/incoder-1B) |1.3B |
7
+ | [facebook/incoder-6B](https://huggingface.co/facebook/incoder-6B) |6.7B |
8
 
9
  </div>
10
 
17
 
18
  So in addition to program synthesis (via left-to-right generation), InCoder can also perform editing (via infilling). The model gives promising results in some zero-shot code infilling tasks such as type prediction, variable re-naming and comment generation.
19
 
20
+ You can load the model and tokenizer directly from 🤗 [`transformers`](https://huggingface.co/docs/transformers/index):
21
 
22
  ```python
23
  from transformers import AutoTokenizer, AutoModelWithLMHead
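As with CodeGen, the hunk stops at the import; a hedged sketch of plain left-to-right generation with InCoder, using `AutoModelForCausalLM` in place of the older `AutoModelWithLMHead` (checkpoint and prompt are illustrative, and infilling additionally requires the model's sentinel-token format described in the paper):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative: the 1.3B InCoder checkpoint, used here for left-to-right generation only.
checkpoint = "facebook/incoder-1B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("def count_lines(path):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```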
architectures/{intro.txt → intro.md} RENAMED
@@ -1,2 +1,2 @@
1
 Various architectures are used in code generation models, but most of them use the auto-regressive left-to-right setting, such as GPT. However, InCoder used a decoder-only Transformer with a Causal Masking objective,
2
- that combines both next token prediction and bidirectional context through masking. AlphaCode used an encoder-decoder architecture. For model-specific information about the architecture, please select a model below:
1
 Various architectures are used in code generation models, but most of them use the auto-regressive left-to-right setting, such as GPT. However, InCoder used a decoder-only Transformer with a Causal Masking objective,
2
+ that combines both next token prediction and bidirectional context through masking. AlphaCode used an encoder-decoder architecture. For model-specific information about each architecture, please select a model below:
architectures/{polycoder.txt → polycoder.md} RENAMED
@@ -11,4 +11,4 @@
11
  </div>
12
 
13
 
14
- PolyCoder is currently being integrated in `transformers`. Meanwhile it can be loaded following the instructions in the original Github [repo](https://github.com/vhellendoorn/code-lms#models).
11
  </div>
12
 
13
 
14
+ PolyCoder is currently being integrated into 🤗 `transformers`. Meanwhile, it can be loaded following the instructions in the original GitHub [repo](https://github.com/vhellendoorn/code-lms#models).
datasets/{codegen.txt → codegen.md} RENAMED
@@ -3,7 +3,7 @@
3
  It was sequentially trained on three datasets:
4
  - [The Pile](https://huggingface.co/datasets/the_pile)
5
 - A 341GB subset of Google's [BigQuery dataset](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code) of code files from multiple programming languages, keeping only 6: C, C++, Go, Java, JavaScript, and Python
6
- - 217GB of Python data from Github repositories
7
 
8
  The second and third datasets used the following preprocessing:
9
  - Exact match deduplication
@@ -12,6 +12,3 @@ The second and third datasets used the following preprocessing:
12
  - Average line length < 100 tokens
13
  - Maximum line length < 1000 MB
14
  - Characters being decimal or hexadecimal digits >90%
15
-
16
- **Remark**:
17
- The reported data sizes are after preprocessing.
3
  It was sequentially trained on three datasets:
4
  - [The Pile](https://huggingface.co/datasets/the_pile)
5
 - A 341GB subset of Google's [BigQuery dataset](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code) of code files from multiple programming languages, keeping only 6: C, C++, Go, Java, JavaScript, and Python
6
+ - 217GB of Python data from GitHub repositories
7
 
8
  The second and third datasets used the following preprocessing:
9
  - Exact match deduplication
12
  - Average line length < 100 tokens
13
  - Maximum line length < 1000 MB
14
  - Characters being decimal or hexadecimal digits >90%
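The filtering criteria listed in this file (and the similar lists in the other dataset files) translate naturally into per-file checks; a hedged sketch with thresholds taken from the list above, where the helper name and exact implementation are ours rather than the original preprocessing code:

```python
def keep_file(text: str) -> bool:
    # Illustrative filter mirroring the listed criteria, not the exact CodeGen pipeline.
    lines = text.splitlines() or [""]
    avg_line_length = sum(len(line) for line in lines) / len(lines)
    digit_fraction = sum(c in "0123456789abcdefABCDEF" for c in text) / max(len(text), 1)
    # Keep files whose lines are short on average and that are not mostly numeric dumps.
    return avg_line_length < 100 and digit_fraction <= 0.9
```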
 
 
 
datasets/{codeparrot.txt → codeparrot.md} RENAMED
@@ -1,9 +1,9 @@
1
- [CodeParrot](https://huggingface.co/lvwerra/codeparrot) is a code generation model trained on **50GB** of Python data, after preprocessing, from Github repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
2
  - Exact match deduplication
3
  - Filtering:
4
  - Average line length < 100 tokens
5
  - Maximum line length < 1000 MB
6
- - Alpha numeric characters fraction > 0.25
7
  - Remove auto-generated files (keyword search)
8
 
9
  For more details see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).
1
+ [CodeParrot](https://huggingface.co/lvwerra/codeparrot) is a code generation model trained on **50GB** of pre-processed Python data from GitHub repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
2
  - Exact match deduplication
3
  - Filtering:
4
  - Average line length < 100 tokens
5
  - Maximum line length < 1000 MB
6
+ - Alphanumeric characters fraction > 0.25
7
  - Remove auto-generated files (keyword search)
8
 
9
  For more details see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).
datasets/{github_code.txt → github_code.md} RENAMED
File without changes
datasets/{incoder.txt → incoder.md} RENAMED
@@ -1,13 +1,14 @@
1
- [InCoder](https://huggingface.co/facebook/incoder-6B) is a code generation model that also allows code editing via infilling. It was trained on **216 GB** of data, after preprocessing, from Github and Stackoverflow from 28 programming languages. 52 GB is in Python, 107GB in other programming languages and 57GB is content from Stackoverflow that isn't code.
2
 
3
- The Github data used the following filtering:
4
  - Average line length < 100 tokens
5
  - Maximum line length < 3000 MB
6
  - Alphanumeric characters fraction > 0.4
7
  - Remove auto-generated files (keyword search)
8
 
9
- The second component of the data consists of questions, answers, and comments from StackOverflow, it includes:
10
  - all questions that have at least one answer
11
  - up to ten answers with a non-negative score (sorted by score) per question
12
  - up to five comments per question/answer
 
13
  Exact match deduplication was performed on code files. For more details please refer to this [paper](https://arxiv.org/pdf/2204.05999.pdf).
1
+ [InCoder](https://huggingface.co/facebook/incoder-6B) is a code generation model that also allows code editing via [infilling](https://arxiv.org/pdf/2204.05999.pdf). It was trained on **216 GB** of preprocessed data from GitHub and Stack Overflow covering 28 programming languages: 52 GB is Python, 107 GB is other programming languages and 57 GB is Stack Overflow content that isn't code.
2
 
3
+ The GitHub data was cleaned with the following steps:
4
  - Average line length < 100 tokens
5
  - Maximum line length < 3000 MB
6
  - Alphanumeric characters fraction > 0.4
7
  - Remove auto-generated files (keyword search)
8
 
9
+ The second component of the data consists of questions, answers, and comments from Stack Overflow. It includes:
10
  - all questions that have at least one answer
11
  - up to ten answers with a non-negative score (sorted by score) per question
12
  - up to five comments per question/answer
13
+
14
  Exact match deduplication was performed on code files. For more details please refer to this [paper](https://arxiv.org/pdf/2204.05999.pdf).
datasets/intro.md ADDED
@@ -0,0 +1,3 @@
 
 
 
1
+ Most code models are trained on data from public software repositories hosted on GitHub. Some also include code coupled with natural text from platforms such as Stack Overflow. Additional datasets can be crafted based on the target task of the model. [AlphaCode](https://arxiv.org/pdf/2203.07814v1.pdf), for instance, was fine-tuned on [CodeContests](https://github.com/deepmind/code_contests), a competitive programming dataset for machine learning. Another popular dataset is [The Pile](https://huggingface.co/datasets/the_pile), a large corpus containing both natural language text and code from different sources such as StackExchange dumps and popular (>100 stars) GitHub repositories. It can be useful for models intended to translate between natural text and code in either direction; it was used in [CodeGen](https://arxiv.org/pdf/2203.13474.pdf), for instance.
2
+
3
+ Some other useful datasets available on the 🤗 Hub are [CodeSearchNet](https://huggingface.co/datasets/code_search_net), a corpus of 2 million (comment, code) pairs from open-source libraries hosted on GitHub for several programming languages, and [Mostly Basic Python Problems (mbpp)](https://huggingface.co/datasets/mbpp), a benchmark of around 1,000 crowd-sourced Python programming problems for entry-level programmers, where each problem consists of a task description, a code solution and 3 automated test cases. This dataset was used in the evaluation of [InCoder](https://huggingface.co/facebook/incoder-6B), in addition to [HumanEval](https://huggingface.co/datasets/openai_humaneval), which we will present later.
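Both datasets mentioned above can be pulled straight from the Hub with the 🤗 `datasets` library; a short hedged example (split and field names follow the public mbpp dataset card):

```python
from datasets import load_dataset

# Load the Mostly Basic Python Problems benchmark from the 🤗 Hub.
mbpp = load_dataset("mbpp")
example = mbpp["test"][0]
print(example["text"])       # natural-language task description
print(example["code"])       # reference solution
print(example["test_list"])  # automated test cases
```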
datasets/intro.txt DELETED
@@ -1,3 +0,0 @@
1
- Most code models are trained on data from public software repositories hosted on GitHub. Some also include code coupled with natural text from Stackoverflow for example. Additional datasets can be crafted based on the target task of the model. [Alphacode](https://arxiv.org/pdf/2203.07814v1.pdf), for instance, was fine-tuned on [CodeContests](https://github.com/deepmind/code_contests), a competitive programming dataset for machine-learning. Another popular dataset is [The Pile](https://huggingface.co/datasets/the_pile), it is a large corpus containing both natural language texts and code from different sources such as StackExchange dumps and popular (>100 stars) GitHub repositories. It can be efficient for models intended to do translation from natural text to code or the opposite, it was used in [CodeGen](https://arxiv.org/pdf/2203.13474.pdf) for instance.
2
-
3
- Some other useful datasets that are available on the 🤗 hub are [CodeSearchNet](https://huggingface.co/datasets/code_search_net), a corpus of 2 milllion (comment, code) pairs from opensource libraries hosted on GitHub for several programming languages, and [Mostly Basic Python Problems (mbpp)](https://huggingface.co/datasets/mbpp), a benchmark of around 1,000 crowd-sourced Python programming problems, for entry level programmers, where each problem consists of a task description, code solution and 3 automated test cases, this dataset was used in [InCoder](https://huggingface.co/facebook/incoder-6B) evaluation in addition to [HumanEval](https://huggingface.co/datasets/openai_humaneval) that we will present later.
 
 
 
datasets/polycoder.md ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
1
+ The [PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The authors also trained a code generation model on **254GB** of data, after preprocessing, consisting of popular repositories for 12 popular programming languages with at least 50 stars from GitHub in October 2021. The data used the following preprocessing:
2
+ - Exact match deduplication
3
+ - Filtering:
4
+ - Average line length < 100 tokens
5
+ - Maximum line length < 1000 MB
datasets/polycoder.txt DELETED
@@ -1,5 +0,0 @@
1
- [PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The authors also trained a code generation model on **254GB** of data, after preprocessing, consisting of popular repositories for 12 popular programming languages with at least 50 stars from GitHub in October 2021. The data used the following preprocessing:
2
- - Exact match deduplication
3
- - Filtering:
4
- - Average line length < 100 tokens
5
- - Maximum line length < 1000 MB
 
 
 
 
 
evaluation/{demo_humaneval.txt → demo_humaneval.md} RENAMED
File without changes
evaluation/{intro.txt → intro.md} RENAMED
File without changes
generation/{intro.txt → intro.md} RENAMED
@@ -1,4 +1,4 @@
1
  In this section you can prompt the following models to generate Python code: CodeParrot 1.5B, InCoder 6.7B and CodeGen 6.1B.
2
 
3
- * For CodeGen, there's a larger [model](https://huggingface.co/Salesforce/codegen-16B-mono) available on the 🤗 hub with 16.1B parameters, but we use the 6.1B version to have models of comparable size in this demo.
4
  * For InCoder, you can also try the original [demo](https://huggingface.co/spaces/facebook/incoder-demo), which has more tasks and examples.
1
  In this section you can prompt the following models to generate Python code: CodeParrot 1.5B, InCoder 6.7B and CodeGen 6.1B.
2
 
3
+ * For CodeGen, there's a larger [model](https://huggingface.co/Salesforce/codegen-16B-mono) available on the 🤗 Hub with 16.1B parameters, but we use the 6.1B version to have models of comparable size in this demo.
4
  * For InCoder, you can also try the original [demo](https://huggingface.co/spaces/facebook/incoder-demo), which has more tasks and examples.
utils/{intro.txt → intro.md} RENAMED
@@ -1,8 +1,8 @@
1
- This is an **interactive** blog, to give an overview of open-source language models for code generation. We present their code datasets, model architecture and model evaluation along with examples and tips to use the 🤗 hub for this task. At the end of this blog, you will find a **demo** to test and compare code generation across these models ✨.
2
 
3
 
4
  ## Introduction
5
 
6
  The application of language models to code generation has sparked great interest recently. You have probably heard of [Codex](https://arxiv.org/pdf/2107.03374v2.pdf), the model behind [Github Copilot](https://copilot.github.com/), or [AlphaCode](https://arxiv.org/pdf/2203.07814v1.pdf) for competition-level programming. These models aren't open-source, and it is hard to reproduce them with a limited budget and incomplete information about their training. The ML community has luckily contributed some code models to allow for further research.
7
 
8
- However, It can be easy to get lost between models, so at Hugging Face we aim to democratize ML and centralize all information in the 🤗 ecosystem to make the usage of open-source tools easier and more efficient. Code models aren't an exception, you can find all open-source models on the hub, with several code datasets and evaluation metrics. In this blog we will give an overview of these tools and how to use them.
1
+ This is an **interactive** blog that provides an overview of open-source language models for code generation. This post presents code datasets, model architectures and evaluations along with examples and tips to use the 🤗 Hub for this task. At the end of this blog, you will find a **demo** to test and compare code generation across these models directly in the browser! ✨
2
 
3
 
4
  ## Introduction
5
 
6
  The application of language models to code generation has sparked great interest recently. You have probably heard of [Codex](https://arxiv.org/pdf/2107.03374v2.pdf), the model behind [Github Copilot](https://copilot.github.com/), or [AlphaCode](https://arxiv.org/pdf/2203.07814v1.pdf) for competition-level programming. These models aren't open-source, and it is hard to reproduce them with a limited budget and incomplete information about their training. The ML community has luckily contributed some code models to allow for further research.
7
 
8
+ However, it can be easy to get lost between models. At Hugging Face we aim to democratize ML and centralize all information in the 🤗 ecosystem to make the usage of open-source tools easier and more efficient. Code models aren't an exception: you can find all open-source models on the Hub, along with several code datasets and evaluation metrics. In this blog we will give an overview of these tools and how to use them.
utils/{resources.txt → resources.md} RENAMED
File without changes
utils/{table_contents.txt → table_contents.md} RENAMED
File without changes