innovation64 committed on
Commit
594b008
1 Parent(s): 78a866e

Upload 7 files

Files changed (7)
  1. .gitmodules +3 -0
  2. LICENSE +201 -0
  3. MODEL_LICENSE +33 -0
  4. README.md +221 -3
  5. README_zh.md +222 -0
  6. requirements.txt +12 -0
  7. setup.py +26 -0
.gitmodules ADDED
@@ -0,0 +1,3 @@
1
+ [submodule "vscode-extension/codegeex-vscode-extension"]
2
+ path = vscode-extension/codegeex-vscode-extension
3
+ url = git@github.com:CodeGeeX/codegeex-vscode-extension.git
LICENSE ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
MODEL_LICENSE ADDED
@@ -0,0 +1,33 @@
1
+ The CodeGeeX License
2
+
3
+ 1. Definitions
4
+
5
+ “Licensor” means the CodeGeeX Model Team that distributes its Software.
6
+
7
+ “Software” means the CodeGeeX model parameters made available under this license.
8
+
9
+ 2. License Grant
10
+
11
+ Subject to the terms and conditions of this License, the Licensor hereby grants to you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license to use the Software solely for your non-commercial research purposes.
12
+
13
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
14
+
15
+ 3. Restriction
16
+
17
+ You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any commercial, military, or illegal purposes.
18
+
19
+ You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.
20
+
21
+ 4. Disclaimer
22
+
23
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
24
+
25
+ 5. Limitation of Liability
26
+
27
+ EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER BASED IN TORT, NEGLIGENCE, CONTRACT, LIABILITY, OR OTHERWISE WILL ANY LICENSOR BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES, OR ANY OTHER COMMERCIAL LOSSES, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
28
+
29
+ 6. Dispute Resolution
30
+
31
+ This license shall be governed and construed in accordance with the laws of People’s Republic of China. Any dispute arising from or in connection with this License shall be submitted to Haidian District People's Court in Beijing.
32
+
33
+ Note that the license is subject to update to a more comprehensive version. For any questions related to the license and copyright, please contact us at report@aminer.cn.
README.md CHANGED
@@ -1,3 +1,221 @@
1
- ---
2
- license: openrail
3
- ---
1
+ <img src="resources/logo/codegeex_logo.png">
2
+
3
+ <p align="center">
4
+ 🏠 <a href="https://codegeex.cn" target="_blank">Homepage</a> | 📖 <a href="https://models.aminer.cn/codegeex/blog/" target="_blank">Blog</a> | 🪧 <a href="https://models.aminer.cn/codegeex/playground" target="_blank">DEMO</a> | 🤖 <a href="https://models.aminer.cn/codegeex/download/request" target="_blank">Download Model</a> | 📄 <a href="https://arxiv.org/abs/2303.17568" target="_blank">Paper</a> | 🌐 <a href="README_zh.md" target="_blank">中文</a>
5
+ </p>
6
+ <p align="center">
7
+ 🛠 <a href="https://marketplace.visualstudio.com/items?itemName=aminer.codegeex" target="_blank">VS Code</a>, <a href="https://plugins.jetbrains.com/plugin/20587-codegeex" target="_blank">Jetbrains</a>, <a href="https://plugins.jetbrains.com/plugin/20587-codegeex" target="_blank">Cloud Studio</a> supported | 👋 Join our <a href="https://discord.gg/8gjHdkmAN6" target="_blank">Discord</a>, <a href="https://join.slack.com/t/codegeexworkspace/shared_invite/zt-1s118ffrp-mpKKhQD0tKBmzNZVCyEZLw" target="_blank">Slack</a>, <a href="https://t.me/+IipIayJ32B1jOTg1" target="_blank">Telegram</a>, <a href="https://wj.qq.com/s2/11274205/a15b/"target="_blank">WeChat</a>
8
+ </p>
9
+
10
+ <div align="center">
11
+
12
+ <a href="">[![Cloud Studio Template](https://cs-res.codehub.cn/common/assets/icon-badge.svg)](https://cloudstudio.net/templates/h0kvkZvoO0U)</a>
13
+
14
+ </div>
15
+
16
+ - [CodeGeeX: A Multilingual Code Generation Model](#codegeex-a-multilingual-code-generation-model)
17
+ - [News](#news)
18
+ - [Getting Started](#getting-started)
19
+ - [Installation](#installation)
20
+ - [Model Weights](#model-weights)
21
+ - [Inference on GPUs](#inference-on-gpus)
22
+ - [VS Code and Jetbrains Extension Guidance](#vs-code-and-jetbrains-extension-guidance)
23
+ - [CodeGeeX: Architecture, Code Corpus, and Implementation](#codegeex-architecture-code-corpus-and-implementation)
24
+ - [HumanEval-X: A new benchmark for Multilingual Program Synthesis](#humaneval-x-a-new-benchmark-for-multilingual-program-synthesis)
25
+ - [Multilingual Code Generation](#multilingual-code-generation)
26
+ - [Crosslingual Code Translation](#crosslingual-code-translation)
27
+ - [How to use HumanEval-X and contribute to it?](#how-to-use-humaneval-x-and-contribute-to-it)
28
+ - [License](#license)
29
+ - [Citation](#citation)
30
+
31
+ # CodeGeeX: A Multilingual Code Generation Model
32
+
33
+ We introduce CodeGeeX, a large-scale multilingual code generation model with 13 billion parameters, pre-trained on a large code corpus of more than 20 programming languages. As of **June 22**, 2022, CodeGeeX has been trained on more than 850 billion tokens on a cluster of 1,536 [Ascend 910 AI Processors](https://e.huawei.com/en/products/servers/ascend). CodeGeeX has several unique features:
34
+ * **Multilingual Code Generation**: CodeGeeX performs well in generating executable programs in several mainstream programming languages, including Python, C++, Java, JavaScript, Go, etc. [DEMO](https://models.aminer.cn/codegeex)
35
+ * **Crosslingual Code Translation**: CodeGeeX supports the translation of code snippets between different languages. With a single click, CodeGeeX can translate a program into the target language with high accuracy. [DEMO](https://models.aminer.cn/codegeex/codeTranslator)
36
+ * **Customizable Programming Assistant**: CodeGeeX is available in the VS Code extension marketplace **for free**. It supports code completion, explanation, summarization, and more, empowering users with a better coding experience. [VS Code Extension](https://marketplace.visualstudio.com/items?itemName=aminer.codegeex)
37
+ * **Open-Source and Cross-Platform**: All code and model weights are publicly available for research purposes. CodeGeeX supports both Ascend and NVIDIA platforms and runs inference on a single Ascend 910, NVIDIA V100, or A100. [Apply Model Weights](https://models.aminer.cn/codegeex/download/request)
38
+
39
+ **HumanEval-X for Realistic Multilingual Benchmarking.** To help standardize the evaluation of multilingual code generation and translation, we develop and release the **HumanEval-X** benchmark. HumanEval-X is a new multilingual benchmark that contains **820 human-crafted** coding problems in **5** programming languages (Python, C++, Java, JavaScript, and Go), each associated with tests and solutions. [Usage](codegeex/benchmark/README.md) [🤗 Available in HuggingFace](https://huggingface.co/datasets/THUDM/humaneval-x)
40
+
41
+ <img src="resources/en/hx_boxplot.png">
42
+
43
+ <p align="center"><i>CodeGeeX achieves the highest average performance compared with other open-sourced multilingual baselines.</i> </p>
44
+
45
+ ## News
46
+
47
+ * **2023-03-30**: The CodeGeeX paper is now available on [arXiv](https://arxiv.org/abs/2303.17568).
48
+
49
+ * **2023-02-14**: CodeGeeX now supports [Cloud Studio](https://cloudstudio.net/), a fantastic web IDE from Tencent. Click on the badge on top of this page to quickly launch an environment to test CodeGeeX.
50
+
51
+ * **2023-02-13**: Thanks a lot to the [OneFlow](https://github.com/Oneflow-Inc/oneflow) team for adding a OneFlow backend for CodeGeeX's inference (even faster than FasterTransformer under FP16!). Check more details [here](https://github.com/THUDM/CodeGeeX/pull/65).
52
+
53
+ * 🌟 **2023-02**: We are hosting the [CodeGeeX "Coding With AI" Hackathon](https://dorahacks.io/hackathon/codegeex/): design cool applications based on CodeGeeX and win prizes (RTX 4090, DJI drone, etc.)!
54
+
55
+ * **2022-12-31**: We release the FasterTransformer version of CodeGeeX in [codegeex-fastertransformer](https://github.com/CodeGeeX/codegeex-fastertransformer). The INT8-accelerated version reaches an average speed of <15ms/token. Happy new year to everyone!
56
+
57
+ * **2022-12-13**: We release the source code of CodeGeeX VS Code extension in [codegeex-vscode-extension](https://github.com/CodeGeeX/codegeex-vscode-extension). Follow [QuickStart](https://github.com/CodeGeeX/codegeex-vscode-extension/blob/main/doc/quickstart.md) to start development.
58
+
59
+ * **2022-12-11**: CodeGeeX is now available for Jetbrains IDEs (IntelliJ IDEA, PyCharm, GoLand, CLion, etc.); download it [here](https://plugins.jetbrains.com/plugin/20587-codegeex).
60
+
61
+ * **2022-12-04**: We release the source code for quantization (which requires less GPU RAM: 27GB -> 15GB) and model parallelism (which makes it possible to run on multiple GPUs with <8GB RAM each).
62
+
63
+ * **2022-09-30**: We release the cross-platform source code and model weights for both Ascend and NVIDIA platforms.
64
+
65
+ ## Getting Started
66
+
67
+ CodeGeeX was initially implemented in Mindspore and trained on Ascend 910 AI Processors. We provide a torch-compatible version based on [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) to facilitate usage on GPU platforms.
68
+ ### Installation
69
+
70
+ Python 3.7+ / CUDA 11+ / PyTorch 1.10+ / DeepSpeed 0.6+ are required. Install the ``codegeex`` package via:
71
+ ```bash
72
+ git clone git@github.com:THUDM/CodeGeeX.git
73
+ cd CodeGeeX
74
+ pip install -e .
75
+ ```
76
+
77
+ ### Model Weights
78
+
79
+ Apply for and download the model weights through this [link](https://models.aminer.cn/codegeex/download/request). You'll receive an email containing ```urls.txt``` with temporary download links. We recommend using [aria2](https://aria2.github.io/) to download them via the following command (please make sure you have enough disk space for the checkpoint (~26GB)):
80
+ ```bash
81
+ aria2c -x 16 -s 16 -j 4 --continue=true -i urls.txt
82
+ ```
83
+ Run the following command to get the full model weights:
84
+ ```bash
85
+ cat codegeex_13b.tar.gz.* > codegeex_13b.tar.gz
86
+ tar xvf codegeex_13b.tar.gz
87
+ ```
88
+
89
+ ### Inference on GPUs
90
+
91
+ Try generating your first program with CodeGeeX. First, specify the path of the model weights in ``configs/codegeex_13b.sh``. Second, write the prompt (a natural language description or code snippet) into a file, e.g., ``tests/test_prompt.txt``, then run the following script:
92
+ ```bash
93
+ # On a single GPU (with more than 27GB RAM)
94
+ bash ./scripts/test_inference.sh <GPU_ID> ./tests/test_prompt.txt
95
+
96
+ # With quantization (with more than 15GB RAM)
97
+ bash ./scripts/test_inference_quantized.sh <GPU_ID> ./tests/test_prompt.txt
98
+
99
+ # On multiple GPUs (with more than 6GB RAM, need to first convert ckpt to MP_SIZE partitions)
100
+ bash ./scripts/convert_ckpt_parallel.sh <LOAD_CKPT_PATH> <SAVE_CKPT_PATH> <MP_SIZE>
101
+ bash ./scripts/test_inference_parallel.sh <MP_SIZE> ./tests/test_prompt.txt
102
+ ```
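For reference, a minimal prompt file might be created as in the sketch below (the content is illustrative; the ``# language: Python`` tag follows the prefix convention described in the corpus section further down):

```python
# Illustrative sketch: write a short prompt into the file read by the scripts above.
prompt = (
    "# language: Python\n"
    "# Return the sum of the squares of a list of numbers.\n"
    "def sum_of_squares(numbers):\n"
)
with open("tests/test_prompt.txt", "w") as f:
    f.write(prompt)
```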
103
+
104
+ ### VS Code and Jetbrains Extension Guidance
105
+
106
+ Based on CodeGeeX, we also develop free extensions for VS Code and Jetbrains IDEs, with more to come in the future.
107
+
108
+ For VS Code, search "codegeex" in Marketplace or install it [here](https://marketplace.visualstudio.com/items?itemName=aminer.codegeex). Detailed instructions can be found in
109
+ [VS Code Extension Guidance](vscode-extension/README.md). For developers, we have also released the source code in [codegeex-vscode-extension](https://github.com/CodeGeeX/codegeex-vscode-extension); please follow the [QuickStart](https://github.com/CodeGeeX/codegeex-vscode-extension/blob/main/doc/quickstart.md) to start development.
110
+
111
+ For Jetbrains IDEs, search "codegeex" in Plugins or install it [here](https://plugins.jetbrains.com/plugin/20587-codegeex).
112
+ Make sure your IDE version is 2021.1 or later. CodeGeeX now supports IntelliJ IDEA, PyCharm, GoLand, CLion, Android Studio, AppCode, Aqua, DataSpell, DataGrip, Rider, RubyMine, and WebStorm.
113
+
114
+ ## CodeGeeX: Architecture, Code Corpus, and Implementation
115
+
116
+ **Architecture**: CodeGeeX is a large-scale pre-trained programming language model based on the transformer architecture. It is a left-to-right autoregressive decoder that takes code and natural language as input and predicts the probability of the next token. CodeGeeX contains 40 transformer layers with a hidden size of 5,120 for self-attention blocks and 20,480 for feed-forward layers, bringing its total size to 13 billion parameters. It supports a maximum sequence length of 2,048.
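As a rough back-of-envelope check, these sizes are consistent with the quoted 13B figure (a sketch assuming a standard GPT-style decoder layout; biases and layer-norm parameters are ignored, and the 50,400-token vocabulary is taken from the corpus description below):

```python
# Approximate parameter count from the sizes quoted above (GPT-style decoder assumed).
layers, hidden, ffn, vocab, seq_len = 40, 5120, 20480, 50400, 2048

attn_per_layer = 4 * hidden * hidden      # Q, K, V and output projections
mlp_per_layer = 2 * hidden * ffn          # up- and down-projections
embeddings = vocab * hidden + seq_len * hidden

total = layers * (attn_per_layer + mlp_per_layer) + embeddings
print(f"~{total / 1e9:.1f}B parameters")  # ~12.9B, consistent with the quoted 13B
```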
117
+
118
+ <img src="resources/en/codegeex_training.png">
119
+ <p align="center"><i><b>Left:</b> the proportion of programming languages in CodeGeeX's training data.
120
+ <b>Right:</b> the plot of training loss against the training steps of CodeGeeX.</i></p>
121
+
122
+ **Code Corpus**: Our training data contains two parts. The first part comes from open-source code datasets, [The Pile](https://pile.eleuther.ai/) and [CodeParrot](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot). The Pile contains a code subset collected from public GitHub repositories with more than 100 stars, from which we select code in 23 popular programming languages. The second part is supplementary data scraped directly from public GitHub repositories that do not appear in the previous datasets, covering Python, Java, and C++. To obtain data of potentially higher quality, we choose repositories with at least one star and a total size smaller than 10MB. A file is filtered out if it 1) has more than 100 characters per line on average, 2) is automatically generated, 3) has an alphabetic-character ratio below 40%, or 4) is larger than 100KB or smaller than 1KB. To help the model distinguish between languages, we add a language-specific prefix at the beginning of each segment in the form of ``[Comment sign] language: [LANG]``, e.g., ``# language: Python``. For tokenization, we use the same tokenizer as GPT-2 and process whitespace as extra tokens, resulting in a vocabulary of 50,400 tokens. In total, the code corpus contains 23 programming languages and 158.7B tokens.
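As an illustration of this prefix convention, the sketch below prepends a language tag and tokenizes the result with the stock GPT-2 tokenizer from ``transformers``; note that CodeGeeX's actual tokenizer extends GPT-2's with extra whitespace tokens (50,400-token vocabulary), so token counts will differ:

```python
from transformers import GPT2Tokenizer  # stand-in; CodeGeeX extends GPT-2's vocabulary

def tag_segment(code: str, lang: str, comment_sign: str = "#") -> str:
    # Prepend the language-specific prefix, e.g. "# language: Python"
    return f"{comment_sign} language: {lang}\n{code}"

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
prompt = tag_segment("def add(a, b):\n    return a + b\n", "Python")
print(tokenizer.tokenize(prompt)[:8])
```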
123
+
124
+ **Training**: We implement CodeGeeX in [Mindspore 1.7](https://www.mindspore.cn/) and train it on 1,536 Ascend 910 AI Processors (32GB). The model weights are stored in FP16, except that we use FP32 for layer-norm and softmax for higher precision and stability. The entire model consumes about 27GB of memory. To increase training efficiency, we adopt 8-way model parallelism together with 192-way data parallelism, with the ZeRO-2 optimizer enabled. The micro-batch size is 16 and the global batch size reaches 3,072. Moreover, we adopt techniques to further boost training efficiency, including element-wise operator fusion, the fast GELU activation, matrix-multiplication dimension optimization, etc. The entire training process takes nearly two months, spanning from April 18 to June 22, 2022, during which 850B tokens were passed for training, i.e., 5+ epochs.
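These parallelism and batch figures are mutually consistent, as the small check below shows (training duration rounded to roughly 65 days):

```python
# Consistency check of the training configuration quoted above.
model_parallel, data_parallel, micro_batch = 8, 192, 16

devices = model_parallel * data_parallel        # 1,536 Ascend 910 processors
global_batch = data_parallel * micro_batch      # 3,072 samples per step
tokens_per_day = 850e9 / 65                     # ~13B tokens/day over Apr 18 - Jun 22

print(devices, global_batch, f"{tokens_per_day / 1e9:.1f}B tokens/day")
```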
125
+
126
+ ## HumanEval-X: A new benchmark for Multilingual Program Synthesis
127
+ To better evaluate the multilingual ability of code generation models, we propose a new benchmark HumanEval-X. While previous works evaluate multilingual program synthesis under semantic similarity (e.g., [CodeBLEU](https://arxiv.org/abs/2009.10297)) which is often misleading, HumanEval-X evaluates the functional correctness of the generated programs. HumanEval-X consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks.
128
+
129
+ <img src="resources/en/hx_tasks.png">
130
+
131
+ <p align="center"><i>An illustration of tasks supported by <b>HumanEval-X</b>. Declarations, docstrings, and solutions are marked with red, green, and blue respectively. <b>Code generation</b> uses declaration and docstring as input, to generate solution. <b>Code translation</b> uses declaration in both languages and translate the solution in source language to the one in target language.</i></p>
132
+
133
+ In HumanEval-X, every sample in each language contains a declaration, a docstring, and a solution, which can be combined in various ways to support different downstream tasks including generation, translation, summarization, etc. We currently focus on two tasks: **code generation** and **code translation**. For code generation, the model uses the declaration and docstring as input to generate the solution. For code translation, the model uses the declarations in both languages and the solution in the source language as input, to generate a solution in the target language. We remove the docstring during code translation to prevent the model from directly solving the problem. For both tasks, we use the unbiased pass@k metric proposed in [Codex](https://arxiv.org/abs/2107.03374): $\text{pass}@k:= \mathbb{E}[1-\frac{\tbinom{n-c}{k}}{\tbinom{n}{k}}]$, with $n=200$ and $k\in\{1,10,100\}$.
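This unbiased estimator can be computed as in the following sketch (the standard formulation from the Codex paper; the sample counts in the example are hypothetical):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n generated samples per problem, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. n=200 samples per problem, 90 of which pass the tests (hypothetical)
print([round(pass_at_k(200, 90, k), 3) for k in (1, 10, 100)])
```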
134
+
135
+ ### Multilingual Code Generation
136
+
137
+ <img src="resources/en/hx_generattion_radar_horizon.png">
138
+ <p align="center"><i><b>Left</b>: the detailed pass@k (k=1,10,100) performance on code generation task for five languages in HumanEval-X. <b>Right</b>: the average performance of all languages of each model. CodeGeeX achieves the highest average performance compared with InCoder-6.7B, CodeGen-Multi-6B and CodeGen-Multi-16B.</i></p>
139
+
140
+
141
+ We compare CodeGeeX with two other open-source code generation models, [InCoder](https://github.com/dpfried/incoder) (from Meta) and [CodeGen](https://github.com/salesforce/CodeGen) (from Salesforce). Specifically, we consider InCoder-6.7B, CodeGen-Multi-6B, and CodeGen-Multi-16B. CodeGeeX significantly outperforms the smaller-scale models (by 7.5%~16.3%) and is competitive with the larger CodeGen-Multi-16B (average performance 54.76% vs. 54.39%). CodeGeeX achieves the best average performance across languages.
142
+
143
+ ### Crosslingual Code Translation
144
+
145
+ <img src="resources/en/hx_translation.png">
146
+
147
+ <p align="center"><i>Results on HumanEval-X <b>code translation</b> task. Best language-wise performance are <b>bolded</b>.</i></p>
148
+
149
+ We also evaluate the performance of translation across different programming languages. We test the zero-shot performance of CodeGeeX, as well as the fine-tuned CodeGeeX-13B-FT (fine-tuned on the code translation training set of [XLCoST](https://github.com/reddy-lab-code-research/XLCoST); since Go is absent from the original set, we add a small Go set to it). The results indicate that models have preferences for certain languages, e.g., CodeGeeX is good at translating other languages to Python and C++, while CodeGen-Multi-16B is better at translating to JavaScript and Go; this is probably due to differences in the language distribution of the training corpora. Among the 20 translation pairs, we also observe that the performance of A-to-B and B-to-A is always negatively correlated, which might indicate that the current models are still not capable of learning all languages well.
150
+
151
+ ### How to use HumanEval-X and contribute to it?
152
+
153
+ For more details on how to use HumanEval-X, please see the [usage guide](codegeex/benchmark/README.md). We warmly welcome the community to contribute to HumanEval-X by adding more problems or extending it to other languages; please check out the [standard format](codegeex/benchmark/README.md#how-to-use-humaneval-x) of HumanEval-X and open a pull request.
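As a starting point, HumanEval-X can also be loaded directly from the Hugging Face Hub. The sketch below assumes the per-language configuration names (``python``, ``cpp``, ``java``, ``js``, ``go``) and HumanEval-style fields (``task_id``, ``prompt``); please check the dataset card if they differ:

```python
from datasets import load_dataset  # pip install datasets

# Configuration and field names are assumptions based on the HumanEval convention.
humaneval_x = load_dataset("THUDM/humaneval-x", "python", split="test")
task = humaneval_x[0]
print(task["task_id"])
print(task["prompt"][:200])  # declaration + docstring used as the generation input
```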
154
+
155
+ Please kindly let us know if you have any comments or suggestions via [codegeex@aminer.cn](mailto:codegeex@aminer.cn).
156
+
157
+ <details>
158
+ <summary><b>Examples of Generation</b></summary>
159
+ <img src="resources/en/hx_examples.png">
160
+ </details>
161
+
162
+ <details>
163
+ <summary><b>Acknowledgement</b></summary>
164
+ <br/>
165
+ This project is supported by the National Science Fund for Distinguished Young Scholars (No. 61825602).
166
+
167
+ ### Lead Contributors
168
+
169
+ Qinkai Zheng ([Tsinghua KEG](http://keg.cs.tsinghua.edu.cn/glm-130b/)), Xiao Xia (Tsinghua KEG), Xu Zou (Tsinghua KEG)
170
+
171
+ ### Contributors
172
+
173
+ Tsinghua KEG---The Knowledge Engineering Group at Tsinghua: Aohan Zeng, Wendi Zheng, Lilong Xue
174
+
175
+ Zhilin Yang's Group at Tsinghua IIIS: Yifeng Liu, Yanru Chen, Yichen Xu (BUPT, work was done when visiting Tsinghua)
176
+
177
+ Peng Cheng Laboratory: Qingyu Chen, Zhongqi Li, Gaojun Fan
178
+
179
+ Zhipu\.AI: Yufei Xue, Shan Wang, Jiecai Shan, Haohan Jiang, Lu Liu, Xuan Xue, Peng Zhang
180
+
181
+ Ascend and Mindspore Team: Yifan Yao, Teng Su, Qihui Deng, Bin Zhou
182
+
183
+ ### Data Annotations
184
+
185
+ Ruijie Cheng (Tsinghua), Peinan Yu (Tsinghua), Jingyao Zhang (Zhipu\.AI), Bowen Huang (Zhipu\.AI), Shaoyu Wang (Zhipu\.AI)
186
+
187
+ ### Advisors
188
+
189
+ [Zhilin Yang](https://kimiyoung.github.io/) (Tsinghua IIIS), Yuxiao Dong (Tsinghua KEG), Wenguang Chen (Tsinghua PACMAN), Jie Tang (Tsinghua KEG)
190
+
191
+
192
+ ### Computation Sponsors
193
+
194
+ [Peng Cheng Laboratory](https://www.pcl.ac.cn/index.html)
195
+
196
+ [Zhipu.AI](https://www.zhipu.ai/)---an AI startup that aims to teach machines to think like humans
197
+
198
+ ### Project Leader
199
+
200
+ [Jie Tang](http://keg.cs.tsinghua.edu.cn/jietang/) (Tsinghua KEG & BAAI)
201
+ </details>
202
+
203
+ ## License
204
+
205
+ Our code is licensed under the [Apache-2.0 license](LICENSE).
206
+ Our model weights are licensed under the [Model License](MODEL_LICENSE).
207
+
208
+ ## Citation
209
+
210
+ If you find our work useful, please cite:
211
+
212
+ ```
213
+ @misc{zheng2023codegeex,
214
+ title={CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X},
215
+ author={Qinkai Zheng and Xiao Xia and Xu Zou and Yuxiao Dong and Shan Wang and Yufei Xue and Zihan Wang and Lei Shen and Andi Wang and Yang Li and Teng Su and Zhilin Yang and Jie Tang},
216
+ year={2023},
217
+ eprint={2303.17568},
218
+ archivePrefix={arXiv},
219
+ primaryClass={cs.LG}
220
+ }
221
+ ```
README_zh.md ADDED
@@ -0,0 +1,222 @@
1
+ <img src="resources/logo/codegeex_logo.png">
2
+
3
+ <p align="center">
4
+ 🏠 <a href="https://models.aminer.cn/codegeex/zh-CN" target="_blank">主页</a> | 📖 <a href="https://models.aminer.cn/codegeex/blog/index_zh.html" target="_blank">博客</a> | 🪧 <a href="https://models.aminer.cn/codegeex/zh-CN/playground" target="_blank">示例</a> | 🤖 <a href="https://models.aminer.cn/codegeex/download/request" target="_blank">模型下载</a> | 📄 <a href="https://arxiv.org/abs/2303.17568" target="_blank">论文</a> | 🌐 <a href="https://github.com/THUDM/CodeGeeX/blob/main/README.md" target="_blank">English</a>
5
+ </p>
6
+ <p align="center">
7
+ 🛠 <a href="https://marketplace.visualstudio.com/items?itemName=aminer.codegeex" target="_blank">VS Code</a>, <a href="https://plugins.jetbrains.com/plugin/20587-codegeex" target="_blank">Jetbrains</a>, <a href="https://plugins.jetbrains.com/plugin/20587-codegeex" target="_blank">Cloud Studio</a> 插件 | 👋 欢迎加入 <a href="https://wj.qq.com/s2/11274205/a15b/"target="_blank">微信开发者交流群</a>
8
+ </p>
9
+
10
+ <div align="center">
11
+
12
+ <a href="">[![Cloud Studio Template](https://cs-res.codehub.cn/common/assets/icon-badge.svg)](https://cloudstudio.net/templates/h0kvkZvoO0U)</a>
13
+
14
+ </div>
15
+
16
+ - [CodeGeeX: 多语言代码生成模型](#codegeex-多语言代码生成模型)
17
+ - [新闻](#新闻)
18
+ - [使用指南](#使用指南)
19
+ - [安装](#安装)
20
+ - [模型权重](#模型权重)
21
+ - [用GPU进行推理](#用gpu进行推理)
22
+ - [插件使用指南](#插件使用指南)
23
+ - [CodeGeeX: 多语言代码生成模型](#codegeex-多语言代码生成模型-1)
24
+ - [国产平台实现与训练](#国产平台实现与训练)
25
+ - [HumanEval-X: 多语言代码生成基准](#humaneval-x-多语言代码生成基准)
26
+ - [多语言代码生成](#多语言代码生成)
27
+ - [跨语言代码翻译](#跨语言代码翻译)
28
+ - [许可证](#许可证)
29
+ - [引用](#引用)
30
+ # CodeGeeX: 多语言代码生成模型
31
+
32
+ CodeGeeX是一个具有130亿参数的多编程语言代码生成预训练模型。CodeGeeX采用华为MindSpore框架实现,在鹏城实验室“鹏城云脑II”中的192个节点(共1536个国产[昇腾910 AI处理器](https://e.huawei.com/cn/products/servers/ascend))上训练而成。截至2022年6月22日,CodeGeeX历时两个月在20多种编程语言的代码语料库(>8500亿Token)上预训练得到。CodeGeeX有以下特点:
33
+ * **高精度代码生成**:支持生成Python、C++、Java、JavaScript和Go等多种主流编程语言的代码,在HumanEval-X代码生成任务上取得47%~60%求解率,较其他开源基线模型有更佳的平均性能。[代码生成示例](https://models.aminer.cn/codegeex/zh-CN)
34
+ * **跨语言代码翻译**:支持代码片段在不同编程语言间进行自动翻译转换,翻译结果正确率高,在HumanEval-X代码翻译任务上超越了其它基线模型。[代码翻译示例](https://models.aminer.cn/codegeex/zh-CN/codeTranslator)
35
+ * **自动编程插件**:CodeGeeX插件现已上架VSCode插件市场(完全免费),用户可以通过其强大的少样本生成能力,自定义代码生成风格和能力,更好辅助代码编写。[插件下载](https://marketplace.visualstudio.com/items?itemName=aminer.codegeex)
36
+ * **模型跨平台开源**: 所有代码和模型权重开源开放,用作研究用途。CodeGeeX同时支持昇腾和英伟达平台,可在单张昇腾910或英伟达V100/A100上实现推理。[申请模型权重](https://models.aminer.cn/codegeex/download/request)
37
+
38
+ **全新多编程语言评测基准HumanEval-X**:HumanEval-X是第一个支持功能正确性评测的多语言、多任务的基准,包含820个人工编写的高质量代码生成题目、测试用例与参考答案,覆盖5种编程语言(Python、C++、Java、JavaScript、Go),支持代码生成与代码翻译能力的评测。[如何使用](codegeex/benchmark/README_zh.md)
39
+
40
+ <img src="resources/zh/hx_boxplot_zh.png">
41
+
42
+ <p align="center"><i>在HumanEval-X代码生成任务上,与其它开源基线模型相比,CodeGeeX取得了最佳的平均性能。</i> </p>
43
+
44
+ ## 新闻
45
+
46
+ * **2023-03-30**: CodeGeeX 论文现已发表在 [arxiv](https://arxiv.org/abs/2303.17568)。
47
+
48
+ * **2023-02-14**: CodeGeeX 现已支持 [Cloud Studio](https://cloudstudio.net/), 一款腾讯推出、十分好用的在线编辑器。单击此页面顶部的徽章可快速启动环境测试 CodeGeeX。
49
+
50
+ * **2023-02-13**: 感谢 [OneFlow](https://github.com/Oneflow-Inc/oneflow) 加入了oneflow版推理支持,在FP16下比FasterTransformer还要快!更多优化细节请点击[这里](https://github.com/THUDM/CodeGeeX/pull/65).
51
+
52
+ * 🌟 **2023-02**: [CodeGeeX "Coding With AI"黑客松](https://dorahacks.io/hackathon/codegeex/)正在进行中,为CodeGeeX设计应用并赢取奖品(RTX 4090、DJI无人机等)!
53
+
54
+ * **2022-12-31**: 我们在 [codegeex-fastertransformer](https://github.com/CodeGeeX/codegeex-fastertransformer) 中发布了 CodeGeeX 的 FasterTransformer 版本。INT8加速版本达到 <15ms/token 的平均速度。祝大家新年快乐!
55
+
56
+ * **2022-12-13**: 我们开源了VS Code插件源码:[codegeex-vscode-extension](https://github.com/CodeGeeX/codegeex-vscode-extension),参考 [QuickStart](https://github.com/CodeGeeX/codegeex-vscode-extension/blob/main/doc/quickstart.md) 开始开发吧!
57
+
58
+ * **2022-12-11**: CodeGeeX for Jetbrains IDEs已上线,支持IntelliJ IDEA, PyCharm, GoLand, CLion等,[点击下载](https://plugins.jetbrains.com/plugin/20587-codegeex)。
59
+
60
+ * **2022-12-04**: 我们开源了量化代码(需要更少的显存:27GB -> 15GB)以及模型并行代码(可以运行在多个显存至少8GB的GPUs上)。
61
+
62
+ * **2022-09-30**: 我们开源了跨平台代码和模型权重,同时支持昇腾和英伟达平台。
63
+ ## 使用指南
64
+
65
+ CodeGeeX最初使用Mindspore框架实现,并在昇腾910AI芯片上进行训练。为适配更多平台,我们将其转换到[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)框架,支持Pytorch+GPU环境。
66
+ ### 安装
67
+
68
+ 需要Python 3.7+ / CUDA 11+ / PyTorch 1.10+ / DeepSpeed 0.6+,通过以下命令安装 ``codegeex``:
69
+ ```bash
70
+ git clone git@github.com:THUDM/CodeGeeX.git
71
+ cd CodeGeeX
72
+ pip install -e .
73
+ ```
74
+
75
+ ### 模型权重
76
+
77
+ 通过[该链接](https://models.aminer.cn/codegeex/download/request)申请权重,您将收到一个包含临时下载链接文件```urls.txt```的邮件。推荐使用[aria2](https://aria2.github.io/)通过以下命令快速下载(请保证有足够的硬盘空间存放权重(~26GB)):
78
+ ```bash
79
+ aria2c -x 16 -s 16 -j 4 --continue=true -i urls.txt
80
+ ```
81
+ 使用以下命令合并得到完整的权重:
82
+ ```bash
83
+ cat codegeex_13b.tar.gz.* > codegeex_13b.tar.gz
84
+ tar xvf codegeex_13b.tar.gz
85
+ ```
86
+
87
+ ### 用GPU进行推理
88
+
89
+ 尝试使用CodeGeeX模型生成第一个程序吧!首先,在配置文件``configs/codegeex_13b.sh``中写明存放权重的路径。其次,将提示(可以是任意描述或代码片段)写入文件``tests/test_prompt.txt``,运行以下脚本即可开始推理(需指定GPU序号):
90
+ ```bash
91
+ # On a single GPU (with more than 27GB RAM)
92
+ bash ./scripts/test_inference.sh <GPU_ID> ./tests/test_prompt.txt
93
+
94
+ # With quantization (with more than 15GB RAM)
95
+ bash ./scripts/test_inference_quantized.sh <GPU_ID> ./tests/test_prompt.txt
96
+
97
+ # On multiple GPUs (with more than 6GB RAM, need to first convert ckpt to MP_SIZE partitions)
98
+ bash ./scripts/convert_ckpt_parallel.sh <LOAD_CKPT_PATH> <SAVE_CKPT_PATH> <MP_SIZE>
99
+ bash ./scripts/test_inference_parallel.sh <MP_SIZE> ./tests/test_prompt.txt
100
+ ```
101
+
102
+ ### 插件使用指南
103
+
104
+ 基于CodeGeeX,我们开发了免费的插件,支持 VS Code 与 Jetbrains IDEs,未来会支持更多平台。
105
+
106
+ VS Code版本,在应用市场搜索“codegeex”或通过[该链接](https://marketplace.visualstudio.com/items?itemName=aminer.codegeex)安装。详细的使用指南在[CodeGeeX VS Code插件使用指南](vscode-extension/README_zh.md)。我们也开源了VS Code插件源码:[codegeex-vscode-extension](https://github.com/CodeGeeX/codegeex-vscode-extension),参考[QuickStart](https://github.com/CodeGeeX/codegeex-vscode-extension/blob/main/doc/quickstart_zh.md) 开始开发吧!
107
+
108
+ Jetbrains版本,在Plugins市场搜索“codegeex”或通过[该链接](https://plugins.jetbrains.com/plugin/20587-codegeex)安装。
109
+ 请确保IDE版本在2021.1或更高。CodeGeeX目前支持 IntelliJ IDEA, PyCharm, GoLand, CLion, Android Studio, AppCode, Aqua, DataSpell, DataGrip, Rider, RubyMine, WebStorm。
110
+
111
+ ## CodeGeeX: 多语言代码生成模型
112
+
113
+ **架构**:CodeGeeX是一个基于transformers的大规模预训练编程语言模型。它是一个从左到右生成的自回归解码器,将代码或自然语言标识符(token)作为输入,预测下一个标识符的概率分布。CodeGeeX含有40个transformer层,每层自注意力块的隐藏层维数为5120,前馈层维数为20480,总参数量为130亿。模型支持的最大序列长度为2048。
114
+
115
+ <img src="resources/en/codegeex_training.png">
116
+
117
+ <p align="center"><i><b>左侧:</b>CodeGeeX训练数据中各编程语言占比。
118
+ <b>右侧:</b>CodeGeeX训练损失函数随训练步数下降曲线。</i></p>
119
+
120
+ **语料**:CodeGeeX的训练语料由两部分组成。第一部分是开源代码数据集,[The Pile](https://pile.eleuther.ai/) 与 [CodeParrot](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot)。The Pile包含GitHub上拥有超过100颗星的一部分开源仓库,我们从中选取了23种编程语言的代码。第二部分是补充数据,直接从GitHub开源仓库中爬取Python、Java、C++代码;为了获取高质量数据,我们根据以下准则选取代码仓库:1)至少拥有1颗星;2)总大小<10MB;3)不在此前的开源代码数据集中。我们还去掉了符合下列任一条件的文件:1)平均每行长度大于100字符;2)由自动生成得到;3)含有的字母不足字母表内的40%;4)大于100KB或小于1KB。为了让模型区分不同语言,我们在每个样本的开头加上一个前缀,其形式为``[注释符] language: [语言]``,例如:``# language: Python``。我们使用与GPT-2相同的分词器,并将空格处理为特殊标识符,词表大小为50400。整个代码语料含有23种编程语言、总计1587亿个标识符(不含填充符)。
121
+
122
+ ### 国产平台实现与训练
123
+ 我们在[Mindspore 1.7](https://www.mindspore.cn/)框架上实现了CodeGeeX模型,并使用鹏城实验室的全国产计算平台上进行训练。具体来说,CodeGeeX使用了其一个计算集群中的1536个昇腾910 AI处理器(32GB)进行了两个月左右的训练(2022年4月18日至6月22日)。除了Layer-norm与Softmax使用FP32格式以获得更高的精度与稳定性,模型参数整体使用FP16格式,最终整个模型需要占用约27GB显存。为了增加训练效率,我们使用8路模型并行和192路数据并行的训练策略,微批大小为16、全局批大小为3072,并采用ZeRO-2优化器降低显存占用。
124
+
125
+ 在开发与训练过程中,我们和华为Mindspore团队合作,对MindSpore框架进行了部分优化,进而大幅度提升训练效率。比如,我们发现矩阵乘法的计算时间占比仅为22.9%,大量时间被用于各类其它算子,因此实现了一系列算子融合,包括单元素算子融合、层归一化算子融合、FastGelu与矩阵乘法融合、批量矩阵乘法与加法融合等;再比如我们还对矩阵乘法算子的维度实现自动搜索调优,使其搜索出效率最高的计算维度组合。这些优化为训练速度带来了显著提升,在同等GPU卡数规模下(128卡),昇腾910对CodeGeeX这一模型的训练效率从约为NVIDIA A100的16.7%提升至43%;在千卡规模下,昇腾910训练效率相比自身优化前提升近300%。使用优化后的软硬件训练时,CodeGeeX单日训练量可达到54.3B个标识符(含填充符),证明了国产深度学习平台与工具的快速迭代能力以及强大竞争力。
126
+
127
+ ## HumanEval-X: 多语言代码生成基准
128
+ 为了更好地评测代码生成模型的多语言生成能力,我们构建了一个新基准HumanEval-X。此前,多语言代码生成能力是基于语义相似度(比如[CodeBLEU](https://arxiv.org/abs/2009.10297))衡量的,具有一定误导性;HumanEval-X则可用于衡量生成代码的功能正确性。HumanEval-X包含820个高质量手写样本,覆盖Python、C++、Java、JavaScript、Go,可用于多种任务。
129
+
130
+ <img src="resources/zh/hx_tasks_zh.png">
131
+
132
+ <p align="center"><i><b>HumanEval-X</b>支持的任务示例。<font style='background-color:#F8CECC'>声明</font>、<font style='background-color:#D5E8D4'>描述</font>、<font style='background-color:#DAE8FC'>解答</font>分别用红、绿、蓝色标注。<i>代码生成</i>将声明与描述作为输入,输出解答。<i>代码翻译</i>将两种语言的声明与源语言的解答作为输入,输出目标语言的解答。</i></p>
133
+
134
+ HumanEval-X中每个语言的样本,包含了声明、描述和解答,它们之间的组合可以支持不同的下游任务,包括生成、翻译、概括等。我们目前关注两个任务:**代码生成**与**代码翻译**。对于代码生成任务,模型将函数声明与文档字符串作为输入,输出函数实现;对于代码翻译任务,模型将两种语言的函数声明与源语言的实现作为输入,输出目标语言上的实现。我们在代码翻译任务中不将文档字符串输入模型,以避免模型直接通过描述生成答案。在两种任务下,我们都采用[Codex](https://arxiv.org/abs/2107.03374)所使用的无偏pass@k指标,判断生成代码的功能正确性: $\text{pass}@k:= \mathbb{E}[1-\frac{\tbinom{n-c}{k}}{\tbinom{n}{k}}]$, $n=200$, $k\in(1,10,100)$.
135
+
136
+ ### 多语言代码生成
137
+
138
+ <img src="resources/zh/hx_generattion_radar_horizon_zh.png">
139
+
140
+ <p align="center"><i><b>左侧</b>: HumanEval-X中五种语言具体的pass@k(k=1,10,100)性能。<b>右侧</b>: 模型在所有语言上的平均性能。CodeGeeX的平均表现优于InCoder-6.7B和CodeGen-Multi-6B/16B。</i></p>
141
+
142
+
143
+ 我们将CodeGeeX与另外两个开源代码生成模型进行比较,分别为Meta的[InCoder](https://github.com/dpfried/incoder)与Salesforce的[CodeGen](https://github.com/salesforce/CodeGen),选取InCoder-6.7B、CodeGen-Multi-6B 与 CodeGen-Multi-16B。CodeGeeX能获得最佳的平均性能,显著超越了参数量更小的模型(7.5%~16.3%的提升),与参数量更大的模型CodeGen-Multi-16B表现相当(平均性能 54.76% vs. 54.39%)。
144
+
145
+ ### 跨语言代码翻译
146
+
147
+ <img src="resources/zh/hx_translation_zh.png">
148
+
149
+ <p align="center"><i>HumanEval-X上的<b>代码翻译</b>任务结果。<b>加粗</b>结果表示在每种语言pass@k上的最佳效果。</i></p>
150
+
151
+ 我们还评测了模型在多语言间代码翻译上的性能。对于CodeGeeX,我们评测了未经微调的CodeGeeX-13B与经过微调的CodeGeeX-13B-FT(使用[XLCoST](https://github.com/reddy-lab-code-research/XLCoST)中代码翻译任务的训练集与一部分Go语言数据微调)。如上表显示,模型对特定语言存在偏好,比如CodeGeeX擅长将其他语言翻译为Python与C++,而CodeGen-Multi-16B擅长翻译为JavaScript和Go,这可能是由于训练集中的语料占比存在差异。在20个翻译对中,我们还观察到两种语言互相翻译的表现常常是呈负相关的,这可能说明现有的模型还不足以学好所有的语言。
152
+
153
+
154
+
155
+ <details>
156
+ <summary><b>在线生成与翻译DEMO</b></summary>
157
+ <img src="resources/en/hx_examples.png">
158
+ 我们为上述两个任务开发了DEMO:<a href="https://models.aminer.cn/codegeex/zh-CN/playground" target="_blank">代码生成</a>和<a href="https://models.aminer.cn/codegeex/zh-CN/codeTranslator" target="_blank">代码翻译</a>,欢迎点击体验!
159
+ </details>
160
+
161
+ <details>
162
+ <summary><b>致谢</b></summary>
163
+ <br/>
164
+ 这一项目由国家自然科学基金杰出青年科学基金项目(No. 61825602)支持。
165
+
166
+
167
+ #### 学生负责人
168
+
169
+ 郑勤锴([清华大学知识工程实验室](https://keg.cs.tsinghua.edu.cn/glm-130b/zh/posts/glm-130b/)),夏箫(清华大学知识工程实验室),邹旭(清华大学知识工程实验室)
170
+
171
+ #### 技术贡献
172
+
173
+ 清华大学知识工程实验室:曾奥涵,郑问迪,薛理龙
174
+
175
+ 清华大学交叉信息学院:刘益枫,陈彦儒,徐奕辰(北邮大四访问清华期间研究工作)
176
+
177
+ 鹏城实验室:陈庆玉,李忠琦,范高俊
178
+
179
+ 智谱AI:薛宇飞,王山,陕杰才,姜皓瀚,刘璐,薛旋,张鹏
180
+
181
+ 华为昇腾团队:姚逸璠,苏腾,邓启辉,周斌
182
+
183
+ #### 数据标注
184
+ 程锐杰(清华大学),于沛楠(清华大学),张竞尧(智谱AI),黄铂文(智谱AI),王炤宇(智谱AI)
185
+
186
+ #### 指导教师
187
+
188
+ [杨植麟](https://kimiyoung.github.io/)(清华大学交叉信息学院),东昱晓(清华大学知识工程实验室),陈文光(清华大学PACMAN实验室),[唐杰](http://keg.cs.tsinghua.edu.cn/jietang/)(清华大学知识工程实验室)
189
+
190
+
191
+ #### 计算资源支持
192
+
193
+ [鹏城实验室](https://www.pcl.ac.cn/index.html)
194
+
195
+ [智谱AI](https://www.zhipu.ai/)
196
+
197
+ #### 项目总负责
198
+
199
+ [唐杰](http://keg.cs.tsinghua.edu.cn/jietang/)(清华大学知识工程实验室 & 北京智源人工智能研究院)
200
+ </details>
201
+
202
+ 如果遇到问题或有任何建议,欢迎通过邮件与我们联系[codegeex@aminer.cn](mailto:codegeex@aminer.cn).
203
+
204
+ ## 许可证
205
+
206
+ 代码使用[Apache-2.0许可证](LICENSE)
207
+ 模型使用[许可证](MODEL_LICENSE)
208
+
209
+ ## 引用
210
+
211
+ 如果觉得我们的工作有帮助,欢迎引用以下论文:
212
+
213
+ ```
214
+ @misc{zheng2023codegeex,
215
+ title={CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X},
216
+ author={Qinkai Zheng and Xiao Xia and Xu Zou and Yuxiao Dong and Shan Wang and Yufei Xue and Zihan Wang and Lei Shen and Andi Wang and Yang Li and Teng Su and Zhilin Yang and Jie Tang},
217
+ year={2023},
218
+ eprint={2303.17568},
219
+ archivePrefix={arXiv},
220
+ primaryClass={cs.LG}
221
+ }
222
+ ```
requirements.txt ADDED
@@ -0,0 +1,12 @@
1
+ fire>=0.4.0
2
+ ipython>=8.4.0
3
+ numpy>=1.22.0
4
+ pandas>=1.3.5
5
+ pyzmq>=23.2.1
6
+ regex>=2022.3.15
7
+ setuptools>=58.0.4
8
+ transformers>=4.22.0
9
+ torch>=1.10.0
10
+ tqdm>=4.63.0
11
+ cpm_kernels
12
+ deepspeed>0.6.1
setup.py ADDED
@@ -0,0 +1,26 @@
1
+ from setuptools import setup, find_packages
2
+
3
+ setup(
4
+ name="codegeex",
5
+ py_modules=["codegeex"],
6
+ version="1.0",
7
+ description="CodeGeeX: A Open Multilingual Code Generation Model.",
8
+ author="Qinkai Zheng",
9
+ packages=find_packages(),
10
+ install_requires=[
11
+ "fire>=0.4.0",
12
+ "ipython>=8.4.0",
13
+ "numpy>=1.22.0",
14
+ "pandas>=1.3.5",
15
+ "pyzmq>=23.2.1",
16
+ "regex>=2022.3.15",
17
+ "setuptools>=58.0.4",
18
+ "transformers>=4.22.0,<=4.24.0",
19
+ "tokenizers>=0.11.0,<=0.11.4",
20
+ "torch>=1.10.0",
21
+ "tqdm>=4.63.0",
22
+ "cpm_kernels",
23
+ "deepspeed>0.6.1",
24
+ ],
25
+ entry_points={}
26
+ )