ChadWong committed on
Commit
6cd54d6
1 Parent(s): dd03481

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ resources/long-context.png filter=lfs diff=lfs merge=lfs -text
+ resources/web-demo.gif filter=lfs diff=lfs merge=lfs -text
.github/ISSUE_TEMPLATE/bug_report.yaml ADDED
@@ -0,0 +1,63 @@
+ name: 🐞 Bug/Help
+ description: File a bug/issue
+ title: "[BUG/Help] <title>"
+ labels: []
+ body:
+   - type: checkboxes
+     attributes:
+       label: Is there an existing issue for this?
+       description: Please search to see if an issue already exists for the bug you encountered.
+       options:
+         - label: I have searched the existing issues
+           required: true
+   - type: textarea
+     attributes:
+       label: Current Behavior
+       description: |
+         A concise description of what you're experiencing, with screenshot attached if possible.
+         Tip: You can attach images or log files by clicking this area to highlight it and then dragging files in.
+     validations:
+       required: true
+   - type: textarea
+     attributes:
+       label: Expected Behavior
+       description: A concise description of what you expected to happen.
+     validations:
+       required: false
+   - type: textarea
+     attributes:
+       label: Steps To Reproduce
+       description: Steps to reproduce the behavior.
+       placeholder: |
+         1. In this environment...
+         2. With this config...
+         3. Run '...'
+         4. See error...
+     validations:
+       required: true
+   - type: textarea
+     attributes:
+       label: Environment
+       description: |
+         examples:
+           - **OS**: Ubuntu 20.04
+           - **Python**: 3.8
+           - **Transformers**: 4.26.1
+           - **PyTorch**: 1.12
+           - **CUDA Support**: True
+       value: |
+         - OS:
+         - Python:
+         - Transformers:
+         - PyTorch:
+         - CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :
+       render: markdown
+     validations:
+       required: true
+   - type: textarea
+     attributes:
+       label: Anything else?
+       description: |
+         Links? References? Anything that will give us more context about the issue you are encountering!
+     validations:
+       required: false
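
The Environment field of the template above asks reporters for five values. As a rough sketch (the helper names `collect_environment` and `format_environment` are hypothetical and not part of this commit), the values could be gathered in one step and pasted into the form:

```python
import platform
from importlib import metadata


def collect_environment() -> dict:
    """Gather the five fields requested by the Environment section."""
    info = {
        "OS": platform.platform(),
        "Python": platform.python_version(),
    }
    # Read installed versions from package metadata without importing the packages.
    for label, package in (("Transformers", "transformers"), ("PyTorch", "torch")):
        try:
            info[label] = metadata.version(package)
        except metadata.PackageNotFoundError:
            info[label] = "not installed"
    # Same check as the one-liner in the template, guarded against a missing torch.
    try:
        import torch
        info["CUDA Support"] = str(torch.cuda.is_available())
    except Exception:
        info["CUDA Support"] = "unknown (PyTorch could not be imported)"
    return info


def format_environment(info: dict) -> str:
    """Render the fields as the Markdown list used by the template's `value`."""
    return "\n".join(f"- {key}: {value}" for key, value in info.items())


if __name__ == "__main__":
    print(format_environment(collect_environment()))
```

Running the script prints one `- key: value` line per field, in the same order as the template.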
.github/ISSUE_TEMPLATE/config.yml ADDED
@@ -0,0 +1 @@
+ blank_issues_enabled: false
.github/ISSUE_TEMPLATE/feature_request.yml ADDED
@@ -0,0 +1,26 @@
+ name: Feature request
+ description: Suggest an idea for this project
+ title: "[Feature] <title>"
+ labels: []
+ body:
+   - type: textarea
+     attributes:
+       label: Is your feature request related to a problem? Please describe.
+       description: |
+         A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+     validations:
+       required: false
+   - type: textarea
+     attributes:
+       label: Solutions
+       description: |
+         Describe the solution you'd like
+         A clear and concise description of what you want to happen.
+     validations:
+       required: true
+   - type: textarea
+     attributes:
+       label: Additional context
+       description: Add any other context or screenshots about the feature request here.
+     validations:
+       required: false
FAQ.md ADDED
@@ -0,0 +1,15 @@
+ ## Q1
+
+ **Loading the quantized model directly on a Mac fails with `clang: error: unsupported option '-fopenmp'`**
+
+ This happens because macOS does not ship with OpenMP (omp). The model still runs in this case, but only on a single core. Installing the OpenMP dependency separately makes OMP available on the Mac:
+
+ ```bash
+ # See `https://mac.r-project.org/openmp/`
+ ## Assumption: gcc (clang) is version 14.x; for other versions, see the table provided by R-Project
+ curl -O https://mac.r-project.org/openmp/openmp-14.0.6-darwin20-Release.tar.gz
+ sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
+ ```
+ This installs the following files: `/usr/local/lib/libomp.dylib`, `/usr/local/include/ompt.h`, `/usr/local/include/omp.h`, `/usr/local/include/omp-tools.h`.
+
+ > Note: If a previous attempt to run the `ChatGLM2-6B` project failed, it is best to clear the Hugging Face cache. By default this is `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Because this uses the `rm` command, make sure you know exactly what you are deleting.
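
To confirm that the archive was unpacked where the FAQ entry says it should be, a small check along these lines may help. It is only a sketch: the helper name `missing_openmp_files` is hypothetical, and the path list simply repeats the four files named above.

```python
from pathlib import Path

# The four files the FAQ entry says the OpenMP archive installs.
EXPECTED_OPENMP_FILES = (
    "/usr/local/lib/libomp.dylib",
    "/usr/local/include/ompt.h",
    "/usr/local/include/omp.h",
    "/usr/local/include/omp-tools.h",
)


def missing_openmp_files(paths=EXPECTED_OPENMP_FILES):
    """Return the subset of `paths` that does not exist on this machine."""
    return [path for path in paths if not Path(path).exists()]


if __name__ == "__main__":
    missing = missing_openmp_files()
    if missing:
        print("Missing OpenMP files:", ", ".join(missing))
    else:
        print("All expected OpenMP files are present.")
```

An empty result only shows that the files exist; it does not prove that the compiler picks them up.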
MODEL_LICENSE ADDED
@@ -0,0 +1,65 @@
+ The ChatGLM-6B License
+
+ 一、定义
+
+ “许可方”是指分发其软件的 ChatGLM2-6B 模型团队。
+
+ “软件”是指根据本许可提供的 ChatGLM2-6B 模型参数。
+
+ 2. 许可授予
+
+ 根据本许可的条款和条件,许可方特此授予您非排他性、全球性、不可转让、不可再许可、可撤销、免版税的版权许可,仅用于您的非商业研究目的。
+
+ 上述版权声明和本许可声明应包含在本软件的所有副本或重要部分中。
+
+ 3.限制
+
+ 您不得出于任何商业、军事或非法目的使用、复制、修改、合并、发布、分发、复制或创建本软件的全部或部分衍生作品。
+
+ 您不得利用本软件从事任何危害国家安全和国家统一、危害社会公共利益、侵犯人身权益的行为。
+
+ 4.免责声明
+
+ 本软件“按原样”提供,不提供任何明示或暗示的保证,包括但不限于对适销性、特定用途的适用性和非侵权性的保证。 在任何情况下,作者或版权持有人均不对任何索赔、损害或其他责任负责,无论是在合同诉讼、侵权行为还是其他方面,由软件或软件的使用或其他交易引起、由软件引起或与之相关 软件。
+
+ 5. 责任限制
+
+ 除适用法律禁止的范围外,在任何情况下且根据任何法律理论,无论是基于侵权行为、疏忽、合同、责任或其他原因,任何许可方均不对您承担任何直接、间接、特殊、偶然、示范性、 或间接损害,或任何其他商业损失,即使许可人已被告知此类损害的可能性。
+
+ 6.争议解决
+
+ 本许可受中华人民共和国法律管辖并按其解释。 因本许可引起的或与本许可有关的任何争议应提交北京市海淀区人民法院。
+
+ 请注意,许可证可能会更新到更全面的版本。 有关许可和版权的任何问题,请通过 glm-130b@googlegroups.com 与我们联系。
+
+ 1. Definitions
+
+ “Licensor” means the ChatGLM2-6B Model Team that distributes its Software.
+
+ “Software” means the ChatGLM2-6B model parameters made available under this license.
+
+ 2. License Grant
+
+ Subject to the terms and conditions of this License, the Licensor hereby grants to you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license to use the Software solely for your non-commercial research purposes.
+
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+ 3. Restriction
+
+ You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any commercial, military, or illegal purposes.
+
+ You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.
+
+ 4. Disclaimer
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+ 5. Limitation of Liability
+
+ EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER BASED IN TORT, NEGLIGENCE, CONTRACT, LIABILITY, OR OTHERWISE WILL ANY LICENSOR BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES, OR ANY OTHER COMMERCIAL LOSSES, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
+
+ 6. Dispute Resolution
+
+ This license shall be governed and construed in accordance with the laws of People’s Republic of China. Any dispute arising from or in connection with this License shall be submitted to Haidian District People's Court in Beijing.
+
+ Note that the license is subject to update to a more comprehensive version. For any questions related to the license and copyright, please contact us at glm-130b@googlegroups.com.
README.md CHANGED
@@ -1,12 +1,349 @@
  ---
- title: CharttGLM2 6B
- emoji: 🌖
- colorFrom: purple
- colorTo: blue
+ title: CharttGLM2-6B
+ app_file: web_demo.py
  sdk: gradio
  sdk_version: 3.35.2
- app_file: app.py
- pinned: false
  ---
+ # ChatGLM2-6B

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ <p align="center">
+ 🤗 <a href="https://huggingface.co/THUDM/chatglm2-6b" target="_blank">HF Repo</a> • 🐦 <a href="https://twitter.com/thukeg" target="_blank">Twitter</a> • 📃 <a href="https://arxiv.org/abs/2103.10360" target="_blank">[GLM@ACL 22]</a> <a href="https://github.com/THUDM/GLM" target="_blank">[GitHub]</a> • 📃 <a href="https://arxiv.org/abs/2210.02414" target="_blank">[GLM-130B@ICLR 23]</a> <a href="https://github.com/THUDM/GLM-130B" target="_blank">[GitHub]</a> <br>
+ </p>
+ <p align="center">
+ 👋 Join our <a href="https://join.slack.com/t/chatglm/shared_invite/zt-1y7pqoloy-9b1g6T6JjA8J0KxvUjbwJw" target="_blank">Slack</a> and <a href="resources/WECHAT.md" target="_blank">WeChat</a>
+ </p>
+
+ *Read this in [English](README_EN.md)*
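+
+ ## Introduction
+
+ ChatGLM**2**-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B). It retains the smooth conversation flow, low deployment threshold, and many other strengths of the first-generation model, while introducing the following new features:
+
+ 1. **Stronger Performance**: Drawing on the development experience of the first-generation ChatGLM model, we fully upgraded the base model of ChatGLM2-6B. ChatGLM2-6B uses the hybrid objective function of [GLM](https://github.com/THUDM/GLM) and was pre-trained on 1.4T bilingual (Chinese-English) tokens, followed by human preference alignment training. The [evaluation results](#evaluation-results) show that, compared with the first-generation model, ChatGLM2-6B achieves substantial improvements on datasets such as MMLU (+23%), CEval (+33%), GSM8K (+571%), and BBH (+60%), making it strongly competitive among open-source models of the same size.
+ 2. **Longer Context**: Using [FlashAttention](https://github.com/HazyResearch/flash-attention), we extended the context length of the base model from 2K in ChatGLM-6B to 32K and trained with an 8K context length during the dialogue stage, allowing more rounds of dialogue. However, the current version of ChatGLM2-6B has limited ability to understand single-round ultra-long documents, which we will focus on optimizing in future iterations.
+ 3. **More Efficient Inference**: Using [Multi-Query Attention](http://arxiv.org/abs/1911.02150), ChatGLM2-6B offers faster inference and lower GPU memory usage: under the official implementation, inference is 42% faster than the first generation, and under INT4 quantization the dialogue length supported by 6 GB of GPU memory increases from 1K to 8K.
+ 4. **More Open License**: The ChatGLM2-6B weights are **fully open** to academic research and, with official written permission, are also **permitted for commercial use**. If you find our open-source model useful for your business, we welcome donations toward the development of the next-generation model, ChatGLM3.
+
+ -----
+
+ The open-source ChatGLM2-6B model is intended to advance large-model technology together with the open-source community. We earnestly ask developers and all users to abide by the [open-source license](MODEL_LICENSE), and not to use the open-source model, the code, or any derivatives of the open-source project for any purpose that may harm the nation or society, or for any service that has not undergone safety assessment and registration. **At present, the project team has not developed any applications based on ChatGLM2-6B, including web, Android, Apple iOS, and Windows App applications.**
+
+ Although we did our best to ensure the compliance and accuracy of the data at every stage of training, the ChatGLM2-6B model is relatively small and is subject to probabilistic randomness, so the accuracy of its output cannot be guaranteed and the model is easily misled. **This project assumes no risk or responsibility for data security or public-opinion risks arising from the open-source model and code, or for any misleading, abuse, dissemination, or improper use of the model.**
+
+ ## News
+ **[2023/07/04]** Released P-Tuning v2 and full-parameter fine-tuning scripts; see [P-Tuning](./ptuning).
+
+ ## Related Projects
+ Open-source projects that accelerate ChatGLM2:
+ * [fastllm](https://github.com/ztxz16/fastllm/): A cross-platform accelerated inference solution. Batched inference on a single GPU can reach 10,000+ tokens per second, and it runs in real time on phones with as little as 3 GB of memory (about 4~5 tokens/s on a Snapdragon 865).
+ * [chatglm.cpp](https://github.com/li-plus/chatglm.cpp): A quantized, accelerated CPU inference solution similar to llama.cpp, enabling real-time chat on a Mac laptop.
+
+ ## Evaluation Results
+ We selected several typical Chinese and English datasets for evaluation. Below are the results of the ChatGLM2-6B model on [MMLU](https://github.com/hendrycks/test) (English), [C-Eval](https://cevalbenchmark.com/static/leaderboard.html) (Chinese), [GSM8K](https://github.com/openai/grade-school-math) (Mathematics), and [BBH](https://github.com/suzgunmirac/BIG-Bench-Hard) (English). A script for evaluating on C-Eval is provided in [evaluation](./evaluation/README.md).
+
+ ### MMLU
+
+ | Model | Average | STEM | Social Sciences | Humanities | Others |
+ | ----- | ----- | ---- | ----- | ----- | ----- |
+ | ChatGLM-6B | 40.63 | 33.89 | 44.84 | 39.02 | 45.71 |
+ | ChatGLM2-6B (base) | 47.86 | 41.20 | 54.44 | 43.66 | 54.46 |
+ | ChatGLM2-6B | 45.46 | 40.06 | 51.61 | 41.23 | 51.24 |
+
+ > The chat model is evaluated with zero-shot CoT (Chain-of-Thought), and the base model with few-shot answer-only.
+
+ ### C-Eval
+
+ | Model | Average | STEM | Social Sciences | Humanities | Others |
+ | ----- | ---- | ---- | ----- | ----- | ----- |
+ | ChatGLM-6B | 38.9 | 33.3 | 48.3 | 41.3 | 38.0 |
+ | ChatGLM2-6B (base) | 51.7 | 48.6 | 60.5 | 51.3 | 49.8 |
+ | ChatGLM2-6B | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
+
+ > The chat model is evaluated with zero-shot CoT, and the base model with few-shot answer-only.
+
+ ### GSM8K
+
+ | Model | Accuracy | Accuracy (Chinese)* |
+ | ----- | ----- | ----- |
+ | ChatGLM-6B | 4.82 | 5.85 |
+ | ChatGLM2-6B (base) | 32.37 | 28.95 |
+ | ChatGLM2-6B | 28.05 | 20.45 |
+
+ > All models are evaluated with few-shot CoT; the CoT prompts are from http://arxiv.org/abs/2201.11903
+ >
+ > \* We translated 500 GSM8K questions and the CoT prompts with a translation API and proofread them manually.
+
+
+ ### BBH
+
+ | Model | Accuracy |
+ | ----- | ----- |
+ | ChatGLM-6B | 18.73 |
+ | ChatGLM2-6B (base) | 33.68 |
+ | ChatGLM2-6B | 30.00 |
+
+ > All models are evaluated with few-shot CoT; the CoT prompts are from https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/cot-prompts
+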
+ ## Inference Performance
+ ChatGLM2-6B uses [Multi-Query Attention](http://arxiv.org/abs/1911.02150) to speed up generation. The average speed when generating 2000 characters compares as follows:
+
+ | Model | Inference speed (characters/s) |
+ | ---- | ----- |
+ | ChatGLM-6B | 31.49 |
+ | ChatGLM2-6B | 44.62 |
+
+ > Measured with the official implementation, batch size = 1, max length = 2048, bf16 precision, on an A100-SXM4-80G with PyTorch 2.0.1.
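
The 42% speedup quoted in the introduction follows from the two figures in the speed table. A minimal worked check, using only the numbers above (the function name is an illustrative choice):

```python
# Speeds in characters per second, copied from the table above.
SPEED_CHATGLM_6B = 31.49
SPEED_CHATGLM2_6B = 44.62


def relative_speedup_percent(old: float, new: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


speedup = relative_speedup_percent(SPEED_CHATGLM_6B, SPEED_CHATGLM2_6B)
print(f"{speedup:.1f}%")  # prints "41.7%", which rounds to the quoted 42%
```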
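+
+ Multi-Query Attention also reduces the GPU memory used by the KV Cache during generation. In addition, ChatGLM2-6B uses a Causal Mask for dialogue training, so the KV Cache from previous rounds can be reused in multi-turn dialogue, further reducing GPU memory usage. As a result, when running INT4-quantized inference on a GPU with 6 GB of memory, the first-generation ChatGLM-6B runs out of memory after generating at most 1119 characters, whereas ChatGLM2-6B can generate at least 8192 characters.
+
+ | **Quantization level** | **Minimum GPU memory to encode length 2048** | **Minimum GPU memory to generate length 8192** |
+ | -------------- |---------------------|---------------------|
+ | FP16 / BF16 | 13.1 GB | 12.8 GB |
+ | INT8 | 8.2 GB | 8.1 GB |
+ | INT4 | 5.5 GB | 5.1 GB |
+
+ > ChatGLM2-6B uses `torch.nn.functional.scaled_dot_product_attention`, introduced in PyTorch 2.0, for efficient Attention computation. With an older PyTorch version it falls back to a naive Attention implementation, and GPU memory usage may be higher than shown in the table above.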
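
To see why sharing key/value heads shrinks the KV Cache, a back-of-the-envelope estimate can help. The model dimensions below (28 layers, 32 query heads, 2 shared key/value groups, head size 128, 2 bytes per value) are assumptions chosen for illustration and are not stated in this README; check the model's configuration before relying on them. The estimate also ignores weights and activations, so it does not reproduce the table above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the cached keys and values for one sequence."""
    # Keys and values are both cached, hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed, illustrative dimensions (not taken from this README).
NUM_LAYERS = 28
NUM_QUERY_HEADS = 32
NUM_KV_GROUPS = 2
HEAD_DIM = 128
SEQ_LEN = 8192

# One key/value head per query head (standard multi-head attention).
full = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# A small number of key/value heads shared by all query heads.
shared = kv_cache_bytes(NUM_LAYERS, NUM_KV_GROUPS, HEAD_DIM, SEQ_LEN)

print(f"per-head KV cache: {full / 2**30:.2f} GiB")    # prints "per-head KV cache: 3.50 GiB"
print(f"shared KV cache:   {shared / 2**30:.2f} GiB")  # prints "shared KV cache:   0.22 GiB"
print(f"reduction factor:  {full // shared}x")         # prints "reduction factor:  16x"
```

Under these assumptions the cache for an 8192-token sequence drops from several GiB to a few hundred MiB, which is consistent with the longer generation length reported for a 6 GB GPU.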
+
+ We also tested the effect of quantization on model performance. The results show that the impact is within an acceptable range.
+
+ | Quantization level | Accuracy (MMLU) | Accuracy (C-Eval dev) |
+ | ----- | ----- |-----------------------|
+ | BF16 | 45.47 | 53.57 |
+ | INT4 | 43.13 | 50.30 |
+
+
+
+ ## ChatGLM2-6B Examples
+
+ Compared with the first-generation model, ChatGLM2-6B has improved along several dimensions. Below are some comparison examples. More possibilities with ChatGLM2-6B are waiting for you to explore and discover!
+
+ <details><summary><b>Mathematics and Logic</b></summary>
+
+ ![](resources/math.png)
+
+ </details>
+
+ <details><summary><b>Knowledge Reasoning</b></summary>
+
+ ![](resources/knowledge.png)
+
+ </details>
+
+ <details><summary><b>Long Document Understanding</b></summary>
+
+ ![](resources/long-context.png)
+
+ </details>
+
+ ## Getting Started
+ ### Environment Setup
+ First, download this repository:
+ ```shell
+ git clone https://github.com/THUDM/ChatGLM2-6B
+ cd ChatGLM2-6B
+ ```
+
+ Then install the dependencies with pip: `pip install -r requirements.txt`. The recommended `transformers` version is `4.30.2`, and `torch` 2.0 or later is recommended for the best inference performance.
+
+ ### Usage
+
+ The ChatGLM2-6B model can be called to generate dialogue with the following code:
+
+ ```python
+ >>> from transformers import AutoTokenizer, AutoModel
+ >>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+ >>> model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, device='cuda')
+ >>> model = model.eval()
+ >>> response, history = model.chat(tokenizer, "你好", history=[])
+ >>> print(response)
+ 你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。
+ >>> response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
+ >>> print(response)
+ 晚上睡不着可能会让你感到焦虑或不舒服,但以下是一些可以帮助你入睡的方法:
+
+ 1. 制定规律的睡眠时间表:保持规律的睡眠时间表可以帮助你建立健康的睡眠习惯,使你更容易入睡。尽量在每天的相同时间上床,并在同一时间起床。
+ 2. 创造一个舒适的睡眠环境:确保睡眠环境舒适,安静,黑暗且温度适宜。可以使用舒适的床上用品,并保持房间通风。
+ 3. 放松身心:在睡前做些放松的活动,例如泡个热水澡,听些轻柔的音乐,阅读一些有趣的书籍等,有助于缓解紧张和焦虑,使你更容易入睡。
+ 4. 避免饮用含有咖啡因的饮料:咖啡因是一种刺激性物质,会影响你的睡眠质量。尽量避免在睡前饮用含有咖啡因的饮料,例如咖啡,茶和可乐。
+ 5. 避免在床上做与睡眠无关的事情:在床上做些与睡眠无关的事情,例如看电影,玩游戏或工作等,可能会干扰你的睡眠。
+ 6. 尝试呼吸技巧:深呼吸是一种放松技巧,可以帮助你缓解紧张和焦虑,使你更容易入睡。试着慢慢吸气,保持几秒钟,然后缓慢呼气。
+
+ 如果这些方法无法帮助你入睡,你可以考虑咨询医生或睡眠专家,寻求进一步的建议。
+ ```
+
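+ #### Loading the Model Locally
+ The code above lets `transformers` download the model implementation and weights automatically. The complete model implementation is on the [Hugging Face Hub](https://huggingface.co/THUDM/chatglm2-6b). If your network connection is poor, downloading the weights may take a long time or even fail. In that case, you can first download the model to local storage and then load it from there.
+
+ To download the model from the Hugging Face Hub, first [install Git LFS](https://docs.github.com/zh/repositories/working-with-files/managing-large-files/installing-git-large-file-storage), then run
+ ```Shell
+ git clone https://huggingface.co/THUDM/chatglm2-6b
+ ```
+
+ If downloading the checkpoint from the Hugging Face Hub is slow, you can download only the model implementation
+ ```Shell
+ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/THUDM/chatglm2-6b
+ ```
+ Then manually download the model weight files from [here](https://cloud.tsinghua.edu.cn/d/674208019e314311ab5c/) and place them in the local `chatglm2-6b` directory, replacing the existing files.
+
+
+ After the model is downloaded, replace `THUDM/chatglm2-6b` in the code above with the path to your local `chatglm2-6b` folder to load the model locally.
+
+ The model implementation is still changing. To pin the implementation for compatibility, add the `revision="v1.0"` argument to the `from_pretrained` call. `v1.0` is currently the latest version; see the [Change Log](https://huggingface.co/THUDM/chatglm2-6b#change-log) for the full list of versions.
+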
+ ### Web Demo
+
+ ![web-demo](resources/web-demo.gif)
+
+ First install Gradio: `pip install gradio`, then run [web_demo.py](web_demo.py) from the repository:
+
+ ```shell
+ python web_demo.py
+ ```
+
+ The program starts a web server and prints its address. Open that address in a browser to use the demo.
+ > By default it launches with `share=False`, so no public link is generated. If you need public access, change it to `share=True`.
+ >
+
+ Thanks to [@AdamBear](https://github.com/AdamBear) for implementing a Streamlit-based web demo, `web_demo2.py`. To use it, first install the following additional dependencies:
+ ```shell
+ pip install streamlit streamlit-chat
+ ```
+ Then run it with the following command:
+ ```shell
+ streamlit run web_demo2.py
+ ```
+ In our tests, the Streamlit-based web demo runs more smoothly when the input prompt is long.
+
+ ### Command-Line Demo
+
+ ![cli-demo](resources/cli-demo.png)
+
+ Run [cli_demo.py](cli_demo.py) from the repository:
+
+ ```shell
+ python cli_demo.py
+ ```
+
+ The program runs an interactive conversation in the command line. Type an instruction and press Enter to generate a reply; type `clear` to clear the conversation history, and `stop` to exit the program.
+
+ ### API Deployment
+ First install the additional dependencies with `pip install fastapi uvicorn`, then run [api.py](api.py) from the repository:
+ ```shell
+ python api.py
+ ```
+ By default the API is served on local port 8000 and is called with the POST method
+ ```shell
+ curl -X POST "http://127.0.0.1:8000" \
+      -H 'Content-Type: application/json' \
+      -d '{"prompt": "你好", "history": []}'
+ ```
+ The returned value is
+ ```shell
+ {
+   "response":"你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。",
+   "history":[["你好","你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。"]],
+   "status":200,
+   "time":"2023-03-23 21:38:40"
+ }
+ ```
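
The same request can be issued from Python. The sketch below uses only the standard library and assumes the request and response shapes shown in the curl example above (`prompt` and `history` in, `response` and `history` out); the helper names `build_chat_request` and `chat` are hypothetical and not part of `api.py`.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000"  # default address from the text above


def build_chat_request(prompt, history=None, url=API_URL):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {"prompt": prompt, "history": history or []}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, history=None, url=API_URL, timeout=60):
    """Send one turn to a running server and return (response, updated_history)."""
    request = build_chat_request(prompt, history, url)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        body = json.loads(reply.read().decode("utf-8"))
    return body["response"], body["history"]


# Usage, with the server from `python api.py` already running:
#   answer, history = chat("你好")
#   answer, history = chat("晚上睡不着应该怎么办", history)
```

Passing the returned `history` back into the next call is what carries the conversation across turns.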
+ Thanks to [@hiyouga]() for implementing streaming API deployment in the OpenAI format, which can serve as the backend for any ChatGPT-based application, such as [ChatGPT-Next-Web](https://github.com/Yidadaa/ChatGPT-Next-Web). Deploy it by running [openai_api.py](openai_api.py) from the repository:
+ ```shell
+ python openai_api.py
+ ```
+ Example code for calling the API:
+ ```python
+ import openai
+ if __name__ == "__main__":
+     openai.api_base = "http://localhost:8000/v1"
+     openai.api_key = "none"
+     for chunk in openai.ChatCompletion.create(
+         model="chatglm2-6b",
+         messages=[
+             {"role": "user", "content": "你好"}
+         ],
+         stream=True
+     ):
+         if hasattr(chunk.choices[0].delta, "content"):
+             print(chunk.choices[0].delta.content, end="", flush=True)
+ ```
+
+
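+ ## Low-Cost Deployment
+
+ ### Model Quantization
+
+ By default, the model is loaded in FP16 precision, and running the code above requires about 13 GB of GPU memory. If your GPU memory is limited, you can try loading the model with quantization:
+
+ ```python
+ # Adjust as needed; only 4-bit and 8-bit quantization are currently supported
+ model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(8).cuda()
+ ```
+
+ Quantization causes some loss in model performance. In our tests, ChatGLM2-6B can still generate natural, fluent text under 4-bit quantization.
+
+ If your memory is insufficient, you can load the quantized model directly:
+ ```python
+ model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4",trust_remote_code=True).cuda()
+ ```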
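
As a rough guide to which option to try first, the minimum-memory figures from the table in the inference performance section can be turned into a lookup. This is only a sketch: the helper `pick_precision` is hypothetical, and it uses the encoding column of that table as a lower bound, so real usage may need more headroom.

```python
# Minimum GPU memory (GB) to encode a length-2048 input, copied from the
# memory table in the inference performance section.
MIN_MEMORY_GB = {"FP16": 13.1, "INT8": 8.2, "INT4": 5.5}


def pick_precision(available_gb):
    """Return the highest precision whose minimum requirement fits, else None."""
    for name in ("FP16", "INT8", "INT4"):  # ordered from highest precision down
        if available_gb >= MIN_MEMORY_GB[name]:
            return name
    return None


for gb in (24, 12, 6, 4):
    print(gb, pick_precision(gb))
# prints:
# 24 FP16
# 12 INT8
# 6 INT4
# 4 None
```

A result of `INT8` or `INT4` corresponds to passing 8 or 4 to `.quantize(...)` in the snippet above; `None` suggests falling back to CPU inference.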
+
+ <!-- The weight files of the quantized model can also be downloaded manually from [here](https://cloud.tsinghua.edu.cn/d/674208019e314311ab5c/). -->
+
+ ### CPU Deployment
+
+ If you do not have a GPU, you can also run inference on the CPU, although it will be slower. Use the following (about 32 GB of memory is required):
+ ```python
+ model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).float()
+ ```
+ If your memory is insufficient, you can also use the quantized model
+ ```python
+ model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4",trust_remote_code=True).float()
+ ```
+ Running the quantized model on the CPU requires `gcc` and `openmp`, which most Linux distributions install by default. On Windows, check `openmp` when installing [TDM-GCC](https://jmeubank.github.io/tdm-gcc/). The `gcc` version in our Windows test environment is `TDM-GCC 10.3.0`, and on Linux it is `gcc 11.3.0`. On macOS, see [Q1](FAQ.md#q1).
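+
+ ### Mac Deployment
+
+ For Macs with Apple Silicon or an AMD GPU, the MPS backend can run ChatGLM2-6B on the GPU. Follow Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly (the correct version number should look like 2.x.x.dev2023xxxx, not 2.x.x).
+
+ Currently, only loading the model from a local path is supported on macOS (see the local-loading instructions earlier in this README). Change the model loading in your code to load from the local path and use the mps backend:
+ ```python
+ model = AutoModel.from_pretrained("your local path", trust_remote_code=True).to('mps')
+ ```
+
+ Loading the half-precision ChatGLM2-6B model requires about 13 GB of memory. On machines with less memory (such as a MacBook Pro with 16 GB), the system falls back to virtual memory on disk when free memory runs short, which slows inference severely.
+ In that case, you can use the quantized model chatglm2-6b-int4. Because the GPU quantization kernels are written in CUDA, they cannot be used on macOS, and inference can only run on the CPU.
+ To make full use of CPU parallelism, you also need to [install OpenMP separately](FAQ.md#q1).
+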
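+ ### Multi-GPU Deployment
+ If you have multiple GPUs but none has enough memory to hold the full model, you can split the model across them. First install accelerate: `pip install accelerate`, then load the model as follows:
+ ```python
+ from utils import load_model_on_gpus
+ model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
+ ```
+ This deploys the model on two GPUs for inference. Change `num_gpus` to the number of GPUs you want to use. The model is split evenly by default; you can also pass the `device_map` argument to specify the placement yourself.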
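
To illustrate what an even split looks like, the sketch below builds a mapping from module names to GPU indices, the general shape that accelerate-style device maps take. Everything specific in it is an assumption made for illustration: the helper `even_device_map`, the layer count of 28, and the module names are hypothetical and may not match what `utils.py` in the repository actually uses.

```python
def even_device_map(num_layers, num_gpus, layer_prefix, shared_modules):
    """Assign layers to GPUs in contiguous, near-equal blocks."""
    # Modules outside the layer stack are kept on GPU 0 in this sketch.
    device_map = {name: 0 for name in shared_modules}
    for index in range(num_layers):
        # Integer arithmetic yields blocks 0..k-1 on GPU 0, k..2k-1 on GPU 1, and so on.
        device_map[f"{layer_prefix}.{index}"] = index * num_gpus // num_layers
    return device_map


# Hypothetical module names and layer count, for illustration only.
device_map = even_device_map(
    num_layers=28,
    num_gpus=2,
    layer_prefix="transformer.encoder.layers",
    shared_modules=("transformer.embedding", "transformer.output_layer"),
)
print(device_map["transformer.encoder.layers.13"])  # prints 0
print(device_map["transformer.encoder.layers.14"])  # prints 1
```

Placing every non-layer module on GPU 0 makes that device carry slightly more than its share, so a hand-written map may want to give GPU 0 fewer layers.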
+
+ ## License
+
+ The code in this repository is open-sourced under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, while use of the ChatGLM2-6B model weights must comply with the [Model License](MODEL_LICENSE). The ChatGLM2-6B weights are **fully open** to academic research and, with official written permission, are also **permitted for commercial use**. If you find our open-source model useful for your business, we welcome donations toward the development of the next-generation model, ChatGLM3. To apply for a commercial license or to donate, please contact [yiwen.xu@zhipuai.cn](mailto:yiwen.xu@zhipuai.cn).
+
+
+ ## Citation
+
+ If you find our work helpful, please consider citing the following papers. The ChatGLM2-6B paper will be released soon, so stay tuned~
+
+ ```
+ @article{zeng2022glm,
+   title={Glm-130b: An open bilingual pre-trained model},
+   author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
+   journal={arXiv preprint arXiv:2210.02414},
+   year={2022}
+ }
+ ```
+ ```
+ @inproceedings{du2022glm,
+   title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
+   author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
+   booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+   pages={320--335},
+   year={2022}
+ }
+ ```
README_EN.md ADDED
@@ -0,0 +1,260 @@
+ <p align="center">
+ 🤗 <a href="https://huggingface.co/THUDM/chatglm2-6b" target="_blank">HF Repo</a> • 🐦 <a href="https://twitter.com/thukeg" target="_blank">Twitter</a> • 📃 <a href="https://arxiv.org/abs/2103.10360" target="_blank">[GLM@ACL 22]</a> <a href="https://github.com/THUDM/GLM" target="_blank">[GitHub]</a> • 📃 <a href="https://arxiv.org/abs/2210.02414" target="_blank">[GLM-130B@ICLR 23]</a> <a href="https://github.com/THUDM/GLM-130B" target="_blank">[GitHub]</a> <br>
+ </p>
+ <p align="center">
+ 👋 Join our <a href="https://join.slack.com/t/chatglm/shared_invite/zt-1y7pqoloy-9b1g6T6JjA8J0KxvUjbwJw" target="_blank">Slack</a> and <a href="resources/WECHAT.md" target="_blank">WeChat</a>
+ </p>
+
+ ## Introduction
+
+ ChatGLM**2**-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B). It retains the smooth conversation flow and low deployment threshold of the first-generation model, while introducing the following new features:
+
+ 1. **Stronger Performance**: Based on the development experience of the first-generation ChatGLM model, we have fully upgraded the base model of ChatGLM2-6B. ChatGLM2-6B uses the hybrid objective function of [GLM](https://github.com/THUDM/GLM), and has undergone pre-training with 1.4T bilingual tokens and human preference alignment training. The [evaluation results](README.md#evaluation-results) show that, compared to the first-generation model, ChatGLM2-6B has achieved substantial improvements in performance on datasets like MMLU (+23%), CEval (+33%), GSM8K (+571%), BBH (+60%), showing strong competitiveness among models of the same size.
+ 2. **Longer Context**: Based on the [FlashAttention](https://github.com/HazyResearch/flash-attention) technique, we have extended the context length of the base model from 2K in ChatGLM-6B to 32K, and trained with a context length of 8K during dialogue alignment, allowing for more rounds of dialogue. However, the current version of ChatGLM2-6B has limited understanding of single-round ultra-long documents, which we will focus on optimizing in future iterations.
+ 3. **More Efficient Inference**: Based on the [Multi-Query Attention](http://arxiv.org/abs/1911.02150) technique, ChatGLM2-6B offers faster inference and lower GPU memory usage: under the official implementation, the inference speed has increased by 42% compared to the first generation; under INT4 quantization, the dialogue length supported by 6 GB of GPU memory has increased from 1K to 8K.
+ 4. **More Open License**: The weights of ChatGLM2-6B are **fully open** to academic research, and with our official written permission, the weights of ChatGLM2-6B are also **permitted for commercial use**. If you find our open-source model useful for your business, we welcome your donation towards the development of the next-generation model ChatGLM3.
+
+ -----
+
+ The open-source ChatGLM2-6B is intended to promote the development of LLMs together with the open-source community. We earnestly request developers and everyone to abide by the [open-source license](MODEL_LICENSE). Do not use the open-source model, code, or any derivatives from the open-source project for any purposes that may harm nations or societies, or for any services that have not undergone safety assessments and legal approval. **At present, our project team has not developed any applications based on ChatGLM2-6B, including web, Android, Apple iOS, and Windows App applications.**
+
+ Although we strive to ensure the compliance and accuracy of the data at each stage of training, the ChatGLM2-6B model is relatively small and subject to probabilistic randomness, so the accuracy of its output cannot be guaranteed and the model can easily be misled. **Our project does not assume any risks or responsibilities arising from data security, public opinion risks, or any instances of the model being misled, abused, disseminated, or improperly used due to the open-source model and code.**
+
+ ## Projects
+ Open source projects that accelerate ChatGLM2:
+ * [chatglm.cpp](https://github.com/li-plus/chatglm.cpp): Real-time CPU inference on a MacBook accelerated by quantization, similar to llama.cpp.
+
+ ## Evaluation
+ We selected some typical Chinese and English datasets for evaluation. Below are the evaluation results of the ChatGLM2-6B model on [MMLU](https://github.com/hendrycks/test) (English), [C-Eval](https://cevalbenchmark.com/static/leaderboard.html) (Chinese), [GSM8K](https://github.com/openai/grade-school-math) (Mathematics), [BBH](https://github.com/suzgunmirac/BIG-Bench-Hard) (English).
+
+ ### MMLU
+
+ | Model | Average | STEM | Social Sciences | Humanities | Others |
+ | ----- | ----- | ---- | ----- | ----- | ----- |
+ | ChatGLM-6B | 40.63 | 33.89 | 44.84 | 39.02 | 45.71 |
+ | ChatGLM2-6B (base) | 47.86 | 41.20 | 54.44 | 43.66 | 54.46 |
+ | ChatGLM2-6B | 45.46 | 40.06 | 51.61 | 41.23 | 51.24 |
+
+ > Chat-aligned version is evaluated under zero-shot CoT (Chain-of-Thought), and Base version is evaluated under few-shot answer-only
+
+ ### C-Eval
+
+ | Model | Average | STEM | Social Sciences | Humanities | Others |
+ | ----- | ---- | ---- | ----- | ----- | ----- |
+ | ChatGLM-6B | 38.9 | 33.3 | 48.3 | 41.3 | 38.0 |
+ | ChatGLM2-6B (base) | 51.7 | 48.6 | 60.5 | 51.3 | 49.8 |
+ | ChatGLM2-6B | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
+
+ > Chat-aligned version is evaluated under zero-shot CoT (Chain-of-Thought), and Base version is evaluated under few-shot answer-only
+
+ ### GSM8K
+
+ | Model | Accuracy | Accuracy (Chinese)* |
+ | ----- | ----- | ----- |
+ | ChatGLM-6B | 4.82 | 5.85 |
+ | ChatGLM2-6B (base) | 32.37 | 28.95 |
+ | ChatGLM2-6B | 28.05 | 20.45 |
+
+ > All model versions are evaluated under few-shot CoT, and CoT prompts are from http://arxiv.org/abs/2201.11903
+ >
+ > \* We translate a 500-query subset of GSM8K and its corresponding CoT prompts using machine translation API and subsequent human proofreading.
+
+
+ ### BBH
+
+ | Model | Accuracy |
+ | ----- | ----- |
+ | ChatGLM-6B | 18.73 |
+ | ChatGLM2-6B (base) | 33.68 |
+ | ChatGLM2-6B | 30.00 |
+
+ > All model versions are evaluated under few-shot CoT, and CoT prompts are from https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/cot-prompts
+
+ ## Inference Efficiency
+ ChatGLM2-6B employs [Multi-Query Attention](http://arxiv.org/abs/1911.02150) to improve inference speed. The table below compares the average speed when generating 2000 tokens.
+
+
+ | Model | Inference Speed (tokens/s) |
+ | ---- | ----- |
+ | ChatGLM-6B | 31.49 |
+ | ChatGLM2-6B | 44.62 |
+
+ > Measured with our official implementation: batch size = 1, max length = 2048, bf16 precision, on an A100-SXM-80G with PyTorch 2.0.
+
+ Multi-Query Attention also reduces the GPU memory used by the KV Cache during inference. In addition, ChatGLM2-6B is trained on dialogue with a causal mask, so the KV Cache from previous rounds can be reused in a continuing dialogue, which further reduces GPU memory usage. As a result, with INT4 quantization on a 6GB GPU, the first-generation ChatGLM-6B can generate at most 1119 tokens before running out of memory, while ChatGLM2-6B can generate at least 8192 tokens.
+
+ | **Quantization** | **Encoding 2048 Tokens** | **Decoding 8192 Tokens** |
+ | -------------- | --------------------- | --------------- |
+ | FP16 / BF16 | 13.1 GB | 12.8 GB |
+ | INT8 | 8.2 GB | 8.1 GB |
+ | INT4 | 5.5 GB | 5.1 GB |
+
+ > ChatGLM2-6B takes advantage of `torch.nn.functional.scaled_dot_product_attention`, introduced in PyTorch 2.0, for efficient attention computation. With an older PyTorch version it falls back to the naive attention implementation, which may use more GPU memory than shown in the table above.
+
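A rough estimate helps to see why Multi-Query Attention shrinks the KV Cache. The sketch below computes the cache size for one sequence as 2 (keys and values) × layers × KV heads × head dimension × tokens × bytes per element. The configuration values (28 layers, 32 attention heads, 2 KV groups, head dimension 128) are assumptions that should be checked against the model's `config.json`, and the function name is purely illustrative.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for one sequence (keys + values)."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Assumed configuration values; verify them against config.json before relying on the numbers.
NUM_LAYERS, NUM_HEADS, NUM_KV_GROUPS, HEAD_DIM = 28, 32, 2, 128
TOKENS = 8192

# Every attention head keeps its own keys and values (standard multi-head attention).
mha = kv_cache_bytes(TOKENS, NUM_LAYERS, NUM_HEADS, HEAD_DIM)
# Only NUM_KV_GROUPS key/value heads are stored and shared by all query heads.
mqa = kv_cache_bytes(TOKENS, NUM_LAYERS, NUM_KV_GROUPS, HEAD_DIM)

print(f"multi-head cache:  {mha / 2**30:.2f} GiB")
print(f"multi-query cache: {mqa / 2**30:.2f} GiB")
print(f"reduction factor:  {mha // mqa}x")
```

Under these assumptions the cache scales with the number of stored key/value heads, so sharing 2 groups across 32 query heads cuts it by a factor of 16 in FP16. The measured figures in the table also include weights and activations, so they do not shrink by the same factor.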
+ We also tested the impact of quantization on model quality. Relative to BF16, INT4 quantization lowers accuracy by 2.34 points on MMLU and by 3.27 points on C-Eval (dev), which we consider an acceptable range.
+
+ | Quantization | Accuracy (MMLU) | Accuracy (C-Eval dev) |
+ | ----- | ----- | ----- |
+ | BF16 | 45.47 | 53.57 |
+ | INT4 | 43.13 | 50.30 |
+
+
+ ## ChatGLM2-6B Examples
+
+ Compared to the first-generation model, ChatGLM2-6B has improved in multiple dimensions. Below are some comparison examples. More possibilities with ChatGLM2-6B are waiting for you to explore and discover!
+
+ <details><summary><b>Mathematics and Logic</b></summary>
+
+ ![](examples/math.png)
+
+ </details>
+
+ <details><summary><b>Knowledge Reasoning</b></summary>
+
+ ![](examples/knowledge.png)
+
+ </details>
+
+ <details><summary><b>Long Document Understanding</b></summary>
+
+ ![](examples/long-context.png)
+
+ </details>
+
+ ## Getting Started
+ ### Environment Setup
+
+ Install dependencies with pip: `pip install -r requirements.txt`. We recommend `transformers` version `4.27.1` and `torch` 2.0 or higher for the best inference performance.
+
+ We provide a web page demo and a command line demo. You need to download this repository to use them:
+
+ ```shell
+ git clone https://github.com/THUDM/ChatGLM2-6B
+ cd ChatGLM2-6B
+ ```
+
+ ### Usage
+
+ Generate dialogue with the following code:
+
+ ```python
+ >>> from transformers import AutoTokenizer, AutoModel
+ >>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+ >>> model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, device='cuda').eval()
+ >>> response, history = model.chat(tokenizer, "你好", history=[])
+ >>> print(response)
+ 你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。
+ >>> response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
+ >>> print(response)
+ 晚上睡不着可能会让你感到焦虑或不舒服,但以下是一些可以帮助你入睡的方法:
+
+ 1. 制定规律的睡眠时间表:保持规律的睡眠时间表可以帮助你建立健康的睡眠习惯,使你更容易入睡。尽量在每天的相同时间上床,并在同一时间起床。
+ 2. 创造一个舒适的睡眠环境:确保睡眠环境舒适,安静,黑暗且温度适宜。可以使用舒适的床上用品,并保持房间通风。
+ 3. 放松身心:在睡前做些放松的活动,例如泡个热水澡,听些轻柔的音乐,阅读一些有趣的书籍等,有助于缓解紧张和焦虑,使你更容易入睡。
+ 4. 避免饮用含有咖啡因的饮料:咖啡因是一种刺激性物质,会影响你的睡眠质量。尽量避免在睡前饮用含有咖啡因的饮料,例如咖啡,茶和可乐。
+ 5. 避免在床上做与睡眠无关的事情:在床上做些与睡眠无关的事情,例如看电影,玩游戏或工作等,可能会干扰你的睡眠。
+ 6. 尝试呼吸技巧:深呼吸是一种放松技巧,可以帮助你缓解紧张和焦虑,使你更容易入睡。试着慢慢吸气,保持几秒钟,然后缓慢呼气。
+
+ 如果这些方法无法帮助你入睡,你可以考虑咨询医生或睡眠专家,寻求进一步的建议。
+ ```
+ The implementation of the model is still under development. To pin the model implementation and ensure compatibility, add the `revision="v1.0"` parameter to the `from_pretrained` call. `v1.0` is the latest version number. For a complete list of versions, see the [Change Log](https://huggingface.co/THUDM/chatglm2-6b#change-log).
+
+ ### Web Demo
+
+ ![web-demo](resources/web-demo.gif)
+
+ Install Gradio with `pip install gradio`, then run [web_demo.py](web_demo.py):
+
+ ```shell
+ python web_demo.py
+ ```
+
+ The program runs a web server and prints its URL. Open the URL in a browser to use the web demo.
+
+ ### CLI Demo
+
+ ![cli-demo](resources/cli-demo.png)
+
+ Run [cli_demo.py](cli_demo.py) in the repo:
+
+ ```shell
+ python cli_demo.py
+ ```
+
+ The command runs an interactive program in the shell. Type your instruction and press Enter to generate a response. Type `clear` to clear the dialogue history and `stop` to terminate the program.
+
+ ## API Deployment
+ First install the additional dependencies with `pip install fastapi uvicorn`. Then run [api.py](api.py) in the repo:
+ ```shell
+ python api.py
+ ```
+ By default, the API runs on port `8000` of the local machine. You can call it with:
+ ```shell
+ curl -X POST "http://127.0.0.1:8000" \
+      -H 'Content-Type: application/json' \
+      -d '{"prompt": "你好", "history": []}'
+ ```
+ The returned value is:
+ ```json
+ {
+   "response":"你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。",
+   "history":[["你好","你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。"]],
+   "status":200,
+   "time":"2023-03-23 21:38:40"
+ }
+ ```
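The same call can be made from Python. The following is a minimal sketch that uses only the standard library and assumes the server started by `python api.py` is reachable at `http://127.0.0.1:8000`; the helper names `build_chat_request` and `chat` are hypothetical and not part of the repository.

```python
import json
import urllib.request


def build_chat_request(prompt, history=None, url="http://127.0.0.1:8000"):
    """Build the POST request with the JSON body (prompt + history) shown in the curl example."""
    payload = {"prompt": prompt, "history": history or []}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, history=None, url="http://127.0.0.1:8000"):
    """Send one turn to the server and return (response, updated history)."""
    with urllib.request.urlopen(build_chat_request(prompt, history, url)) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"], body["history"]


# Usage (requires a running server):
#   reply, history = chat("你好")
#   reply, history = chat("晚上睡不着应该怎么办", history)
```

Passing the returned `history` back into the next call keeps the dialogue context, mirroring the `history` field in the JSON response above.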
+ ## Deployment
+
+ ### Quantization
+
+ By default, the model parameters are loaded in FP16 precision, which requires about 13GB of GPU memory. If your GPU memory is limited, you can load the model with quantization:
+
+ ```python
+ # Change according to your hardware. Only 4-bit and 8-bit quantization are supported for now.
+ model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(8).cuda()
+ ```
+
+ Quantization costs some accuracy on benchmarks, but in our tests ChatGLM2-6B still generates natural, fluent text under 4-bit quantization.
+
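One way to pick a quantization level is to compare free GPU memory with the footprints reported in the "Encoding 2048 Tokens" column of the Inference Efficiency table. The sketch below does exactly that; the function name and the `headroom_gb` safety margin are hypothetical, and the actual memory use depends on sequence length and PyTorch version.

```python
# GPU memory (GB) for encoding 2048 tokens, copied from the Inference Efficiency table.
# The key None stands for unquantized FP16 / BF16.
MEMORY_GB = {None: 13.1, 8: 8.2, 4: 5.5}


def choose_quantization_bits(free_gpu_gb, headroom_gb=0.5):
    """Return None (FP16), 8 or 4: the highest precision whose footprint plus headroom fits."""
    for bits in (None, 8, 4):
        if MEMORY_GB[bits] + headroom_gb <= free_gpu_gb:
            return bits
    raise ValueError(f"{free_gpu_gb} GB is below the INT4 footprint listed in the table")


for gb in (24, 10, 6):
    print(gb, "GB ->", choose_quantization_bits(gb))
```

If the result is `8` or `4`, pass it to `.quantize(bits)` as in the snippet above; if it is `None`, load the model without calling `.quantize`.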
+ ### CPU Deployment
+
+ If your computer has no GPU, you can also run inference on the CPU, but it is slower and requires about 32GB of memory:
+ ```python
+ model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).float()
+ ```
+
+ ### Inference on Mac
+
+ For Macs (and MacBooks) with Apple Silicon, the MPS backend can run ChatGLM2-6B on the GPU. First, follow Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly (the correct version number should be 2.1.0.dev2023xxxx, not 2.0.0).
+
+ Currently you must [load the model locally](README_en.md#load-the-model-locally) on macOS. Change the code to load the model from your local path and use the `mps` backend:
+ ```python
+ model = AutoModel.from_pretrained("your local path", trust_remote_code=True).to('mps')
+ ```
+
+ Loading the FP16 ChatGLM2-6B model requires about 13GB of memory. Machines with less free memory (such as a MacBook Pro with 16GB of memory) fall back to virtual memory on disk, which slows inference down severely.
+
+ ## License
+
+ The code of this repository is licensed under [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0). The use of the ChatGLM2-6B model weights is subject to the [Model License](MODEL_LICENSE). ChatGLM2-6B weights are **completely open** for academic research, and **commercial use** is also allowed after **obtaining official written permission**. If you find our open source model useful for your business, we welcome your donations towards the development of the next generation model, ChatGLM3. For related matters, please contact [yiwen.xu@zhipuai.cn](mailto:yiwen.xu@zhipuai.cn).
+
+ ## Citation
+
+ If you find our work useful, please consider citing the following papers. The technical report for ChatGLM2-6B will be out soon.
+
+ ```
+ @article{zeng2022glm,
+   title={Glm-130b: An open bilingual pre-trained model},
+   author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
+   journal={arXiv preprint arXiv:2210.02414},
+   year={2022}
+ }
+ ```
+ ```
+ @inproceedings{du2022glm,
+   title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
+   author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
+   booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+   pages={320--335},
+   year={2022}
+ }
+ ```
__pycache__/utils.cpython-310.pyc ADDED
Binary file (1.39 kB).
api.py ADDED
@@ -0,0 +1,60 @@
+ from fastapi import FastAPI, Request
+ from transformers import AutoTokenizer, AutoModel
+ import uvicorn, json, datetime
+ import torch
+
+ DEVICE = "cuda"
+ DEVICE_ID = "0"
+ CUDA_DEVICE = f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE
+
+
+ def torch_gc():  # release cached GPU memory after each request
+     if torch.cuda.is_available():
+         with torch.cuda.device(CUDA_DEVICE):
+             torch.cuda.empty_cache()
+             torch.cuda.ipc_collect()
+
+
+ app = FastAPI()
+
+
+ @app.post("/")
+ async def create_item(request: Request):
+     global model, tokenizer
+     json_post_raw = await request.json()
+     json_post = json.dumps(json_post_raw)
+     json_post_list = json.loads(json_post)
+     prompt = json_post_list.get('prompt')
+     history = json_post_list.get('history')
+     max_length = json_post_list.get('max_length')
+     top_p = json_post_list.get('top_p')
+     temperature = json_post_list.get('temperature')
+     response, history = model.chat(tokenizer,
+                                    prompt,
+                                    history=history,
+                                    max_length=max_length if max_length else 2048,  # defaults apply when a field is omitted
+                                    top_p=top_p if top_p else 0.7,
+                                    temperature=temperature if temperature else 0.95)
+     now = datetime.datetime.now()
+     time = now.strftime("%Y-%m-%d %H:%M:%S")
+     answer = {
+         "response": response,
+         "history": history,
+         "status": 200,
+         "time": time
+     }
+     log = "[" + time + "] " + 'prompt:"' + prompt + '", response:"' + repr(response) + '"'
+     print(log)
+     torch_gc()
+     return answer
+
+
+ if __name__ == '__main__':
+     tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+     model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
+     # Multi-GPU support: replace the two lines above with the three lines below and set num_gpus to your actual number of GPUs
+     # from utils import load_model_on_gpus
+     # tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+     # model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
+     model.eval()
+     uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
cli_demo.py ADDED
@@ -0,0 +1,60 @@
+ import os
+ import platform
+ import signal
+ from transformers import AutoTokenizer, AutoModel
+ import readline  # enables line editing for input(); not available on Windows
+
+ tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+ model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
+ # Multi-GPU support: replace the line above with the two lines below and set num_gpus to your actual number of GPUs
+ # from utils import load_model_on_gpus
+ # model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
+ model = model.eval()
+
+ os_name = platform.system()
+ clear_command = 'cls' if os_name == 'Windows' else 'clear'
+ stop_stream = False
+
+
+ def build_prompt(history):  # renders the welcome banner followed by the whole dialogue
+     prompt = "欢迎使用 ChatGLM2-6B 模型,输入内容即可进行对话,clear 清空对话历史,stop 终止程序"
+     for query, response in history:
+         prompt += f"\n\n用户:{query}"
+         prompt += f"\n\nChatGLM2-6B:{response}"
+     return prompt
+
+
+ def signal_handler(signal, frame):  # sets the flag that interrupts streaming; never registered via signal.signal() in this file
+     global stop_stream
+     stop_stream = True
+
+
+ def main():
+     past_key_values, history = None, []
+     global stop_stream
+     print("欢迎使用 ChatGLM2-6B 模型,输入内容即可进行对话,clear 清空对话历史,stop 终止程序")  # welcome banner: type to chat, "clear" resets the history, "stop" exits
+     while True:
+         query = input("\n用户:")
+         if query.strip() == "stop":
+             break
+         if query.strip() == "clear":
+             past_key_values, history = None, []
+             os.system(clear_command)
+             print("欢迎使用 ChatGLM2-6B 模型,输入内容即可进行对话,clear 清空对话历史,stop 终止程序")
+             continue
+         print("\nChatGLM:", end="")
+         current_length = 0
+         for response, history, past_key_values in model.stream_chat(tokenizer, query, history=history,
+                                                                     past_key_values=past_key_values,
+                                                                     return_past_key_values=True):
+             if stop_stream:
+                 stop_stream = False
+                 break
+             else:
+                 print(response[current_length:], end="", flush=True)  # print only the newly generated suffix
+                 current_length = len(response)
+         print("")
+
+
+ if __name__ == "__main__":
+     main()
evaluation/README.md ADDED
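@@ -0,0 +1,10 @@
+ First download the preprocessed C-Eval dataset from [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/e84444333b6d434ea7b0) and extract it into the `evaluation` directory. Then run:
+
+ ```shell
+ cd evaluation
+ python evaluate_ceval.py
+ ```
+
+ The script runs prediction on the C-Eval validation set and prints the accuracy. To get results on the test set, change `./CEval/val/**/*.jsonl` in the code to `./CEval/test/**/*.jsonl`, save the results in the format required by C-Eval, and submit them on the [official website](https://cevalbenchmark.com/).
+
+ The reported results were produced with an internal parallel evaluation framework, so your results may differ slightly.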
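The overall number printed at the end of `evaluate_ceval.py` is an average over all files weighted by the number of questions in each file, not a plain mean of the per-file accuracies. The sketch below reproduces that aggregation on made-up data; the file names and numbers are hypothetical and serve only as an illustration.

```python
def overall_accuracy(per_file):
    """per_file maps a file name to (accuracy, number_of_questions)."""
    total_correct = sum(acc * n for acc, n in per_file.values())
    total_count = sum(n for _, n in per_file.values())
    return total_correct / total_count


# Hypothetical per-file results, for illustration only.
results = {"subject_a.jsonl": (0.50, 20), "subject_b.jsonl": (0.80, 80)}

weighted = overall_accuracy(results)
unweighted = sum(acc for acc, _ in results.values()) / len(results)
print(f"weighted by question count: {weighted:.4f}")
print(f"plain mean of file scores:  {unweighted:.4f}")
```

Because larger files count for more, the weighted value can differ noticeably from the plain mean when the files have very different sizes.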
evaluation/evaluate_ceval.py ADDED
@@ -0,0 +1,60 @@
+ import os
+ import glob
+ import re
+ import json
+ import torch
+ import torch.utils.data
+ from transformers import AutoTokenizer, AutoModel
+ from tqdm import tqdm
+
+ tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+ model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).bfloat16().cuda()
+
+ choices = ["A", "B", "C", "D"]
+ choice_tokens = [tokenizer.encode(choice, add_special_tokens=False)[0] for choice in choices]
+
+
+ def build_prompt(text):  # wraps the text in the single-round chat template (问 = question, 答 = answer)
+     return "[Round {}]\n\n问:{}\n\n答:".format(1, text)
+
+
+ extraction_prompt = '综上所述,ABCD中正确的选项是:'  # "In summary, the correct option among A, B, C and D is:"
+
+ accuracy_dict, count_dict = {}, {}
+ with torch.no_grad():
+     for entry in glob.glob("./CEval/val/**/*.jsonl", recursive=True):
+         dataset = []
+         with open(entry, encoding='utf-8') as file:
+             for line in file:
+                 dataset.append(json.loads(line))
+         correct = 0
+         dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
+         for batch in tqdm(dataloader):
+             texts = batch["inputs_pretokenized"]
+             queries = [build_prompt(query) for query in texts]
+             inputs = tokenizer(queries, padding=True, return_tensors="pt", truncation=True, max_length=2048).to('cuda')
+             outputs = model.generate(**inputs, do_sample=False, max_new_tokens=512)  # step 1: greedy free-form reasoning
+             intermediate_outputs = []
+             for idx in range(len(outputs)):
+                 output = outputs.tolist()[idx][len(inputs["input_ids"][idx]):]  # keep only the generated continuation
+                 response = tokenizer.decode(output)
+                 intermediate_outputs.append(response)
+             answer_texts = [text + intermediate + "\n" + extraction_prompt for text, intermediate in
+                             zip(texts, intermediate_outputs)]
+             input_tokens = [build_prompt(answer_text) for answer_text in answer_texts]
+             inputs = tokenizer(input_tokens, padding=True, return_tensors="pt", truncation=True, max_length=2048).to('cuda')
+             outputs = model(**inputs, return_last_logit=True)  # step 2: score the next token after the extraction prompt
+             logits = outputs.logits[:, -1]
+             logits = logits[:, choice_tokens]  # restrict to the tokens for A, B, C and D
+             preds = logits.argmax(dim=-1)
+             correct += (preds.cpu() == batch["label"]).sum().item()
+         accuracy = correct / len(dataset)
+         print(entry, accuracy)
+         accuracy_dict[entry] = accuracy
+         count_dict[entry] = len(dataset)
+
+ acc_total, count_total = 0.0, 0
+ for key in accuracy_dict:
+     acc_total += accuracy_dict[key] * count_dict[key]
+     count_total += count_dict[key]
+ print(acc_total / count_total)
openai_api.py ADDED
@@ -0,0 +1,177 @@
+ # coding=utf-8
+ # Implements API for ChatGLM2-6B in OpenAI's format. (https://platform.openai.com/docs/api-reference/chat)
+ # Usage: python openai_api.py
+ # Visit http://localhost:8000/docs for documents.
+
+
+ import time
+ import torch
+ import uvicorn
+ from pydantic import BaseModel, Field
+ from fastapi import FastAPI, HTTPException
+ from fastapi.middleware.cors import CORSMiddleware
+ from contextlib import asynccontextmanager
+ from typing import Any, Dict, List, Literal, Optional, Union
+ from transformers import AutoTokenizer, AutoModel
+ from sse_starlette.sse import ServerSentEvent, EventSourceResponse
+
+
+ @asynccontextmanager
+ async def lifespan(app: FastAPI): # collects GPU memory
+     yield
+     if torch.cuda.is_available():
+         torch.cuda.empty_cache()
+         torch.cuda.ipc_collect()
+
+
+ app = FastAPI(lifespan=lifespan)
+
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ class ModelCard(BaseModel):
+     id: str
+     object: str = "model"
+     created: int = Field(default_factory=lambda: int(time.time()))
+     owned_by: str = "owner"
+     root: Optional[str] = None
+     parent: Optional[str] = None
+     permission: Optional[list] = None
+
+
+ class ModelList(BaseModel):
+     object: str = "list"
+     data: List[ModelCard] = []
+
+
+ class ChatMessage(BaseModel):
+     role: Literal["user", "assistant", "system"]
+     content: str
+
+
+ class DeltaMessage(BaseModel):
+     role: Optional[Literal["user", "assistant", "system"]] = None
+     content: Optional[str] = None
+
+
+ class ChatCompletionRequest(BaseModel):
+     model: str
+     messages: List[ChatMessage]
+     temperature: Optional[float] = None
+     top_p: Optional[float] = None
+     max_length: Optional[int] = None
+     stream: Optional[bool] = False
+
+
+ class ChatCompletionResponseChoice(BaseModel):
+     index: int
+     message: ChatMessage
+     finish_reason: Literal["stop", "length"]
+
+
+ class ChatCompletionResponseStreamChoice(BaseModel):
+     index: int
+     delta: DeltaMessage
+     finish_reason: Optional[Literal["stop", "length"]]
+
+
+ class ChatCompletionResponse(BaseModel):
+     model: str
+     object: Literal["chat.completion", "chat.completion.chunk"]
+     choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
+     created: Optional[int] = Field(default_factory=lambda: int(time.time()))
+
+
+ @app.get("/v1/models", response_model=ModelList)
+ async def list_models():
+     global model_args
+     model_card = ModelCard(id="gpt-3.5-turbo")
+     return ModelList(data=[model_card])
+
+
+ @app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
+ async def create_chat_completion(request: ChatCompletionRequest):
+     global model, tokenizer
+
+     if request.messages[-1].role != "user":  # the last message must be the new user query
+         raise HTTPException(status_code=400, detail="Invalid request")
+     query = request.messages[-1].content
+
+     prev_messages = request.messages[:-1]
+     if len(prev_messages) > 0 and prev_messages[0].role == "system":  # a leading system message is prepended to the query
+         query = prev_messages.pop(0).content + query
+
+     history = []
+     if len(prev_messages) % 2 == 0:  # earlier messages become (user, assistant) pairs
+         for i in range(0, len(prev_messages), 2):
+             if prev_messages[i].role == "user" and prev_messages[i+1].role == "assistant":
+                 history.append([prev_messages[i].content, prev_messages[i+1].content])
+
+     if request.stream:
+         generate = predict(query, history, request.model)
+         return EventSourceResponse(generate, media_type="text/event-stream")
+
+     response, _ = model.chat(tokenizer, query, history=history)
+     choice_data = ChatCompletionResponseChoice(
+         index=0,
+         message=ChatMessage(role="assistant", content=response),
+         finish_reason="stop"
+     )
+
+     return ChatCompletionResponse(model=request.model, choices=[choice_data], object="chat.completion")
+
+
+ async def predict(query: str, history: List[List[str]], model_id: str):
+     global model, tokenizer
+
+     choice_data = ChatCompletionResponseStreamChoice(  # first chunk announces the assistant role
+         index=0,
+         delta=DeltaMessage(role="assistant"),
+         finish_reason=None
+     )
+     chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk")
+     yield "{}".format(chunk.json(exclude_unset=True, ensure_ascii=False))
+
+     current_length = 0
+
+     for new_response, _ in model.stream_chat(tokenizer, query, history):
+         if len(new_response) == current_length:
+             continue
+
+         new_text = new_response[current_length:]  # send only the newly generated suffix
+         current_length = len(new_response)
+
+         choice_data = ChatCompletionResponseStreamChoice(
+             index=0,
+             delta=DeltaMessage(content=new_text),
+             finish_reason=None
+         )
+         chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk")
+         yield "{}".format(chunk.json(exclude_unset=True, ensure_ascii=False))
+
+
+     choice_data = ChatCompletionResponseStreamChoice(  # final chunk carries the stop reason
+         index=0,
+         delta=DeltaMessage(),
+         finish_reason="stop"
+     )
+     chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk")
+     yield "{}".format(chunk.json(exclude_unset=True, ensure_ascii=False))
+     yield '[DONE]'
+
+
+
+ if __name__ == "__main__":
+     tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+     model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
+     # Multi-GPU support: replace the line above with the two lines below and set num_gpus to your actual number of GPUs
+     # from utils import load_model_on_gpus
+     # model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
+     model.eval()
+
+     uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
ptuning/README.md ADDED
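@@ -0,0 +1,161 @@
+ # ChatGLM2-6B-PT
+ This repository implements fine-tuning of the ChatGLM2-6B model with [P-Tuning v2](https://github.com/THUDM/P-tuning-v2). P-Tuning v2 reduces the number of parameters to fine-tune to 0.1% of the original; combined with model quantization, gradient checkpointing and similar methods, it can run with as little as 7GB of GPU memory.
+
+ The following uses the [ADGEN](https://aclanthology.org/D19-1321.pdf) (advertisement generation) dataset as an example to show how to use the code.
+
+ ## Dependencies
+ Besides the dependencies of ChatGLM2-6B, fine-tuning requires the following packages:
+ ```
+ pip install rouge_chinese nltk jieba datasets
+ ```
+ ## Usage
+
+ ### Download the dataset
+ The ADGEN task is to generate an advertisement text (`summary`) from the input (`content`).
+
+ ```json
+ {
+     "content": "类型#上衣*版型#宽松*版型#显瘦*图案#线条*衣样式#衬衫*衣袖型#泡泡袖*衣款式#抽绳",
+     "summary": "这件衬衫的款式非常的宽松,利落的线条可以很好的隐藏身材上的小缺点,穿在身上有着很好的显瘦效果。领口装饰了一个可爱的抽绳,漂亮的绳结展现出了十足的个性,配合时尚的泡泡袖型,尽显女性甜美可爱的气息。"
+ }
+ ```
+
+ Download the preprocessed ADGEN dataset from [Google Drive](https://drive.google.com/file/d/13_vf0xRTQsyneRKdD1bZIr93vBGOczrk/view?usp=sharing) or [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/b3f119a008264b1cabd1/?dl=1) and put the extracted `AdvertiseGen` directory in this directory.
+
+ ### Training
+
+ #### P-Tuning v2
+
+ Run the following command to train:
+ ```shell
+ bash train.sh
+ ```
+ `PRE_SEQ_LEN` and `LR` in `train.sh` are the soft prompt length and the training learning rate; tune them for the best results. P-Tuning v2 freezes all of the original model parameters. The quantization level at which the original model is loaded can be set with `quantization_bit`; without this option the model is loaded in FP16 precision.
+
+ With the default configuration (`quantization_bit=4`, `per_device_train_batch_size=1`, `gradient_accumulation_steps=16`), the INT4 model parameters are frozen, and each training iteration runs 16 accumulated forward and backward passes with a batch size of 1, which is equivalent to a total batch size of 16. In this setting as little as 6.7GB of GPU memory is required. To improve training efficiency at the same total batch size, increase `per_device_train_batch_size` while keeping the product of the two values unchanged; this also uses more GPU memory, so adjust it to your situation.
+
+ If you want to [load the model locally](../README.md#从本地加载模型), change `THUDM/chatglm2-6b` in `train.sh` to your local model path.
+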
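The trade-off between `per_device_train_batch_size` and `gradient_accumulation_steps` is simple arithmetic: on a single device, their product is the total batch size. The sketch below lists every pair that keeps the default total of 16; both helper names are hypothetical, and a single GPU is assumed.

```python
def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps):
    """Total batch size per optimizer step on a single device."""
    return per_device_train_batch_size * gradient_accumulation_steps


def equivalent_settings(total_batch_size):
    """All (per_device_train_batch_size, gradient_accumulation_steps) pairs with the same product."""
    return [(b, total_batch_size // b)
            for b in range(1, total_batch_size + 1)
            if total_batch_size % b == 0]


# The default configuration: batch size 1 accumulated over 16 steps.
print(effective_batch_size(1, 16))

# Alternatives with the same total batch size; larger per-device batches need more GPU memory.
for batch, accumulation in equivalent_settings(16):
    print(f"per_device_train_batch_size={batch}, gradient_accumulation_steps={accumulation}")
```

Moving from `(1, 16)` towards `(16, 1)` reduces the number of sequential passes per update at the cost of higher GPU memory use.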
+ #### Finetune
+
+ For full-parameter fine-tuning, install [Deepspeed](https://github.com/microsoft/DeepSpeed) and then run the following command:
+
+ ```shell
+ bash ds_train_finetune.sh
+ ```
+
+ ### Inference
+
+ During P-Tuning v2 training only the parameters of the PrefixEncoder are saved, so inference needs to load both the original ChatGLM2-6B model and the PrefixEncoder weights. Specify the following arguments in `evaluate.sh`:
+
+ ```shell
+ --model_name_or_path THUDM/chatglm2-6b
+ --ptuning_checkpoint $CHECKPOINT_PATH
+ ```
+
+ If the checkpoint comes from full-parameter fine-tuning, only `model_name_or_path` needs to be set, as before:
+
+ ```shell
+ --model_name_or_path $CHECKPOINT_PATH
+ ```
+
+ The evaluation metrics are Chinese Rouge score and BLEU-4. The generated results are saved in
+ `./output/adgen-chatglm2-6b-pt-128-2e-2/generated_predictions.txt`.
+
+ ### Examples
+ #### Example 1
+ * Input: 类型#上衣\*材质#牛仔布\*颜色#白色\*风格#简约\*图案#刺绣\*衣样式#外套\*衣款式#破洞
+ * Label: 简约而不简单的牛仔外套,白色的衣身十分百搭。衣身多处有做旧破洞设计,打破单调乏味,增加一丝造型看点。衣身后背处有趣味刺绣装饰,丰富层次感,彰显别样时尚。
+ * Output [before fine-tuning]: 这件上衣的材质是牛仔布,颜色是白色,风格是简约,图案是刺绣,衣样式是外套,衣款式是破洞。
+ * Output [after fine-tuning]: 这是一款简约的牛仔外套,破洞设计,将牛仔布破洞,带来一种随意与个性。破洞的牛仔外套,展现出时尚气息,带来一种休闲感。同时,刺绣图案,让整件外套更加立体。
+
+ #### Example 2
+
+ * Input: 类型#裙\*版型#显瘦\*风格#文艺\*风格#简约\*图案#印花\*图案#撞色\*裙下摆#压褶\*裙长#连衣裙\*裙领型#圆领
+ * Label: 文艺个性的印花连衣裙,藏青色底蕴,低调又大气,撞色太阳花分布整个裙身,绚丽而美好,带来时尚减龄的气质。基础款的舒适圆领,简约不失大方,勾勒精致脸庞。领后是一粒包布扣固定,穿脱十分方便。前片立体的打褶设计,搭配后片压褶的做工,增添层次和空间感,显瘦又有型。
+ * Output [before fine-tuning]: 类型#裙\*版型#显瘦\*风格#文艺\*风格#简约\*图案#印花\*图案#撞色\*裙下摆#压褶\*裙长#连衣裙\*裙领型#圆领 1\. 连衣裙:简约风格,裙长为膝盖以上,裙领型为圆领。2\. 裙下摆:压褶设计,使裙摆呈现出流畅的褶皱效果。3\. 裙领型:裙领型为圆领,使穿上连衣裙后更加有型。4\. 版型:采用显瘦设计,让连衣裙看起来更加苗条。5\. 风格:文艺风格,让连衣裙更加有内涵和品味。6\. 图案:印花设计,在连衣裙上印有独特的图案。7\. 撞色:采用撞色设计,让连衣裙在色彩上更加鲜明、富有层次感。
+ * Output [after fine-tuning]: 这是一款文艺范的连衣裙,以印花为元素,采用简约的印花,既能够突出文艺气质,又能够展现简约风。在印花的同时又有领子和裙摆的压褶设计,更加凸显文艺气质。简约而不会过于单调,搭配出街,穿着十分舒适。
+
+
+ ## Model Deployment
+ First load the tokenizer:
+
+ ```python
+ import os
+ import torch
+ from transformers import AutoConfig, AutoModel, AutoTokenizer
+
+ # Load the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+ ```
+
+ 1. To load a P-Tuning checkpoint:
+
+ ```python
+ config = AutoConfig.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, pre_seq_len=128)
+ model = AutoModel.from_pretrained("THUDM/chatglm2-6b", config=config, trust_remote_code=True)
+ prefix_state_dict = torch.load(os.path.join(CHECKPOINT_PATH, "pytorch_model.bin"))
+ new_prefix_state_dict = {}
+ for k, v in prefix_state_dict.items():
+     # Keep only the PrefixEncoder weights and strip their module prefix
+     if k.startswith("transformer.prefix_encoder."):
+         new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
+ model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
+ ```
+ Note that you may need to change `pre_seq_len` to the value you actually used in training. If you [load the model locally](../README.md#从本地加载模型), change `THUDM/chatglm2-6b` to the local model path (not the checkpoint path).
+
+ 2. To load a full-parameter fine-tuning checkpoint, load the whole checkpoint directly:
+
+ ```python
+ model = AutoModel.from_pretrained(CHECKPOINT_PATH, trust_remote_code=True)
+ ```
+
+ You can then quantize the model if needed, or use it directly:
+
+ ```python
+ # Comment out the following line if you don't use quantization
+ model = model.quantize(4)
+ model = model.cuda()
+ model = model.eval()
+
+ response, history = model.chat(tokenizer, "你好", history=[])
+ ```
+
+ You can also run the [web demo](./web_demo.py), which supports loading P-Tuning v2 checkpoints:
+ ```shell
+ bash web_demo.sh
+ ```
+ You may need to edit [web_demo.sh](./web_demo.sh) to match your actual checkpoint.
+
+ ## Using Your Own Dataset
+ Change `train_file`, `validation_file` and `test_file` in `train.sh` and `evaluate.sh` to the paths of your own JSON-format datasets, and change `prompt_column` and `response_column` to the keys in the JSON file that hold the input text and the output text. You may also need to increase `max_source_length` and `max_target_length` to match the maximum input and output lengths in your dataset.
+
+ ## Dialogue Dataset
+
+ To fine-tune the model on multi-turn dialogue data, provide the chat history. For example, the following is the training data for a three-turn dialogue:
+
+ ```json lines
+ {"prompt": "长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "response": "用电脑能读数据流吗?水温多少", "history": []}
+ {"prompt": "95", "response": "上下水管温差怎么样啊?空气是不是都排干净了呢?", "history": [["长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "用电脑能读数据流吗?水温多少"]]}
+ {"prompt": "是的。上下水管都好的", "response": "那就要检查线路了,一般风扇继电器是由电脑控制吸合的,如果电路存在断路,或者电脑坏了的话会出现继电器不吸合的情况!", "history": [["长城h3风扇不转。继电器好的。保险丝好的传感器新的风扇也新的这是为什么。就是继电器缺一个信号线", "用电脑能读数据流吗?水温多少"], ["95", "上下水管温差怎么样啊?空气是不是都排干净了呢?"]]}
+ ```
+
+ During training, set `--history_column` to the key of the chat history in the data (`history` in this example), and the chat history is concatenated automatically. Note that content beyond the input length `max_source_length` is truncated.
+
+ You can refer to the following command:
+
+ ```shell
+ bash train_chat.sh
+ ```
+
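Records in this format can be generated from a plain list of dialogue turns: each turn becomes one record whose `history` holds all earlier turns. The sketch below shows one way to do this, assuming the keys `prompt`, `response` and `history` from the example above; the helper names and the output file name are hypothetical.

```python
import json


def conversation_to_records(turns):
    """Convert ordered (user_text, assistant_text) pairs into one training record per turn."""
    records, history = [], []
    for prompt, response in turns:
        records.append({
            "prompt": prompt,
            "response": response,
            "history": [list(pair) for pair in history],  # copy of all earlier turns
        })
        history.append((prompt, response))
    return records


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder dialogue, for illustration only.
dialogue = [("question 1", "answer 1"), ("question 2", "answer 2"), ("question 3", "answer 3")]
for record in conversation_to_records(dialogue):
    print(json.dumps(record, ensure_ascii=False))

# write_jsonl(conversation_to_records(dialogue), "train_chat.json")  # hypothetical output path
```

A three-turn dialogue therefore yields three records with 0, 1 and 2 history entries, matching the structure of the example data.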
+ ## Citation
+
+ ```
+ @inproceedings{liu2022p,
+   title={P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks},
+   author={Liu, Xiao and Ji, Kaixuan and Fu, Yicheng and Tam, Weng and Du, Zhengxiao and Yang, Zhilin and Tang, Jie},
+   booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
+   pages={61--68},
+   year={2022}
+ }
+ ```
+
+
+
ptuning/arguments.py ADDED
@@ -0,0 +1,224 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from dataclasses import dataclass, field
2
+ from typing import Optional
3
+
4
+
5
+ @dataclass
6
+ class ModelArguments:
7
+ """
8
+ Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
9
+ """
10
+
11
+ model_name_or_path: str = field(
12
+ metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
13
+ )
14
+ ptuning_checkpoint: str = field(
15
+ default=None, metadata={"help": "Path to p-tuning v2 checkpoints"}
16
+ )
17
+ config_name: Optional[str] = field(
18
+ default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
19
+ )
20
+ tokenizer_name: Optional[str] = field(
21
+ default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
22
+ )
23
+ cache_dir: Optional[str] = field(
24
+ default=None,
25
+ metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"},
26
+ )
27
+ use_fast_tokenizer: bool = field(
28
+ default=True,
29
+ metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
30
+ )
31
+ model_revision: str = field(
32
+ default="main",
33
+ metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
34
+ )
35
+ use_auth_token: bool = field(
36
+ default=False,
37
+ metadata={
38
+ "help": (
39
+ "Will use the token generated when running `huggingface-cli login` (necessary to use this script "
40
+ "with private models)."
41
+ )
42
+ },
43
+ )
44
+ resize_position_embeddings: Optional[bool] = field(
45
+ default=None,
46
+ metadata={
47
+ "help": (
48
+ "Whether to automatically resize the position embeddings if `max_source_length` exceeds "
49
+ "the model's position embeddings."
50
+ )
51
+ },
52
+ )
53
+ quantization_bit: Optional[int] = field(
54
+ default=None
55
+ )
56
+ pre_seq_len: Optional[int] = field(
57
+ default=None
58
+ )
59
+ prefix_projection: bool = field(
60
+ default=False
61
+ )
62
+
63
+
64
+ @dataclass
65
+ class DataTrainingArguments:
66
+ """
67
+ Arguments pertaining to what data we are going to input our model for training and eval.
68
+ """
69
+
70
+ lang: Optional[str] = field(default=None, metadata={"help": "Language id for summarization."})
71
+
72
+ dataset_name: Optional[str] = field(
73
+ default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
74
+ )
75
+ dataset_config_name: Optional[str] = field(
76
+ default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
77
+ )
78
+ prompt_column: Optional[str] = field(
79
+ default=None,
80
+ metadata={"help": "The name of the column in the datasets containing the full texts (for summarization)."},
81
+ )
82
+ response_column: Optional[str] = field(
83
+ default=None,
84
+ metadata={"help": "The name of the column in the datasets containing the summaries (for summarization)."},
85
+ )
86
+ history_column: Optional[str] = field(
87
+ default=None,
88
+ metadata={"help": "The name of the column in the datasets containing the history of chat."},
89
+ )
90
+ train_file: Optional[str] = field(
91
+ default=None, metadata={"help": "The input training data file (a jsonlines or csv file)."}
92
+ )
93
+ validation_file: Optional[str] = field(
94
+ default=None,
95
+ metadata={
96
+ "help": (
97
+ "An optional input evaluation data file to evaluate the metrics (rouge) on (a jsonlines or csv file)."
98
+ )
99
+ },
100
+ )
101
+ test_file: Optional[str] = field(
102
+ default=None,
103
+ metadata={
104
+ "help": "An optional input test data file to evaluate the metrics (rouge) on (a jsonlines or csv file)."
105
+ },
106
+ )
107
+ overwrite_cache: bool = field(
108
+ default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
109
+ )
110
+ preprocessing_num_workers: Optional[int] = field(
111
+ default=None,
112
+ metadata={"help": "The number of processes to use for the preprocessing."},
113
+ )
114
+ max_source_length: Optional[int] = field(
115
+ default=1024,
116
+ metadata={
117
+ "help": (
118
+ "The maximum total input sequence length after tokenization. Sequences longer "
119
+ "than this will be truncated, sequences shorter will be padded."
120
+ )
121
+ },
122
+ )
123
+ max_target_length: Optional[int] = field(
124
+ default=128,
125
+ metadata={
126
+ "help": (
127
+ "The maximum total sequence length for target text after tokenization. Sequences longer "
128
+ "than this will be truncated, sequences shorter will be padded."
129
+ )
130
+ },
131
+ )
132
+ val_max_target_length: Optional[int] = field(
133
+ default=None,
134
+ metadata={
135
+ "help": (
136
+ "The maximum total sequence length for validation target text after tokenization. Sequences longer "
137
+ "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`. "
138
+ "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used "
139
+ "during ``evaluate`` and ``predict``."
140
+ )
141
+ },
142
+ )
143
+ pad_to_max_length: bool = field(
144
+ default=False,
145
+ metadata={
146
+ "help": (
147
+ "Whether to pad all samples to model maximum sentence length. "
148
+ "If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
149
+ "efficient on GPU but very bad for TPU."
150
+ )
151
+ },
152
+ )
153
+ max_train_samples: Optional[int] = field(
154
+ default=None,
155
+ metadata={
156
+ "help": (
157
+ "For debugging purposes or quicker training, truncate the number of training examples to this "
158
+ "value if set."
159
+ )
160
+ },
161
+ )
162
+ max_eval_samples: Optional[int] = field(
163
+ default=None,
164
+ metadata={
165
+ "help": (
166
+ "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
167
+ "value if set."
168
+ )
169
+ },
170
+ )
171
+ max_predict_samples: Optional[int] = field(
172
+ default=None,
173
+ metadata={
174
+ "help": (
175
+ "For debugging purposes or quicker training, truncate the number of prediction examples to this "
176
+ "value if set."
177
+ )
178
+ },
179
+ )
180
+ num_beams: Optional[int] = field(
181
+ default=None,
182
+ metadata={
183
+ "help": (
184
+ "Number of beams to use for evaluation. This argument will be passed to ``model.generate``, "
185
+ "which is used during ``evaluate`` and ``predict``."
186
+ )
187
+ },
188
+ )
189
+ ignore_pad_token_for_loss: bool = field(
190
+ default=True,
191
+ metadata={
192
+ "help": "Whether to ignore the tokens corresponding to padded labels in the loss computation or not."
193
+ },
194
+ )
195
+ source_prefix: Optional[str] = field(
196
+ default="", metadata={"help": "A prefix to add before every source text (useful for T5 models)."}
197
+ )
198
+
199
+ forced_bos_token: Optional[str] = field(
200
+ default=None,
201
+ metadata={
202
+ "help": (
203
+ "The token to force as the first generated token after the decoder_start_token_id. "
204
+ "Useful for multilingual models like mBART where the first generated token "
205
+ "needs to be the target language token (Usually it is the target language token)"
206
+ )
207
+ },
208
+ )
209
+
210
+
211
+
212
+ def __post_init__(self):
213
+ if self.dataset_name is None and self.train_file is None and self.validation_file is None and self.test_file is None:
214
+ raise ValueError("Need either a dataset name or a training/validation/test file.")
215
+ else:
216
+ if self.train_file is not None:
217
+ extension = self.train_file.split(".")[-1]
218
+ assert extension in ["csv", "json"], "`train_file` should be a csv or a json file."
219
+ if self.validation_file is not None:
220
+ extension = self.validation_file.split(".")[-1]
221
+ assert extension in ["csv", "json"], "`validation_file` should be a csv or a json file."
222
+ if self.val_max_target_length is None:
223
+ self.val_max_target_length = self.max_target_length
224
+
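The `__post_init__` checks above can be exercised in isolation. The following is a minimal sketch, assuming a trimmed-down stand-in dataclass (`DataArgsSketch` is a hypothetical name, not part of this commit) that keeps only the fields the checks touch. Like the committed code, it validates the extension of `train_file` and `validation_file` only. Unlike the committed code, it raises `ValueError` rather than using `assert`, because assertions are skipped under `python -O`; that substitution is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_EXTENSIONS = ("csv", "json")


@dataclass
class DataArgsSketch:
    """Hypothetical, trimmed-down stand-in for DataTrainingArguments."""

    dataset_name: Optional[str] = None
    train_file: Optional[str] = None
    validation_file: Optional[str] = None
    test_file: Optional[str] = None
    max_target_length: int = 128
    val_max_target_length: Optional[int] = None

    def __post_init__(self):
        # At least one data source must be given.
        if (
            self.dataset_name is None
            and self.train_file is None
            and self.validation_file is None
            and self.test_file is None
        ):
            raise ValueError("Need either a dataset name or a training/validation/test file.")

        # Only the train and validation files are checked, as in the original.
        for name in ("train_file", "validation_file"):
            path = getattr(self, name)
            if path is not None and path.split(".")[-1] not in ALLOWED_EXTENSIONS:
                raise ValueError(f"`{name}` should be a csv or a json file.")

        # The validation target length falls back to the training one.
        if self.val_max_target_length is None:
            self.val_max_target_length = self.max_target_length


args = DataArgsSketch(train_file="AdvertiseGen/train.json")
print(args.val_max_target_length)
```

Because the fallback runs in `__post_init__`, later code can read `val_max_target_length` without checking for `None`.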
ptuning/deepspeed.json ADDED
@@ -0,0 +1,21 @@
1
+ {
2
+ "train_micro_batch_size_per_gpu": "auto",
3
+ "zero_allow_untested_optimizer": true,
4
+ "fp16": {
5
+ "enabled": "auto",
6
+ "loss_scale": 0,
7
+ "initial_scale_power": 16,
8
+ "loss_scale_window": 1000,
9
+ "hysteresis": 2,
10
+ "min_loss_scale": 1
11
+ },
12
+ "zero_optimization": {
13
+ "stage": 2,
14
+ "allgather_partitions": true,
15
+ "allgather_bucket_size": 5e8,
16
+ "overlap_comm": false,
17
+ "reduce_scatter": true,
18
+ "reduce_bucket_size": 5e8,
19
+ "contiguous_gradients" : true
20
+ }
21
+ }
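The `fp16` block above can be read as follows: in DeepSpeed, `"loss_scale": 0` selects dynamic loss scaling, and the starting scale is 2 raised to `initial_scale_power`. The sketch below is only an illustration of that reading; `describe_loss_scaling` is a hypothetical helper and not a DeepSpeed API. The `"auto"` values are left untouched here, since they are filled in by the Hugging Face Trainer integration at launch time.

```python
import json

# The fp16 section of the config above, copied verbatim.
FP16_BLOCK = """
{
  "enabled": "auto",
  "loss_scale": 0,
  "initial_scale_power": 16,
  "loss_scale_window": 1000,
  "hysteresis": 2,
  "min_loss_scale": 1
}
"""


def describe_loss_scaling(fp16_cfg):
    """Hypothetical helper: summarize how the loss scale will behave."""
    if fp16_cfg["loss_scale"] == 0:
        # 0 means "dynamic": start high and back off when overflows occur.
        return {
            "mode": "dynamic",
            "initial_scale": 2 ** fp16_cfg["initial_scale_power"],
            "min_scale": fp16_cfg["min_loss_scale"],
        }
    # Any other value is used as a fixed (static) loss scale.
    return {
        "mode": "static",
        "initial_scale": fp16_cfg["loss_scale"],
        "min_scale": fp16_cfg["loss_scale"],
    }


summary = describe_loss_scaling(json.loads(FP16_BLOCK))
print(summary)
```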
ptuning/ds_train_finetune.sh ADDED
@@ -0,0 +1,28 @@
1
+
2
+ LR=1e-4
3
+
4
+ MASTER_PORT=$(shuf -n 1 -i 10000-65535)
5
+
6
+ deepspeed --num_gpus=4 --master_port $MASTER_PORT main.py \
7
+ --deepspeed deepspeed.json \
8
+ --do_train \
9
+ --train_file AdvertiseGen/train.json \
10
+ --test_file AdvertiseGen/dev.json \
11
+ --prompt_column content \
12
+ --response_column summary \
13
+ --overwrite_cache \
14
+ --model_name_or_path THUDM/chatglm2-6b \
15
+ --output_dir ./output/adgen-chatglm2-6b-ft-$LR \
16
+ --overwrite_output_dir \
17
+ --max_source_length 64 \
18
+ --max_target_length 64 \
19
+ --per_device_train_batch_size 4 \
20
+ --per_device_eval_batch_size 1 \
21
+ --gradient_accumulation_steps 1 \
22
+ --predict_with_generate \
23
+ --max_steps 5000 \
24
+ --logging_steps 10 \
25
+ --save_steps 1000 \
26
+ --learning_rate $LR \
27
+ --fp16
28
+
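The number of samples consumed per optimizer step follows from three flags in the script above: `--num_gpus`, `--per_device_train_batch_size`, and `--gradient_accumulation_steps`. A small sketch of that arithmetic, with `effective_batch_size` as a hypothetical helper name:

```python
def effective_batch_size(num_gpus, per_device_batch_size, grad_accum_steps):
    """Samples that contribute to one optimizer step across all devices."""
    return num_gpus * per_device_batch_size * grad_accum_steps


# ds_train_finetune.sh: 4 GPUs, batch size 4 per device, no accumulation.
finetune_batch = effective_batch_size(num_gpus=4, per_device_batch_size=4, grad_accum_steps=1)

# With --max_steps 5000, this many samples are drawn over the whole run.
finetune_samples = finetune_batch * 5000

print(finetune_batch, finetune_samples)
```

For comparison, `train.sh` reaches the same per-step batch with 1 GPU, a per-device batch size of 1, and 16 accumulation steps.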
ptuning/evaluate.sh ADDED
@@ -0,0 +1,22 @@
1
+ PRE_SEQ_LEN=128
2
+ CHECKPOINT=adgen-chatglm2-6b-pt-128-2e-2
3
+ STEP=3000
4
+ NUM_GPUS=1
5
+
6
+ torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS main.py \
7
+ --do_predict \
8
+ --validation_file AdvertiseGen/dev.json \
9
+ --test_file AdvertiseGen/dev.json \
10
+ --overwrite_cache \
11
+ --prompt_column content \
12
+ --response_column summary \
13
+ --model_name_or_path chatglm2-6b \
14
+ --ptuning_checkpoint ./output/$CHECKPOINT/checkpoint-$STEP \
15
+ --output_dir ./output/$CHECKPOINT \
16
+ --overwrite_output_dir \
17
+ --max_source_length 64 \
18
+ --max_target_length 64 \
19
+ --per_device_eval_batch_size 1 \
20
+ --predict_with_generate \
21
+ --pre_seq_len $PRE_SEQ_LEN \
22
+ --quantization_bit 4
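When `--ptuning_checkpoint` is passed, as in the script above, `main.py` loads the base model first and then restores only the prefix encoder weights, after stripping the `transformer.prefix_encoder.` prefix from each key. The sketch below reproduces that key filtering on a plain dictionary so it can be run without PyTorch. The parameter names and values in `checkpoint` are hypothetical placeholders, not taken from a real checkpoint.

```python
PREFIX = "transformer.prefix_encoder."


def extract_prefix_encoder_state(state_dict, prefix=PREFIX):
    """Keep only prefix-encoder entries and drop the module path from keys."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Hypothetical checkpoint contents; real values would be tensors.
checkpoint = {
    "transformer.prefix_encoder.embedding.weight": "tensor-A",
    "transformer.some_other_module.weight": "tensor-B",
}

prefix_state = extract_prefix_encoder_state(checkpoint)
print(prefix_state)
```

The stripped keys are what a submodule's `load_state_dict` expects, since a submodule does not know the path under which its parent stores it.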
ptuning/evaluate_finetune.sh ADDED
@@ -0,0 +1,19 @@
1
+ CHECKPOINT=adgen-chatglm2-6b-ft-1e-4
2
+ STEP=3000
3
+ NUM_GPUS=1
4
+
5
+ torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS main.py \
6
+ --do_predict \
7
+ --validation_file AdvertiseGen/dev.json \
8
+ --test_file AdvertiseGen/dev.json \
9
+ --overwrite_cache \
10
+ --prompt_column content \
11
+ --response_column summary \
12
+ --model_name_or_path ./output/$CHECKPOINT/checkpoint-$STEP \
13
+ --output_dir ./output/$CHECKPOINT \
14
+ --overwrite_output_dir \
15
+ --max_source_length 256 \
16
+ --max_target_length 256 \
17
+ --per_device_eval_batch_size 1 \
18
+ --predict_with_generate \
19
+ --fp16_full_eval
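With `--do_predict` and `--predict_with_generate`, `main.py` writes `generated_predictions.txt` into the output directory, one JSON object per line with the keys `labels` and `predict`. A minimal sketch for reading that file back is shown below. `read_predictions` is a hypothetical helper, and the two sample lines are made up for illustration.

```python
import json


def read_predictions(lines):
    """Parse JSON-lines records into (reference, prediction) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        pairs.append((record["labels"], record["predict"]))
    return pairs


# Made-up sample in the same shape main.py writes.
sample_lines = [
    '{"labels": "reference text one", "predict": "generated text one"}\n',
    '{"labels": "reference text two", "predict": "generated text two"}\n',
]

pairs = read_predictions(sample_lines)
print(len(pairs), pairs[0])
```

To read a real run, pass an open file handle instead of `sample_lines`, for example `open(path, encoding="utf-8")`, since the script writes the file as UTF-8 with `ensure_ascii=False`.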
ptuning/main.py ADDED
@@ -0,0 +1,411 @@
1
+ #!/usr/bin/env python
2
+ # coding=utf-8
3
+ # Copyright 2021 The HuggingFace Team. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """
17
+ Fine-tuning the library models for sequence to sequence.
18
+ """
19
+ # You can also adapt this script to your own sequence-to-sequence task. Pointers for this are left as comments.
20
+
21
+ import logging
22
+ import os
23
+ import sys
24
+ import json
25
+
26
+ import numpy as np
27
+ from datasets import load_dataset
28
+ import jieba
29
+ from rouge_chinese import Rouge
30
+ from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
31
+ import torch
32
+
33
+ import transformers
34
+ from transformers import (
35
+ AutoConfig,
36
+ AutoModel,
37
+ AutoTokenizer,
38
+ DataCollatorForSeq2Seq,
39
+ HfArgumentParser,
40
+ Seq2SeqTrainingArguments,
41
+ set_seed,
42
+ )
43
+ from trainer_seq2seq import Seq2SeqTrainer
44
+
45
+ from arguments import ModelArguments, DataTrainingArguments
46
+
47
+ logger = logging.getLogger(__name__)
48
+
49
+ def main():
50
+ parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments))
51
+ if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
52
+ # If we pass only one argument to the script and it's the path to a json file,
53
+ # let's parse it to get our arguments.
54
+ model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
55
+ else:
56
+ model_args, data_args, training_args = parser.parse_args_into_dataclasses()
57
+
58
+ # Setup logging
59
+ logging.basicConfig(
60
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
61
+ datefmt="%m/%d/%Y %H:%M:%S",
62
+ handlers=[logging.StreamHandler(sys.stdout)],
63
+ )
64
+
65
+ if training_args.should_log:
66
+ # The default of training_args.log_level is passive, so we set log level at info here to have that default.
67
+ transformers.utils.logging.set_verbosity_info()
68
+
69
+ log_level = training_args.get_process_log_level()
70
+ logger.setLevel(log_level)
71
+ # datasets.utils.logging.set_verbosity(log_level)
72
+ transformers.utils.logging.set_verbosity(log_level)
73
+ transformers.utils.logging.enable_default_handler()
74
+ transformers.utils.logging.enable_explicit_format()
75
+
76
+ # Log on each process the small summary:
77
+ logger.warning(
78
+ f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, "
79
+ + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
80
+ )
81
+ logger.info(f"Training/evaluation parameters {training_args}")
82
+
83
+ # Set seed before initializing model.
84
+ set_seed(training_args.seed)
85
+
86
+ # Load dataset
87
+ data_files = {}
88
+ if data_args.train_file is not None:
89
+ data_files["train"] = data_args.train_file
90
+ extension = data_args.train_file.split(".")[-1]
91
+ if data_args.validation_file is not None:
92
+ data_files["validation"] = data_args.validation_file
93
+ extension = data_args.validation_file.split(".")[-1]
94
+ if data_args.test_file is not None:
95
+ data_files["test"] = data_args.test_file
96
+ extension = data_args.test_file.split(".")[-1]
97
+
98
+ raw_datasets = load_dataset(
99
+ extension,
100
+ data_files=data_files,
101
+ cache_dir=model_args.cache_dir,
102
+ use_auth_token=True if model_args.use_auth_token else None,
103
+ )
104
+
105
+ # Load pretrained model and tokenizer
106
+ config = AutoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
107
+ config.pre_seq_len = model_args.pre_seq_len
108
+ config.prefix_projection = model_args.prefix_projection
109
+
110
+ tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
111
+
112
+ if model_args.ptuning_checkpoint is not None:
113
+ # Evaluation
114
+ # Loading extra state dict of prefix encoder
115
+ model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True)
116
+ prefix_state_dict = torch.load(os.path.join(model_args.ptuning_checkpoint, "pytorch_model.bin"))
117
+ new_prefix_state_dict = {}
118
+ for k, v in prefix_state_dict.items():
119
+ if k.startswith("transformer.prefix_encoder."):
120
+ new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
121
+ model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
122
+ else:
123
+ model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True)
124
+
125
+ if model_args.quantization_bit is not None:
126
+ print(f"Quantized to {model_args.quantization_bit} bit")
127
+ model = model.quantize(model_args.quantization_bit)
128
+ if model_args.pre_seq_len is not None:
129
+ # P-tuning v2
130
+ model = model.half()
131
+ model.transformer.prefix_encoder.float()
132
+ else:
133
+ # Finetune
134
+ model = model.float()
135
+
136
+ prefix = data_args.source_prefix if data_args.source_prefix is not None else ""
137
+
138
+ # Preprocessing the datasets.
139
+ # We need to tokenize inputs and targets.
140
+ if training_args.do_train:
141
+ column_names = raw_datasets["train"].column_names
142
+ elif training_args.do_eval:
143
+ column_names = raw_datasets["validation"].column_names
144
+ elif training_args.do_predict:
145
+ column_names = raw_datasets["test"].column_names
146
+ else:
147
+ logger.info("There is nothing to do. Please pass `do_train`, `do_eval` and/or `do_predict`.")
148
+ return
149
+
150
+ # Get the column names for input/target.
151
+ prompt_column = data_args.prompt_column
152
+ response_column = data_args.response_column
153
+ history_column = data_args.history_column
154
+
155
+ # Temporarily set max_target_length for training.
156
+ max_target_length = data_args.max_target_length
157
+
158
+ def preprocess_function_eval(examples):
159
+ inputs, targets = [], []
160
+ for i in range(len(examples[prompt_column])):
161
+ if examples[prompt_column][i] and examples[response_column][i]:
162
+ query = examples[prompt_column][i]
163
+ history = examples[history_column][i] if history_column is not None else None
164
+ prompt = tokenizer.build_prompt(query, history)
165
+ inputs.append(prompt)
166
+ targets.append(examples[response_column][i])
167
+
168
+ inputs = [prefix + inp for inp in inputs]
169
+ model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, truncation=True, padding=True)
170
+ labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)
171
+
172
+ if data_args.ignore_pad_token_for_loss:
173
+ labels["input_ids"] = [
174
+ [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
175
+ ]
176
+ model_inputs["labels"] = labels["input_ids"]
177
+
178
+ return model_inputs
179
+
180
+ def preprocess_function_train(examples):
181
+ max_seq_length = data_args.max_source_length + data_args.max_target_length
182
+
183
+ model_inputs = {
184
+ "input_ids": [],
185
+ "labels": [],
186
+ }
187
+ for i in range(len(examples[prompt_column])):
188
+ if examples[prompt_column][i] and examples[response_column][i]:
189
+ query, answer = examples[prompt_column][i], examples[response_column][i]
190
+
191
+ history = examples[history_column][i] if history_column is not None else None
192
+ prompt = tokenizer.build_prompt(query, history)
193
+
194
+ prompt = prefix + prompt
195
+ a_ids = tokenizer.encode(text=prompt, add_special_tokens=True, truncation=True,
196
+ max_length=data_args.max_source_length)
197
+ b_ids = tokenizer.encode(text=answer, add_special_tokens=False, truncation=True,
198
+ max_length=data_args.max_target_length)
199
+
200
+ context_length = len(a_ids)
201
+ input_ids = a_ids + b_ids + [tokenizer.eos_token_id]
202
+ labels = [tokenizer.pad_token_id] * context_length + b_ids + [tokenizer.eos_token_id]
203
+
204
+ pad_len = max_seq_length - len(input_ids)
205
+ input_ids = input_ids + [tokenizer.pad_token_id] * pad_len
206
+ labels = labels + [tokenizer.pad_token_id] * pad_len
207
+ if data_args.ignore_pad_token_for_loss:
208
+ labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]
209
+
210
+ model_inputs["input_ids"].append(input_ids)
211
+ model_inputs["labels"].append(labels)
212
+
213
+ return model_inputs
214
+
215
+ def print_dataset_example(example):
216
+ print("input_ids", example["input_ids"])
217
+ print("inputs", tokenizer.decode(example["input_ids"]))
218
+ print("label_ids", example["labels"])
219
+ print("labels", tokenizer.decode(example["labels"]))
220
+
221
+ if training_args.do_train:
222
+ if "train" not in raw_datasets:
223
+ raise ValueError("--do_train requires a train dataset")
224
+ train_dataset = raw_datasets["train"]
225
+ if data_args.max_train_samples is not None:
226
+ max_train_samples = min(len(train_dataset), data_args.max_train_samples)
227
+ train_dataset = train_dataset.select(range(max_train_samples))
228
+ with training_args.main_process_first(desc="train dataset map pre-processing"):
229
+ train_dataset = train_dataset.map(
230
+ preprocess_function_train,
231
+ batched=True,
232
+ num_proc=data_args.preprocessing_num_workers,
233
+ remove_columns=column_names,
234
+ load_from_cache_file=not data_args.overwrite_cache,
235
+ desc="Running tokenizer on train dataset",
236
+ )
237
+ print_dataset_example(train_dataset[0])
238
+
239
+ if training_args.do_eval:
240
+ max_target_length = data_args.val_max_target_length
241
+ if "validation" not in raw_datasets:
242
+ raise ValueError("--do_eval requires a validation dataset")
243
+ eval_dataset = raw_datasets["validation"]
244
+ if data_args.max_eval_samples is not None:
245
+ max_eval_samples = min(len(eval_dataset), data_args.max_eval_samples)
246
+ eval_dataset = eval_dataset.select(range(max_eval_samples))
247
+ with training_args.main_process_first(desc="validation dataset map pre-processing"):
248
+ eval_dataset = eval_dataset.map(
249
+ preprocess_function_eval,
250
+ batched=True,
251
+ num_proc=data_args.preprocessing_num_workers,
252
+ remove_columns=column_names,
253
+ load_from_cache_file=not data_args.overwrite_cache,
254
+ desc="Running tokenizer on validation dataset",
255
+ )
256
+ print_dataset_example(eval_dataset[0])
257
+
258
+ if training_args.do_predict:
259
+ max_target_length = data_args.val_max_target_length
260
+ if "test" not in raw_datasets:
261
+ raise ValueError("--do_predict requires a test dataset")
262
+ predict_dataset = raw_datasets["test"]
263
+ if data_args.max_predict_samples is not None:
264
+ max_predict_samples = min(len(predict_dataset), data_args.max_predict_samples)
265
+ predict_dataset = predict_dataset.select(range(max_predict_samples))
266
+ with training_args.main_process_first(desc="prediction dataset map pre-processing"):
267
+ predict_dataset = predict_dataset.map(
268
+ preprocess_function_eval,
269
+ batched=True,
270
+ num_proc=data_args.preprocessing_num_workers,
271
+ remove_columns=column_names,
272
+ load_from_cache_file=not data_args.overwrite_cache,
273
+ desc="Running tokenizer on prediction dataset",
274
+ )
275
+ print_dataset_example(predict_dataset[0])
276
+
277
+ # Data collator
278
+ label_pad_token_id = -100 if data_args.ignore_pad_token_for_loss else tokenizer.pad_token_id
279
+ data_collator = DataCollatorForSeq2Seq(
280
+ tokenizer,
281
+ model=model,
282
+ label_pad_token_id=label_pad_token_id,
283
+ pad_to_multiple_of=None,
284
+ padding=False
285
+ )
286
+
287
+ # Metric
288
+ def compute_metrics(eval_preds):
289
+ preds, labels = eval_preds
290
+ if isinstance(preds, tuple):
291
+ preds = preds[0]
292
+ decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
293
+ if data_args.ignore_pad_token_for_loss:
294
+ # Replace -100 in the labels as we can't decode them.
295
+ labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
296
+ decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
297
+
298
+ score_dict = {
299
+ "rouge-1": [],
300
+ "rouge-2": [],
301
+ "rouge-l": [],
302
+ "bleu-4": []
303
+ }
304
+ for pred, label in zip(decoded_preds, decoded_labels):
305
+ hypothesis = list(jieba.cut(pred))
306
+ reference = list(jieba.cut(label))
307
+ rouge = Rouge()
308
+ scores = rouge.get_scores(' '.join(hypothesis), ' '.join(reference))
309
+ result = scores[0]
310
+
311
+ for k, v in result.items():
312
+ score_dict[k].append(round(v["f"] * 100, 4))
313
+ bleu_score = sentence_bleu([list(label)], list(pred), smoothing_function=SmoothingFunction().method3)
314
+ score_dict["bleu-4"].append(round(bleu_score * 100, 4))
315
+
316
+ for k, v in score_dict.items():
317
+ score_dict[k] = float(np.mean(v))
318
+ return score_dict
319
+
320
+ # Override the decoding parameters of Seq2SeqTrainer
321
+ training_args.generation_max_length = (
322
+ training_args.generation_max_length
323
+ if training_args.generation_max_length is not None
324
+ else data_args.val_max_target_length
325
+ )
326
+ training_args.generation_num_beams = (
327
+ data_args.num_beams if data_args.num_beams is not None else training_args.generation_num_beams
328
+ )
329
+ # Initialize our Trainer
330
+ trainer = Seq2SeqTrainer(
331
+ model=model,
332
+ args=training_args,
333
+ train_dataset=train_dataset if training_args.do_train else None,
334
+ eval_dataset=eval_dataset if training_args.do_eval else None,
335
+ tokenizer=tokenizer,
336
+ data_collator=data_collator,
337
+ compute_metrics=compute_metrics if training_args.predict_with_generate else None,
338
+ save_prefixencoder=model_args.pre_seq_len is not None
339
+ )
340
+
341
+ # Training
342
+ if training_args.do_train:
343
+ checkpoint = None
344
+ if training_args.resume_from_checkpoint is not None:
345
+ checkpoint = training_args.resume_from_checkpoint
346
+ # elif last_checkpoint is not None:
347
+ # checkpoint = last_checkpoint
348
+ model.gradient_checkpointing_enable()
349
+ model.enable_input_require_grads()
350
+ train_result = trainer.train(resume_from_checkpoint=checkpoint)
351
+ # trainer.save_model() # Saves the tokenizer too for easy upload
352
+
353
+ metrics = train_result.metrics
354
+ max_train_samples = (
355
+ data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
356
+ )
357
+ metrics["train_samples"] = min(max_train_samples, len(train_dataset))
358
+
359
+ trainer.log_metrics("train", metrics)
360
+ trainer.save_metrics("train", metrics)
361
+ trainer.save_state()
362
+
363
+ # Evaluation
364
+ results = {}
365
+ max_seq_length = data_args.max_source_length + data_args.max_target_length + 1
366
+ if training_args.do_eval:
367
+ logger.info("*** Evaluate ***")
368
+ metrics = trainer.evaluate(metric_key_prefix="eval", do_sample=True, top_p=0.7, max_length=max_seq_length, temperature=0.95)
369
+ max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
370
+ metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
371
+
372
+ trainer.log_metrics("eval", metrics)
373
+ trainer.save_metrics("eval", metrics)
374
+
375
+ if training_args.do_predict:
376
+ logger.info("*** Predict ***")
377
+ predict_results = trainer.predict(predict_dataset, metric_key_prefix="predict", max_length=max_seq_length, do_sample=True, top_p=0.7, temperature=0.95)
378
+ metrics = predict_results.metrics
379
+ max_predict_samples = (
380
+ data_args.max_predict_samples if data_args.max_predict_samples is not None else len(predict_dataset)
381
+ )
382
+ metrics["predict_samples"] = min(max_predict_samples, len(predict_dataset))
383
+
384
+ trainer.log_metrics("predict", metrics)
385
+ trainer.save_metrics("predict", metrics)
386
+
387
+ if trainer.is_world_process_zero():
388
+ if training_args.predict_with_generate:
389
+ predictions = tokenizer.batch_decode(
390
+ predict_results.predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True
391
+ )
392
+ predictions = [pred.strip() for pred in predictions]
393
+ labels = tokenizer.batch_decode(
394
+ predict_results.label_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
395
+ )
396
+ labels = [label.strip() for label in labels]
397
+ output_prediction_file = os.path.join(training_args.output_dir, "generated_predictions.txt")
398
+ with open(output_prediction_file, "w", encoding="utf-8") as writer:
399
+ for p, l in zip(predictions, labels):
400
+ res = json.dumps({"labels": l, "predict": p}, ensure_ascii=False)
401
+ writer.write(f"{res}\n")
402
+ return results
403
+
404
+
405
+ def _mp_fn(index):
406
+ # For xla_spawn (TPUs)
407
+ main()
408
+
409
+
410
+ if __name__ == "__main__":
411
+ main()
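The core of `preprocess_function_train` in `main.py` is how it builds `input_ids` and `labels`: prompt and answer tokens are concatenated, and every label position that belongs to the prompt or to padding is replaced with -100 so the loss only covers the answer and the end-of-sequence token. The sketch below isolates that step on plain integer lists. `build_training_example` is a hypothetical helper, and the token ids are arbitrary values chosen for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def build_training_example(a_ids, b_ids, eos_token_id, pad_token_id,
                           max_seq_length, ignore_pad_token_for_loss=True):
    """Concatenate prompt (a_ids) and answer (b_ids), pad, and mask labels."""
    input_ids = a_ids + b_ids + [eos_token_id]
    # Prompt positions are filled with the pad id so they are masked below.
    labels = [pad_token_id] * len(a_ids) + b_ids + [eos_token_id]

    # A negative pad_len yields an empty list, so nothing is appended.
    pad_len = max_seq_length - len(input_ids)
    input_ids = input_ids + [pad_token_id] * pad_len
    labels = labels + [pad_token_id] * pad_len

    if ignore_pad_token_for_loss:
        labels = [(t if t != pad_token_id else IGNORE_INDEX) for t in labels]
    return input_ids, labels


input_ids, labels = build_training_example(
    a_ids=[11, 12, 13],  # arbitrary prompt token ids
    b_ids=[21, 22],      # arbitrary answer token ids
    eos_token_id=2,
    pad_token_id=0,
    max_seq_length=8,
)
print(input_ids)
print(labels)
```

One consequence of this design is that any answer token equal to the pad id would also be masked, so it relies on the pad id never appearing inside real answers.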
ptuning/train.sh ADDED
@@ -0,0 +1,28 @@
1
+ PRE_SEQ_LEN=128
2
+ LR=2e-2
3
+ NUM_GPUS=1
4
+
5
+ torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS main.py \
6
+ --do_train \
7
+ --train_file AdvertiseGen/train.json \
8
+ --validation_file AdvertiseGen/dev.json \
9
+ --preprocessing_num_workers 10 \
10
+ --prompt_column content \
11
+ --response_column summary \
12
+ --overwrite_cache \
13
+ --model_name_or_path THUDM/chatglm2-6b \
14
+ --output_dir output/adgen-chatglm2-6b-pt-$PRE_SEQ_LEN-$LR \
15
+ --overwrite_output_dir \
16
+ --max_source_length 64 \
17
+ --max_target_length 128 \
18
+ --per_device_train_batch_size 1 \
19
+ --per_device_eval_batch_size 1 \
20
+ --gradient_accumulation_steps 16 \
21
+ --predict_with_generate \
22
+ --max_steps 3000 \
23
+ --logging_steps 10 \
24
+ --save_steps 1000 \
25
+ --learning_rate $LR \
26
+ --pre_seq_len $PRE_SEQ_LEN \
27
+ --quantization_bit 4
28
+
ptuning/train_chat.sh ADDED
@@ -0,0 +1,29 @@
1
+ PRE_SEQ_LEN=128
2
+ LR=1e-2
3
+ NUM_GPUS=1
4
+
5
+ torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS main.py \
6
+ --do_train \
7
+ --train_file $CHAT_TRAIN_DATA \
8
+ --validation_file $CHAT_VAL_DATA \
9
+ --preprocessing_num_workers 10 \
10
+ --prompt_column prompt \
11
+ --response_column response \
12
+ --history_column history \
13
+ --overwrite_cache \
14
+ --model_name_or_path THUDM/chatglm2-6b \
15
+ --output_dir $CHECKPOINT_NAME \
16
+ --overwrite_output_dir \
17
+ --max_source_length 256 \
18
+ --max_target_length 256 \
19
+ --per_device_train_batch_size 1 \
20
+ --per_device_eval_batch_size 1 \
21
+ --gradient_accumulation_steps 16 \
22
+ --predict_with_generate \
23
+ --max_steps 3000 \
24
+ --logging_steps 10 \
25
+ --save_steps 1000 \
26
+ --learning_rate $LR \
27
+ --pre_seq_len $PRE_SEQ_LEN \
28
+ --quantization_bit 4
29
+
ptuning/trainer.py ADDED
The diff for this file is too large to render. See raw diff
 
ptuning/trainer_seq2seq.py ADDED
@@ -0,0 +1,247 @@
1
+ # Copyright 2020 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ from typing import Any, Dict, List, Optional, Tuple, Union
16
+
17
+ import torch
18
+ from torch import nn
19
+ from torch.utils.data import Dataset
20
+
21
+ from transformers.deepspeed import is_deepspeed_zero3_enabled
22
+ from trainer import Trainer
23
+ from transformers.trainer_utils import PredictionOutput
24
+ from transformers.utils import logging
25
+
26
+
27
+ logger = logging.get_logger(__name__)
28
+
29
+
30
+ class Seq2SeqTrainer(Trainer):
31
+ def evaluate(
32
+ self,
33
+ eval_dataset: Optional[Dataset] = None,
34
+ ignore_keys: Optional[List[str]] = None,
35
+ metric_key_prefix: str = "eval",
36
+ **gen_kwargs
37
+ ) -> Dict[str, float]:
38
+ """
39
+ Runs evaluation and returns metrics.
40
+
41
+ The calling script will be responsible for providing a method to compute metrics, as they are task-dependent
42
+ (pass it to the init `compute_metrics` argument).
43
+
44
+ You can also subclass and override this method to inject custom behavior.
45
+
46
+ Args:
47
+ eval_dataset (`Dataset`, *optional*):
48
+ Pass a dataset if you wish to override `self.eval_dataset`. If it is an [`~datasets.Dataset`], columns
49
+ not accepted by the `model.forward()` method are automatically removed. It must implement the `__len__`
50
+ method.
51
+ ignore_keys (`List[str]`, *optional*):
52
+ A list of keys in the output of your model (if it is a dictionary) that should be ignored when
53
+ gathering predictions.
54
+ metric_key_prefix (`str`, *optional*, defaults to `"eval"`):
55
+ An optional prefix to be used as the metrics key prefix. For example the metrics "bleu" will be named
56
+ "eval_bleu" if the prefix is `"eval"` (default)
57
+ max_length (`int`, *optional*):
58
+ The maximum target length to use when predicting with the generate method.
59
+ num_beams (`int`, *optional*):
60
+ Number of beams for beam search that will be used when predicting with the generate method. 1 means no
61
+ beam search.
62
+ gen_kwargs:
63
+ Additional `generate` specific kwargs.
64
+
65
+ Returns:
66
+ A dictionary containing the evaluation loss and the potential metrics computed from the predictions. The
67
+ dictionary also contains the epoch number which comes from the training state.
68
+ """
69
+
70
+ gen_kwargs = gen_kwargs.copy()
71
+ if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None:
72
+ gen_kwargs["max_length"] = self.args.generation_max_length
73
+ gen_kwargs["num_beams"] = (
74
+ gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.args.generation_num_beams
75
+ )
76
+ self._gen_kwargs = gen_kwargs
77
+
78
+ return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
79
+
80
+ def predict(
81
+ self,
82
+ test_dataset: Dataset,
83
+ ignore_keys: Optional[List[str]] = None,
84
+ metric_key_prefix: str = "test",
85
+ **gen_kwargs
86
+ ) -> PredictionOutput:
87
+ """
88
+ Runs prediction and returns predictions and potential metrics.
89
+
90
+ Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method
91
+ will also return metrics, like in `evaluate()`.
92
+
93
+ Args:
94
+ test_dataset (`Dataset`):
95
+ Dataset to run the predictions on. If it is a [`~datasets.Dataset`], columns not accepted by the
96
+ `model.forward()` method are automatically removed. It must implement the `__len__` method.
97
+ ignore_keys (`List[str]`, *optional*):
98
+ A list of keys in the output of your model (if it is a dictionary) that should be ignored when
99
+ gathering predictions.
100
+ metric_key_prefix (`str`, *optional*, defaults to `"test"`):
101
+ An optional prefix to be used as the metrics key prefix. For example, the metric "bleu" will be named
102
+ "test_bleu" if the prefix is `"test"` (default).
103
+ max_length (`int`, *optional*):
104
+ The maximum target length to use when predicting with the generate method.
105
+ num_beams (`int`, *optional*):
106
+ Number of beams for beam search that will be used when predicting with the generate method. 1 means no
107
+ beam search.
108
+ gen_kwargs:
109
+ Additional `generate` specific kwargs.
110
+
111
+ <Tip>
112
+
113
+ If your predictions or labels have different sequence lengths (for instance because you're doing dynamic
114
+ padding in a token classification task) the predictions will be padded (on the right) to allow for
115
+ concatenation into one array. The padding index is -100.
116
+
117
+ </Tip>
118
+
119
+ Returns: *NamedTuple* A namedtuple with the following keys:
120
+
121
+ - predictions (`np.ndarray`): The predictions on `test_dataset`.
122
+ - label_ids (`np.ndarray`, *optional*): The labels (if the dataset contained some).
123
+ - metrics (`Dict[str, float]`, *optional*): The potential dictionary of metrics (if the dataset contained
124
+ labels).
125
+ """
126
+
127
+ gen_kwargs = gen_kwargs.copy()
128
+ if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None:
129
+ gen_kwargs["max_length"] = self.args.generation_max_length
130
+ gen_kwargs["num_beams"] = (
131
+ gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.args.generation_num_beams
132
+ )
133
+ self._gen_kwargs = gen_kwargs
134
+
135
+
136
+ return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
137
+
138
+ def prediction_step(
139
+ self,
140
+ model: nn.Module,
141
+ inputs: Dict[str, Union[torch.Tensor, Any]],
142
+ prediction_loss_only: bool,
143
+ ignore_keys: Optional[List[str]] = None,
144
+ ) -> Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]:
145
+ """
146
+ Perform an evaluation step on `model` using `inputs`.
147
+
148
+ Subclass and override to inject custom behavior.
149
+
150
+ Args:
151
+ model (`nn.Module`):
152
+ The model to evaluate.
153
+ inputs (`Dict[str, Union[torch.Tensor, Any]]`):
154
+ The inputs and targets of the model.
155
+
156
+ The dictionary will be unpacked before being fed to the model. Most models expect the targets under the
157
+ argument `labels`. Check your model's documentation for all accepted arguments.
158
+ prediction_loss_only (`bool`):
159
+ Whether or not to return the loss only.
160
+
161
+ Return:
162
+ Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]: A tuple with the loss, logits and
163
+ labels (each being optional).
164
+ """
165
+
166
+ if not self.args.predict_with_generate or prediction_loss_only:
167
+ return super().prediction_step(
168
+ model, inputs, prediction_loss_only=prediction_loss_only, ignore_keys=ignore_keys
169
+ )
170
+
171
+ has_labels = "labels" in inputs
172
+ inputs = self._prepare_inputs(inputs)
173
+
174
+ # XXX: adapt synced_gpus for fairscale as well
175
+ gen_kwargs = self._gen_kwargs.copy()
176
+ if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None:
177
+ gen_kwargs["max_length"] = self.model.config.max_length
178
+ gen_kwargs["num_beams"] = (
179
+ gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.model.config.num_beams
180
+ )
181
+ default_synced_gpus = True if is_deepspeed_zero3_enabled() else False
182
+ gen_kwargs["synced_gpus"] = (
183
+ gen_kwargs["synced_gpus"] if gen_kwargs.get("synced_gpus") is not None else default_synced_gpus
184
+ )
185
+
186
+ if "attention_mask" in inputs:
187
+ gen_kwargs["attention_mask"] = inputs.get("attention_mask", None)
188
+ if "position_ids" in inputs:
189
+ gen_kwargs["position_ids"] = inputs.get("position_ids", None)
190
+ if "global_attention_mask" in inputs:
191
+ gen_kwargs["global_attention_mask"] = inputs.get("global_attention_mask", None)
192
+
193
+ # prepare generation inputs
194
+ # some encoder-decoder models can have varying encoders and thus
195
+ # varying model input names
196
+ if hasattr(self.model, "encoder") and self.model.encoder.main_input_name != self.model.main_input_name:
197
+ generation_inputs = inputs[self.model.encoder.main_input_name]
198
+ else:
199
+ generation_inputs = inputs[self.model.main_input_name]
200
+
201
+ gen_kwargs["input_ids"] = generation_inputs
202
+ generated_tokens = self.model.generate(**gen_kwargs)
203
+ generated_tokens = generated_tokens[:, generation_inputs.size()[-1]:]
204
+
205
+ # in case the batch is shorter than max length, the output should be padded
206
+ if gen_kwargs.get("max_length") is not None and generated_tokens.shape[-1] < gen_kwargs["max_length"]:
207
+ generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_length"])
208
+ elif gen_kwargs.get("max_new_tokens") is not None and generated_tokens.shape[-1] < (
209
+ gen_kwargs["max_new_tokens"] + 1
210
+ ):
211
+ generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_new_tokens"] + 1)
212
+
213
+ loss = None
214
+
215
+ if self.args.prediction_loss_only:
216
+ return (loss, None, None)
217
+
218
+ if has_labels:
219
+ labels = inputs["labels"]
220
+ if gen_kwargs.get("max_length") is not None and labels.shape[-1] < gen_kwargs["max_length"]:
221
+ labels = self._pad_tensors_to_max_len(labels, gen_kwargs["max_length"])
222
+ elif gen_kwargs.get("max_new_tokens") is not None and labels.shape[-1] < (
223
+ gen_kwargs["max_new_tokens"] + 1
224
+ ):
225
+ labels = self._pad_tensors_to_max_len(labels, (gen_kwargs["max_new_tokens"] + 1))
226
+ else:
227
+ labels = None
228
+
229
+ return (loss, generated_tokens, labels)
230
+
231
+ def _pad_tensors_to_max_len(self, tensor, max_length):
232
+ if self.tokenizer is not None and hasattr(self.tokenizer, "pad_token_id"):
233
+ # If PAD token is not defined at least EOS token has to be defined
234
+ pad_token_id = (
235
+ self.tokenizer.pad_token_id if self.tokenizer.pad_token_id is not None else self.tokenizer.eos_token_id
236
+ )
237
+ else:
238
+ if self.model.config.pad_token_id is not None:
239
+ pad_token_id = self.model.config.pad_token_id
240
+ else:
241
+ raise ValueError("Pad_token_id must be set in the configuration of the model, in order to pad tensors")
242
+
243
+ padded_tensor = pad_token_id * torch.ones(
244
+ (tensor.shape[0], max_length), dtype=tensor.dtype, device=tensor.device
245
+ )
246
+ padded_tensor[:, : tensor.shape[-1]] = tensor
247
+ return padded_tensor
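The `_pad_tensors_to_max_len` helper above right-pads generated sequences to a fixed width so that batches of different lengths can be concatenated. A minimal list-based sketch of the same idea follows, assuming plain Python lists stand in for tensors; the function name `pad_rows_to_max_len` and the sample token ids are hypothetical and not part of the commit.

```python
from typing import List


def pad_rows_to_max_len(rows: List[List[int]], max_length: int, pad_token_id: int) -> List[List[int]]:
    """Right-pad every row with pad_token_id up to max_length (illustrative stand-in)."""
    padded = []
    for row in rows:
        if len(row) > max_length:
            raise ValueError("row is longer than max_length")
        # Keep the original tokens on the left and fill the rest with the pad id.
        padded.append(row + [pad_token_id] * (max_length - len(row)))
    return padded


batch = [[5, 6, 7], [8, 9, 10]]
padded_batch = pad_rows_to_max_len(batch, max_length=5, pad_token_id=0)
print(padded_batch)
```

The trainer code only calls its helper when the sequence is shorter than the target length, so the explicit length check here is an extra guard added for the sketch.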
ptuning/web_demo.py ADDED
@@ -0,0 +1,167 @@
1
+ import os, sys
2
+
3
+ import gradio as gr
4
+ import mdtex2html
5
+
6
+ import torch
7
+ import transformers
8
+ from transformers import (
9
+ AutoConfig,
10
+ AutoModel,
11
+ AutoTokenizer,
13
+ DataCollatorForSeq2Seq,
14
+ HfArgumentParser,
15
+ Seq2SeqTrainingArguments,
16
+ set_seed,
17
+ )
18
+
19
+ from arguments import ModelArguments, DataTrainingArguments
20
+
21
+
22
+ model = None
23
+ tokenizer = None
24
+
25
+ """Override Chatbot.postprocess"""
26
+
27
+
28
+ def postprocess(self, y):
29
+ if y is None:
30
+ return []
31
+ for i, (message, response) in enumerate(y):
32
+ y[i] = (
33
+ None if message is None else mdtex2html.convert((message)),
34
+ None if response is None else mdtex2html.convert(response),
35
+ )
36
+ return y
37
+
38
+
39
+ gr.Chatbot.postprocess = postprocess
40
+
41
+
42
+ def parse_text(text):
43
+ """copy from https://github.com/GaiZhenbiao/ChuanhuChatGPT/"""
44
+ lines = text.split("\n")
45
+ lines = [line for line in lines if line != ""]
46
+ count = 0
47
+ for i, line in enumerate(lines):
48
+ if "```" in line:
49
+ count += 1
50
+ items = line.split('`')
51
+ if count % 2 == 1:
52
+ lines[i] = f'<pre><code class="language-{items[-1]}">'
53
+ else:
54
+ lines[i] = '<br></code></pre>'
55
+ else:
56
+ if i > 0:
57
+ if count % 2 == 1:
58
+ line = line.replace("`", "\\`")
59
+ line = line.replace("<", "&lt;")
60
+ line = line.replace(">", "&gt;")
61
+ line = line.replace(" ", "&nbsp;")
62
+ line = line.replace("*", "&ast;")
63
+ line = line.replace("_", "&lowbar;")
64
+ line = line.replace("-", "&#45;")
65
+ line = line.replace(".", "&#46;")
66
+ line = line.replace("!", "&#33;")
67
+ line = line.replace("(", "&#40;")
68
+ line = line.replace(")", "&#41;")
69
+ line = line.replace("$", "&#36;")
70
+ lines[i] = "<br>"+line
71
+ text = "".join(lines)
72
+ return text
73
+
74
+
75
+ def predict(input, chatbot, max_length, top_p, temperature, history, past_key_values):
76
+ chatbot.append((parse_text(input), ""))
77
+ for response, history, past_key_values in model.stream_chat(tokenizer, input, history, past_key_values=past_key_values,
78
+ return_past_key_values=True,
79
+ max_length=max_length, top_p=top_p,
80
+ temperature=temperature):
81
+ chatbot[-1] = (parse_text(input), parse_text(response))
82
+
83
+ yield chatbot, history, past_key_values
84
+
85
+
86
+ def reset_user_input():
87
+ return gr.update(value='')
88
+
89
+
90
+ def reset_state():
91
+ return [], [], None
92
+
93
+
94
+ with gr.Blocks() as demo:
95
+ gr.HTML("""<h1 align="center">ChatGLM2-6B</h1>""")
96
+
97
+ chatbot = gr.Chatbot()
98
+ with gr.Row():
99
+ with gr.Column(scale=4):
100
+ with gr.Column(scale=12):
101
+ user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=10).style(
102
+ container=False)
103
+ with gr.Column(min_width=32, scale=1):
104
+ submitBtn = gr.Button("Submit", variant="primary")
105
+ with gr.Column(scale=1):
106
+ emptyBtn = gr.Button("Clear History")
107
+ max_length = gr.Slider(0, 32768, value=8192, step=1.0, label="Maximum length", interactive=True)
108
+ top_p = gr.Slider(0, 1, value=0.8, step=0.01, label="Top P", interactive=True)
109
+ temperature = gr.Slider(0, 1, value=0.95, step=0.01, label="Temperature", interactive=True)
110
+
111
+ history = gr.State([])
112
+ past_key_values = gr.State(None)
113
+
114
+ submitBtn.click(predict, [user_input, chatbot, max_length, top_p, temperature, history, past_key_values],
115
+ [chatbot, history, past_key_values], show_progress=True)
116
+ submitBtn.click(reset_user_input, [], [user_input])
117
+
118
+ emptyBtn.click(reset_state, outputs=[chatbot, history, past_key_values], show_progress=True)
119
+
120
+
121
+ def main():
122
+ global model, tokenizer
123
+
124
+ parser = HfArgumentParser((
125
+ ModelArguments))
126
+ if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
127
+ # If we pass only one argument to the script and it's the path to a json file,
128
+ # let's parse it to get our arguments.
129
+ model_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))[0]
130
+ else:
131
+ model_args = parser.parse_args_into_dataclasses()[0]
132
+
133
+ tokenizer = AutoTokenizer.from_pretrained(
134
+ model_args.model_name_or_path, trust_remote_code=True)
135
+ config = AutoConfig.from_pretrained(
136
+ model_args.model_name_or_path, trust_remote_code=True)
137
+
138
+ config.pre_seq_len = model_args.pre_seq_len
139
+ config.prefix_projection = model_args.prefix_projection
140
+
141
+ if model_args.ptuning_checkpoint is not None:
142
+ print(f"Loading prefix_encoder weight from {model_args.ptuning_checkpoint}")
143
+ model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True)
144
+ prefix_state_dict = torch.load(os.path.join(model_args.ptuning_checkpoint, "pytorch_model.bin"))
145
+ new_prefix_state_dict = {}
146
+ for k, v in prefix_state_dict.items():
147
+ if k.startswith("transformer.prefix_encoder."):
148
+ new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
149
+ model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
150
+ else:
151
+ model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True)
152
+
153
+ if model_args.quantization_bit is not None:
154
+ print(f"Quantized to {model_args.quantization_bit} bit")
155
+ model = model.quantize(model_args.quantization_bit)
156
+ model = model.cuda()
157
+ if model_args.pre_seq_len is not None:
158
+ # P-tuning v2
159
+ model.transformer.prefix_encoder.float()
160
+
161
+ model = model.eval()
162
+ demo.queue().launch(share=False, inbrowser=True)
163
+
164
+
165
+
166
+ if __name__ == "__main__":
167
+ main()
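The checkpoint-loading branch in `main()` keeps only the weights saved under `transformer.prefix_encoder.` and strips that prefix before calling `load_state_dict`. The filtering step can be sketched on a plain dictionary, assuming string placeholders instead of real tensors; the helper name `extract_prefix_encoder_state` and the sample keys are hypothetical.

```python
from typing import Any, Dict

PREFIX = "transformer.prefix_encoder."


def extract_prefix_encoder_state(state_dict: Dict[str, Any], prefix: str = PREFIX) -> Dict[str, Any]:
    """Keep only keys under `prefix` and strip the prefix from each kept key."""
    return {key[len(prefix):]: value for key, value in state_dict.items() if key.startswith(prefix)}


# Hypothetical checkpoint contents; real values would be tensors.
checkpoint = {
    "transformer.prefix_encoder.embedding.weight": "prefix-weights",
    "transformer.output_layer.weight": "ignored",
}
prefix_state = extract_prefix_encoder_state(checkpoint)
print(prefix_state)
```

Stripping the prefix matters because the state dict is loaded into the `prefix_encoder` submodule directly, whose parameter names do not include the parent path.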
ptuning/web_demo.sh ADDED
@@ -0,0 +1,7 @@
1
+ PRE_SEQ_LEN=128
2
+
3
+ CUDA_VISIBLE_DEVICES=0 python3 web_demo.py \
4
+ --model_name_or_path THUDM/chatglm2-6b \
5
+ --ptuning_checkpoint output/adgen-chatglm2-6b-pt-128-2e-2/checkpoint-3000 \
6
+ --pre_seq_len $PRE_SEQ_LEN
7
+
requirements.txt ADDED
@@ -0,0 +1,9 @@
1
+ protobuf
2
+ transformers==4.30.2
3
+ cpm_kernels
4
+ torch>=2.0
5
+ gradio
6
+ mdtex2html
7
+ sentencepiece
8
+ accelerate
9
+ sse-starlette
resources/WECHAT.md ADDED
@@ -0,0 +1,7 @@
1
+ <div align="center">
2
+ <img src=wechat.jpg width="60%"/>
3
+
4
+ <p> 扫码关注公众号,加入「ChatGLM交流群」 </p>
5
+ <p> Scan the QR code to follow the official account and join the "ChatGLM Discussion Group" </p>
6
+ </div>
7
+
resources/cli-demo.png ADDED
resources/knowledge.png ADDED
resources/long-context.png ADDED

Git LFS Details

  • SHA256: 9df24161083739a775aa47abeb53a95ab066ad498192d061b0a4941fcc74f35c
  • Pointer size: 132 Bytes
  • Size of remote file: 1.11 MB
resources/math.png ADDED
resources/web-demo.gif ADDED

Git LFS Details

  • SHA256: ba8ff042bbd879cbb4dd3795081b2e4e3713d3a4d2d5d7d61a027c389324cbbc
  • Pointer size: 132 Bytes
  • Size of remote file: 2.28 MB
resources/web-demo.png ADDED
resources/wechat.jpg ADDED
utils.py ADDED
@@ -0,0 +1,59 @@
1
+ import os
2
+ from typing import Dict, Tuple, Union, Optional
3
+
4
+ from torch.nn import Module
5
+ from transformers import AutoModel
6
+
7
+
8
+ def auto_configure_device_map(num_gpus: int) -> Dict[str, int]:
9
+ # transformer.word_embeddings takes up 1 layer
10
+ # transformer.final_layernorm and lm_head take up 1 layer
11
+ # transformer.layers takes up 28 layers
12
+ # In total, 30 layers are distributed across num_gpus GPUs
13
+ num_trans_layers = 28
14
+ per_gpu_layers = 30 / num_gpus
15
+
16
+ # bugfix: on Linux, the weight and input passed to torch.embedding end up on different devices, causing a RuntimeError
17
+ # On Windows, model.device is set to transformer.word_embeddings.device
18
+ # On Linux, model.device is set to lm_head.device
19
+ # When chat or stream_chat is called, input_ids is placed on model.device
20
+ # If transformer.word_embeddings.device differs from model.device, a RuntimeError is raised
21
+ # So transformer.word_embeddings, transformer.final_layernorm and lm_head are all placed on the first GPU
22
+ # This file is adapted from https://github.com/THUDM/ChatGLM-6B/blob/main/utils.py
23
+ # with only minor changes here to support ChatGLM2
24
+ device_map = {
25
+ 'transformer.embedding.word_embeddings': 0,
26
+ 'transformer.encoder.final_layernorm': 0,
27
+ 'transformer.output_layer': 0,
28
+ 'transformer.rotary_pos_emb': 0,
29
+ 'lm_head': 0
30
+ }
31
+
32
+ used = 2
33
+ gpu_target = 0
34
+ for i in range(num_trans_layers):
35
+ if used >= per_gpu_layers:
36
+ gpu_target += 1
37
+ used = 0
38
+ assert gpu_target < num_gpus
39
+ device_map[f'transformer.encoder.layers.{i}'] = gpu_target
40
+ used += 1
41
+
42
+ return device_map
43
+
44
+
45
+ def load_model_on_gpus(checkpoint_path: Union[str, os.PathLike], num_gpus: int = 2,
46
+ device_map: Optional[Dict[str, int]] = None, **kwargs) -> Module:
47
+ if num_gpus < 2 and device_map is None:
48
+ model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half().cuda()
49
+ else:
50
+ from accelerate import dispatch_model
51
+
52
+ model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half()
53
+
54
+ if device_map is None:
55
+ device_map = auto_configure_device_map(num_gpus)
56
+
57
+ model = dispatch_model(model, device_map=device_map)
58
+
59
+ return model
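To see how `auto_configure_device_map` spreads the 28 transformer layers, the assignment loop can be re-run without any GPU or model. The sketch below mirrors that loop under the assumption that only the layer-to-GPU mapping is of interest; the function name `layer_device_map` is hypothetical and the fixed first-GPU entries (embeddings, final layer norm, output layer) are represented only by the initial `used = 2`.

```python
from collections import Counter
from typing import Dict


def layer_device_map(num_gpus: int, num_trans_layers: int = 28) -> Dict[int, int]:
    """Map each transformer layer index to a GPU index, mirroring the loop above."""
    per_gpu_layers = (num_trans_layers + 2) / num_gpus
    used = 2  # the embedding and output slots are counted against GPU 0
    gpu_target = 0
    mapping = {}
    for i in range(num_trans_layers):
        if used >= per_gpu_layers:
            gpu_target += 1
            used = 0
        assert gpu_target < num_gpus
        mapping[i] = gpu_target
        used += 1
    return mapping


counts_two = Counter(layer_device_map(2).values())
counts_four = Counter(layer_device_map(4).values())
print(dict(counts_two))   # layers per GPU with 2 GPUs
print(dict(counts_four))  # layers per GPU with 4 GPUs
```

With two GPUs the first card receives 13 transformer layers on top of the embedding and output modules, and the second receives the remaining 15, which is why the first GPU is given fewer layers than an even split would suggest.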
web_demo.py ADDED
@@ -0,0 +1,110 @@
1
+ from transformers import AutoModel, AutoTokenizer
2
+ import gradio as gr
3
+ import mdtex2html
4
+ from utils import load_model_on_gpus
5
+
6
+ tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
7
+ model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
8
+ # Multi-GPU support: use the two lines below instead of the line above, and set num_gpus to your actual number of GPUs
9
+ # from utils import load_model_on_gpus
10
+ # model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
11
+ model = model.eval()
12
+
13
+ """Override Chatbot.postprocess"""
14
+
15
+
16
+ def postprocess(self, y):
17
+ if y is None:
18
+ return []
19
+ for i, (message, response) in enumerate(y):
20
+ y[i] = (
21
+ None if message is None else mdtex2html.convert((message)),
22
+ None if response is None else mdtex2html.convert(response),
23
+ )
24
+ return y
25
+
26
+
27
+ gr.Chatbot.postprocess = postprocess
28
+
29
+
30
+ def parse_text(text):
31
+ """copy from https://github.com/GaiZhenbiao/ChuanhuChatGPT/"""
32
+ lines = text.split("\n")
33
+ lines = [line for line in lines if line != ""]
34
+ count = 0
35
+ for i, line in enumerate(lines):
36
+ if "```" in line:
37
+ count += 1
38
+ items = line.split('`')
39
+ if count % 2 == 1:
40
+ lines[i] = f'<pre><code class="language-{items[-1]}">'
41
+ else:
42
+ lines[i] = '<br></code></pre>'
43
+ else:
44
+ if i > 0:
45
+ if count % 2 == 1:
46
+ line = line.replace("`", "\\`")
47
+ line = line.replace("<", "&lt;")
48
+ line = line.replace(">", "&gt;")
49
+ line = line.replace(" ", "&nbsp;")
50
+ line = line.replace("*", "&ast;")
51
+ line = line.replace("_", "&lowbar;")
52
+ line = line.replace("-", "&#45;")
53
+ line = line.replace(".", "&#46;")
54
+ line = line.replace("!", "&#33;")
55
+ line = line.replace("(", "&#40;")
56
+ line = line.replace(")", "&#41;")
57
+ line = line.replace("$", "&#36;")
58
+ lines[i] = "<br>"+line
59
+ text = "".join(lines)
60
+ return text
61
+
62
+
63
+ def predict(input, chatbot, max_length, top_p, temperature, history, past_key_values):
64
+ chatbot.append((parse_text(input), ""))
65
+ for response, history, past_key_values in model.stream_chat(tokenizer, input, history, past_key_values=past_key_values,
66
+ return_past_key_values=True,
67
+ max_length=max_length, top_p=top_p,
68
+ temperature=temperature):
69
+ chatbot[-1] = (parse_text(input), parse_text(response))
70
+
71
+ yield chatbot, history, past_key_values
72
+
73
+
74
+ def reset_user_input():
75
+ return gr.update(value='')
76
+
77
+
78
+ def reset_state():
79
+ return [], [], None
80
+
81
+
82
+ with gr.Blocks() as demo:
83
+ gr.HTML("""<h1 align="center">ChatGLM2-6B</h1>""")
84
+
85
+ chatbot = gr.Chatbot()
86
+ with gr.Row():
87
+ with gr.Column(scale=4):
88
+ with gr.Column(scale=12):
89
+ user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=10).style(
90
+ container=False)
91
+ with gr.Column(min_width=32, scale=1):
92
+ submitBtn = gr.Button("Submit", variant="primary")
93
+ with gr.Column(scale=1):
94
+ emptyBtn = gr.Button("Clear History")
95
+ max_length = gr.Slider(0, 32768, value=8192, step=1.0, label="Maximum length", interactive=True)
96
+ top_p = gr.Slider(0, 1, value=0.8, step=0.01, label="Top P", interactive=True)
97
+ temperature = gr.Slider(0, 1, value=0.95, step=0.01, label="Temperature", interactive=True)
98
+
99
+ history = gr.State([])
100
+ past_key_values = gr.State(None)
101
+
102
+ submitBtn.click(predict, [user_input, chatbot, max_length, top_p, temperature, history, past_key_values],
103
+ [chatbot, history, past_key_values], show_progress=True)
104
+ submitBtn.click(reset_user_input, [], [user_input])
105
+
106
+ emptyBtn.click(reset_state, outputs=[chatbot, history, past_key_values], show_progress=True)
107
+
108
+ #demo.queue().launch(share=False, inbrowser=True)
109
+ demo.queue().launch(share=True, inbrowser=True)
110
+
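The `predict` callback above is a generator: it appends a placeholder turn to the chat display, then overwrites that turn each time the model streams a longer partial response. The pattern can be sketched without Gradio or a model, assuming a hypothetical `fake_stream_chat` stand-in that echoes the query one character at a time; the real `model.stream_chat` also takes a tokenizer, sampling parameters, and `past_key_values`, which are omitted here.

```python
from typing import Iterator, List, Tuple

History = List[Tuple[str, str]]


def fake_stream_chat(query: str, history: History) -> Iterator[Tuple[str, History]]:
    """Hypothetical stand-in for model.stream_chat: yields a growing partial response."""
    reply = "echo: " + query
    for end in range(1, len(reply) + 1):
        partial = reply[:end]
        yield partial, history + [(query, partial)]


def predict(query: str, chatbot: History, history: History) -> Iterator[Tuple[History, History]]:
    """Append a placeholder turn, then overwrite it as each partial response arrives."""
    chatbot.append((query, ""))
    for response, history in fake_stream_chat(query, history):
        chatbot[-1] = (query, response)
        yield chatbot, history


final_chatbot, final_history = [], []
for final_chatbot, final_history in predict("hi", [], []):
    pass
print(final_chatbot, final_history)
```

Each yielded pair corresponds to one UI refresh, so the user sees the reply grow while generation is still running.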
web_demo2.py ADDED
@@ -0,0 +1,75 @@
1
+ from transformers import AutoModel, AutoTokenizer
2
+ import streamlit as st
3
+ from streamlit_chat import message
4
+
5
+
6
+ st.set_page_config(
7
+ page_title="ChatGLM2-6b Demo",
8
+ page_icon=":robot:",
9
+ layout='wide'
10
+ )
11
+
12
+
13
+ @st.cache_resource
14
+ def get_model():
15
+ tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
16
+ model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
17
+ # Multi-GPU support: use the two lines below instead of the line above, and set num_gpus to your actual number of GPUs
18
+ # from utils import load_model_on_gpus
19
+ # model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
20
+ model = model.eval()
21
+ return tokenizer, model
22
+
23
+
24
+ MAX_TURNS = 20
25
+ MAX_BOXES = MAX_TURNS * 2
26
+
27
+
28
+ def predict(input, max_length, top_p, temperature, history=None):
29
+ tokenizer, model = get_model()
30
+ if history is None:
31
+ history = []
32
+
33
+ with container:
34
+ if len(history) > 0:
35
+ if len(history) > MAX_BOXES:
36
+ history = history[-MAX_TURNS:]
37
+ for i, (query, response) in enumerate(history):
38
+ message(query, avatar_style="big-smile", key=str(i) + "_user")
39
+ message(response, avatar_style="bottts", key=str(i))
40
+
41
+ message(input, avatar_style="big-smile", key=str(len(history)) + "_user")
42
+ st.write("AI is replying:")
43
+ with st.empty():
44
+ for response, history in model.stream_chat(tokenizer, input, history, max_length=max_length, top_p=top_p,
45
+ temperature=temperature):
46
+ query, response = history[-1]
47
+ st.write(response)
48
+
49
+ return history
50
+
51
+
52
+ container = st.container()
53
+
54
+ # create a prompt text for the text generation
55
+ prompt_text = st.text_area(label="User command input",
56
+ height=100,
57
+ placeholder="Please enter your command here")
58
+
59
+ max_length = st.sidebar.slider(
60
+ 'max_length', 0, 32768, 8192, step=1
61
+ )
62
+ top_p = st.sidebar.slider(
63
+ 'top_p', 0.0, 1.0, 0.8, step=0.01
64
+ )
65
+ temperature = st.sidebar.slider(
66
+ 'temperature', 0.0, 1.0, 0.95, step=0.01
67
+ )
68
+
69
+ if 'state' not in st.session_state:
70
+ st.session_state['state'] = []
71
+
72
+ if st.button("Send", key="predict"):
73
+ with st.spinner("AI is thinking, please wait..."):
74
+ # text generation
75
+ st.session_state["state"] = predict(prompt_text, max_length, top_p, temperature, st.session_state["state"])