ChatGLM2-6B

foghuang committed on Jul 1, 2023

Commit

1cf9214

•

1 Parent(s): 950b41c

Upload folder using huggingface_hub

Files changed (25)

.gitattributes +2 -0
.github/ISSUE_TEMPLATE/bug_report.yaml +63 -0
.github/ISSUE_TEMPLATE/config.yml +1 -0
.github/ISSUE_TEMPLATE/feature_request.yml +26 -0
FAQ.md +15 -0
MODEL_LICENSE +65 -0
README.md +336 -8
README_EN.md +256 -0
api.py +60 -0
cli_demo.py +60 -0
evaluation/README.md +10 -0
evaluation/evaluate_ceval.py +60 -0
openai_api.py +174 -0
requirements.txt +8 -0
resources/WECHAT.md +7 -0
resources/cli-demo.png +0 -0
resources/knowledge.png +0 -0
resources/long-context.png +3 -0
resources/math.png +0 -0
resources/web-demo.gif +3 -0
resources/web-demo.png +0 -0
resources/wechat.jpg +0 -0
utils.py +59 -0
web_demo.py +108 -0
web_demo2.py +75 -0

.gitattributes CHANGED

@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+resources/long-context.png filter=lfs diff=lfs merge=lfs -text
+resources/web-demo.gif filter=lfs diff=lfs merge=lfs -text

.github/ISSUE_TEMPLATE/bug_report.yaml ADDED

@@ -0,0 +1,63 @@

+name: 🐞 Bug/Help
+description: File a bug/issue
+title: "[BUG/Help] <title>"
+labels: []
+body:
+- type: checkboxes
+  attributes:
+    label: Is there an existing issue for this?
+    description: Please search to see if an issue already exists for the bug you encountered.
+    options:
+    - label: I have searched the existing issues
+      required: true
+- type: textarea
+  attributes:
+    label: Current Behavior
+    description: |
+      A concise description of what you're experiencing, with screenshot attached if possible.
+      Tip: You can attach images or log files by clicking this area to highlight it and then dragging files in.
+  validations:
+    required: true
+- type: textarea
+  attributes:
+    label: Expected Behavior
+    description: A concise description of what you expected to happen.
+  validations:
+    required: false
+- type: textarea
+  attributes:
+    label: Steps To Reproduce
+    description: Steps to reproduce the behavior.
+    placeholder: |
+      1. In this environment...
+      2. With this config...
+      3. Run '...'
+      4. See error...
+  validations:
+    required: true
+- type: textarea
+  attributes:
+    label: Environment
+    description: |
+      examples:
+        - **OS**: Ubuntu 20.04
+        - **Python**: 3.8
+        - **Transformers**: 4.26.1
+        - **PyTorch**: 1.12
+        - **CUDA Support**: True
+    value: |
+        - OS:
+        - Python:
+        - Transformers:
+        - PyTorch:
+        - CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :
+    render: markdown
+  validations:
+    required: true
+- type: textarea
+  attributes:
+    label: Anything else?
+    description: |
+      Links? References? Anything that will give us more context about the issue you are encountering!
+  validations:
+    required: false
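The environment fields requested by this template could be collected with a small script instead of by hand. The following is a minimal sketch, not part of the commit: the function names `collect_environment` and `format_environment` are hypothetical, and it reports "not installed" rather than failing when `transformers` or `torch` is missing.

```python
import importlib
import platform


def collect_environment():
    """Gather the fields that the bug report template asks for."""
    env = {
        "OS": platform.platform(),
        "Python": platform.python_version(),
    }
    # Both packages are optional for this script: a missing or broken
    # install is reported instead of raising.
    for label, module_name in (("Transformers", "transformers"), ("PyTorch", "torch")):
        try:
            module = importlib.import_module(module_name)
            env[label] = getattr(module, "__version__", "unknown")
        except Exception:
            env[label] = "not installed"
    try:
        import torch

        env["CUDA Support"] = str(torch.cuda.is_available())
    except Exception:
        env["CUDA Support"] = "unknown (torch not available)"
    return env


def format_environment(env):
    """Render the fields as the markdown list used in the template."""
    return "\n".join(f"- {key}: {value}" for key, value in env.items())


if __name__ == "__main__":
    print(format_environment(collect_environment()))
```

The printed list can be pasted directly into the "Environment" field of the issue form.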

.github/ISSUE_TEMPLATE/config.yml ADDED

@@ -0,0 +1 @@

+blank_issues_enabled: false

.github/ISSUE_TEMPLATE/feature_request.yml ADDED

@@ -0,0 +1,26 @@

+name: Feature request
+description: Suggest an idea for this project
+title: "[Feature] <title>"
+labels: []
+body:
+- type: textarea
+  attributes:
+    label: Is your feature request related to a problem? Please describe.
+    description: |
+      A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+  validations:
+    required: false
+- type: textarea
+  attributes:
+    label: Solutions
+    description: |
+      Describe the solution you'd like
+      A clear and concise description of what you want to happen.
+  validations:
+    required: true
+- type: textarea
+  attributes:
+    label: Additional context
+    description: Add any other context or screenshots about the feature request here.
+  validations:
+    required: false

FAQ.md ADDED

@@ -0,0 +1,15 @@

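+## Q1
+**Loading the quantized model directly on a Mac fails with `clang: error: unsupported option '-fopenmp'`**
+This happens because macOS does not ship with OpenMP. The model still runs, but only on a single core. Install the OpenMP dependency separately to use OMP on a Mac:
+```bash
+# See https://mac.r-project.org/openmp/
+## Assumes gcc (clang) 14.x; for other versions, see the table provided by the R Project
+curl -O https://mac.r-project.org/openmp/openmp-14.0.6-darwin20-Release.tar.gz
+sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
+```
+This installs the following files: `/usr/local/lib/libomp.dylib`, `/usr/local/include/ompt.h`, `/usr/local/include/omp.h`, `/usr/local/include/omp-tools.h`.
+> Note: if a previous run of the `ChatGLM` project failed, it is best to clear the Hugging Face cache, which by default means `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Because this uses the `rm` command, make sure you know exactly what you are deleting.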

MODEL_LICENSE ADDED

@@ -0,0 +1,65 @@

+The ChatGLM-6B License
+一、定义
+“许可方”是指分发其软件的 ChatGLM2-6B 模型团队。
+“软件”是指根据本许可提供的 ChatGLM2-6B 模型参数。
+2. 许可授予
+根据本许可的条款和条件，许可方特此授予您非排他性、全球性、不可转让、不可再许可、可撤销、免版税的版权许可，仅用于您的非商业研究目的。
+上述版权声明和本许可声明应包含在本软件的所有副本或重要部分中。
+3.限制
+您不得出于任何商业、军事或非法目的使用、复制、修改、合并、发布、分发、复制或创建本软件的全部或部分衍生作品。
+您不得利用本软件从事任何危害国家安全和国家统一、危害社会公共利益、侵犯人身权益的行为。
+4.免责声明
+本软件“按原样”提供，不提供任何明示或暗示的保证，包括但不限于对适销性、特定用途的适用性和非侵权性的保证。 在任何情况下，作者或版权持有人均不对任何索赔、损害或其他责任负责，无论是在合同诉讼、侵权行为还是其他方面，由软件或软件的使用或其他交易引起、由软件引起或与之相关 软件。
+5. 责任限制
+除适用法律禁止的范围外，在任何情况下且根据任何法律理论，无论是基于侵权行为、疏忽、合同、责任或其他原因，任何许可方均不对您承担任何直接、间接、特殊、偶然、示范性、 或间接损害，或任何其他商业损失，即使许可人已被告知此类损害的可能性。
+6.争议解决
+本许可受中华人民共和国法律管辖并按其解释。 因本许可引起的或与本许可有关的任何争议应提交北京市海淀区人民法院。
+请注意，许可证可能会更新到更全面的版本。 有关许可和版权的任何问题，请通过 glm-130b@googlegroups.com 与我们联系。
+1. Definitions
+“Licensor” means the ChatGLM2-6B Model Team that distributes its Software.
+“Software” means the ChatGLM2-6B model parameters made available under this license.
+2. License Grant
+Subject to the terms and conditions of this License, the Licensor hereby grants to you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license to use the Software solely for your non-commercial research purposes.
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+3. Restriction
+You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any commercial, military, or illegal purposes.
+You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.
+4. Disclaimer
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+5. Limitation of Liability
+EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER BASED IN TORT, NEGLIGENCE, CONTRACT, LIABILITY, OR OTHERWISE WILL ANY LICENSOR BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES, OR ANY OTHER COMMERCIAL LOSSES, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
+6. Dispute Resolution
+This license shall be governed and construed in accordance with the laws of People’s Republic of China. Any dispute arising from or in connection with this License shall be submitted to Haidian District People's Court in Beijing.
+Note that the license is subject to update to a more comprehensive version.  For any questions related to the license and copyright, please contact us at glm-130b@googlegroups.com.

README.md CHANGED

@@ -1,13 +1,341 @@
 ---
-title: ChatGLM2 6B
-emoji: 💻
-colorFrom: red
-colorTo: green
 sdk: gradio
 sdk_version: 3.35.2
-app_file: app.py
-pinned: false
-license: unknown
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: ChatGLM2-6B
+app_file: web_demo.py
 sdk: gradio
 sdk_version: 3.35.2
 ---
+# ChatGLM2-6B
+<p align="center">
+🤗 <a href="https://huggingface.co/THUDM/chatglm2-6b" target="_blank">HF Repo</a> • 🐦 <a href="https://twitter.com/thukeg" target="_blank">Twitter</a> • 📃 <a href="https://arxiv.org/abs/2103.10360" target="_blank">[GLM@ACL 22]</a> <a href="https://github.com/THUDM/GLM" target="_blank">[GitHub]</a> • 📃 <a href="https://arxiv.org/abs/2210.02414" target="_blank">[GLM-130B@ICLR 23]</a> <a href="https://github.com/THUDM/GLM-130B" target="_blank">[GitHub]</a> <br>
+</p>
+<p align="center">
+    👋 Join our <a href="https://join.slack.com/t/chatglm/shared_invite/zt-1udqapmrr-ocT1DS_mxWe6dDY8ahRWzg" target="_blank">Slack</a> and <a href="resources/WECHAT.md" target="_blank">WeChat</a>
+</p>
+*Read this in [English](README_EN.md)*
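+## Introduction
+ChatGLM**2**-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B). It retains the smooth conversation flow and low deployment threshold of the first-generation model, while introducing the following new features:
+1. **Stronger Performance**: Drawing on the development experience of the first-generation ChatGLM model, we fully upgraded the base model of ChatGLM2-6B. ChatGLM2-6B uses the hybrid objective function of [GLM](https://github.com/THUDM/GLM) and has undergone pre-training on 1.4T bilingual (Chinese-English) tokens followed by human preference alignment training. The [evaluation results](#evaluation-results) show that, compared with the first-generation model, ChatGLM2-6B improves substantially on datasets such as MMLU (+23%), CEval (+33%), GSM8K (+571%), and BBH (+60%), and is strongly competitive among open-source models of the same size.
+2. **Longer Context**: Using the [FlashAttention](https://github.com/HazyResearch/flash-attention) technique, we extended the context length of the base model from 2K in ChatGLM-6B to 32K, and trained with a context length of 8K during the dialogue stage, which allows more rounds of dialogue. However, the current version of ChatGLM2-6B has limited understanding of single-round ultra-long documents, which we will focus on optimizing in future iterations.
+3. **More Efficient Inference**: Using the [Multi-Query Attention](http://arxiv.org/abs/1911.02150) technique, ChatGLM2-6B has faster inference and lower GPU memory usage: with the official implementation, inference is 42% faster than the first generation, and under INT4 quantization the dialogue length supported by 6G of GPU memory increases from 1K to 8K.
+4. **More Open License**: The ChatGLM2-6B weights are **fully open** to academic research, and **commercial use is also permitted** after obtaining official written permission. If you find our open-source model useful for your business, we welcome your donation towards the development of the next-generation model ChatGLM3.
+-----
+The open-source ChatGLM2-6B model is intended to advance large-model technology together with the open-source community. We earnestly ask developers and everyone else to abide by the [open-source license](MODEL_LICENSE), and not to use the open-source model, the code, or any derivatives of this open-source project for any purpose that may harm the nation or society, or for any service that has not undergone safety assessment and registration. **At present, our project team has not developed any application based on ChatGLM2-6B, including web, Android, Apple iOS, and Windows applications.**
+Although we made every effort to ensure the compliance and accuracy of the data at each stage of training, the ChatGLM2-6B model is relatively small and is affected by probabilistic randomness, so the accuracy of its output cannot be guaranteed and the model can easily be misled. **This project assumes no risk or responsibility for data security or public opinion risks caused by the open-source model and code, or for any risk arising from the model being misled, abused, disseminated, or improperly used.**
+## Evaluation Results
+We evaluated the model on a selection of typical Chinese and English datasets. Below are the results of the ChatGLM2-6B model on [MMLU](https://github.com/hendrycks/test) (English), [C-Eval](https://cevalbenchmark.com/static/leaderboard.html) (Chinese), [GSM8K](https://github.com/openai/grade-school-math) (mathematics), and [BBH](https://github.com/suzgunmirac/BIG-Bench-Hard) (English). A script for evaluating on C-Eval is provided in [evaluation](./evaluation/README.md).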
+### MMLU
+| Model | Average | STEM | Social Sciences | Humanities | Others |
+| ----- | ----- | ---- | ----- | ----- | ----- |
+| ChatGLM-6B | 40.63 | 33.89 | 44.84 | 39.02 | 45.71 |
+| ChatGLM2-6B (base) | 47.86 | 41.20 | 54.44 | 43.66 | 54.46 |
+| ChatGLM2-6B | 45.46 | 40.06 | 51.61 | 41.23 | 51.24 |
+> The chat model is evaluated with zero-shot CoT (Chain-of-Thought); the base model is evaluated with few-shot answer-only prompting.
+### C-Eval
+| Model | Average | STEM | Social Sciences | Humanities | Others |
+| ----- | ---- | ---- | ----- | ----- | ----- |
+| ChatGLM-6B | 38.9 | 33.3 | 48.3 | 41.3 | 38.0 |
+| ChatGLM2-6B (base) | 51.7 | 48.6 | 60.5 | 51.3 | 49.8 |
+| ChatGLM2-6B | 50.1 | 46.4	| 60.4 | 50.6 | 46.9 |
+> The chat model is evaluated with zero-shot CoT; the base model is evaluated with few-shot answer-only prompting.
+### GSM8K
+| Model | Accuracy | Accuracy (Chinese)* |
+| ----- | ----- | ----- |
+| ChatGLM-6B | 4.82 | 5.85 |
+| ChatGLM2-6B (base) | 32.37 | 28.95 |
+| ChatGLM2-6B | 28.05 | 20.45 |
+> All models are evaluated with few-shot CoT; the CoT prompts come from http://arxiv.org/abs/2201.11903
+>
+> \* We used a translation API to translate 500 GSM8K problems and the CoT prompts, then proofread them manually.
+### BBH
+| Model | Accuracy |
+| ----- | ----- |
+| ChatGLM-6B | 18.73 |
+| ChatGLM2-6B (base) | 33.68 |
+| ChatGLM2-6B | 30.00 |
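+> All models are evaluated with few-shot CoT; the CoT prompts come from https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/cot-prompts
+## Inference Performance
+ChatGLM2-6B uses [Multi-Query Attention](http://arxiv.org/abs/1911.02150) to speed up generation. The average speed when generating 2000 characters compares as follows:
+| Model | Inference Speed (characters/s) |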
+| ----  | -----  |
+| ChatGLM-6B  | 31.49 |
+| ChatGLM2-6B | 44.62 |
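+> Measured with the official implementation, batch size = 1, max length = 2048, bf16 precision, on an A100-SXM4-80G with PyTorch 2.0.1.
+Multi-Query Attention also reduces the GPU memory used by the KV cache during generation. In addition, ChatGLM2-6B is trained on dialogue with a causal mask, so the KV cache from earlier turns can be reused in a continuing conversation, which further reduces GPU memory usage. As a result, with INT4 quantized inference on a GPU with 6GB of memory, the first-generation ChatGLM-6B runs out of memory after generating at most 1119 characters, while ChatGLM2-6B can generate at least 8192 characters.
+| **Quantization Level** | **Minimum GPU Memory to Encode a Length of 2048** | **Minimum GPU Memory to Generate a Length of 8192** |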
+| -------------- |---------------------|---------------------|
+| FP16 / BF16 | 13.1 GB             | 12.8 GB             |
+| INT8           | 8.2 GB              | 8.1 GB              |
+| INT4           | 5.5 GB              | 5.1 GB              |
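+> ChatGLM2-6B uses `torch.nn.functional.scaled_dot_product_attention`, introduced in PyTorch 2.0, for efficient attention computation. With an older PyTorch version it falls back to a naive attention implementation, and GPU memory usage can exceed the figures in the table above.
+We also tested the impact of quantization on model performance. The results show that the impact is within an acceptable range.
+| Quantization Level | Accuracy (MMLU) | Accuracy (C-Eval dev) |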
+| ----- | ----- |-----------------------|
+| BF16 | 45.47 | 53.57                 |
+| INT4 | 43.13 | 50.30                 |
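+## ChatGLM2-6B Examples
+Compared with the first-generation model, ChatGLM2-6B has improved along several dimensions. Below are some comparison examples. More possibilities with ChatGLM2-6B are waiting for you to explore!
+<details><summary><b>Mathematics and Logic</b></summary>
+![](resources/math.png)
+</details>
+<details><summary><b>Knowledge Reasoning</b></summary>
+![](resources/knowledge.png)
+</details>
+<details><summary><b>Long Document Understanding</b></summary>
+![](resources/long-context.png)
+</details>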
+## Usage
+### Environment Setup
+First, download this repository:
+```shell
+git clone https://github.com/THUDM/ChatGLM2-6B
+cd ChatGLM2-6B
+```
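+Then install the dependencies with pip: `pip install -r requirements.txt`. The recommended `transformers` version is `4.30.2`, and `torch` 2.0 or higher is recommended for the best inference performance.
+### Calling from Code
+The following code calls the ChatGLM2-6B model to generate a conversation: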
+```python
+>>> from transformers import AutoTokenizer, AutoModel
+>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+>>> model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, device='cuda')
+>>> model = model.eval()
+>>> response, history = model.chat(tokenizer, "你好", history=[])
+>>> print(response)
+你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。
+>>> response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
+>>> print(response)
+晚上睡不着可能会让你感到焦虑或不舒服,但以下是一些可以帮助你入睡的方法:
+1. 制定规律的睡眠时间表:保持规律的睡眠时间表可以帮助你建立健康的睡眠习惯,使你更容易入睡。尽量在每天的相同时间上床,并在同一时间起床。
+2. 创造一个舒适的睡眠环境:确保睡眠环境舒适,安静,黑暗且温度适宜。可以使用舒适的床上用品,并保持房间通风。
+3. 放松身心:在睡前做些放松的活动,例如泡个热水澡,听些轻柔的音乐,阅读一些有趣的书籍等,有助于缓解紧张和焦虑,使你更容易入睡。
+4. 避免饮用含有咖啡因的饮料:咖啡因是一种刺激性物质,会影响你的睡眠质量。尽量避免在睡前饮用含有咖啡因的饮料,例如咖啡,茶和可乐。
+5. 避免在床上做与睡眠无关的事情:在床上做些与睡眠无关的事情,例如看电影,玩游戏或工作等,可能会干扰你的睡眠。
+6. 尝试呼吸技巧:深呼吸是一种放松技巧,可以帮助你缓解紧张和焦虑,使你更容易入睡。试着慢慢吸气,保持几秒钟,然后缓慢呼气。
+如果这些方法无法帮助你入睡,你可以考虑咨询医生或睡眠专家,寻求进一步的建议。
+```
+#### 从本地加载模型
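+The code above lets `transformers` download the model implementation and parameters automatically. The full model implementation is on the [Hugging Face Hub](https://huggingface.co/THUDM/chatglm2-6b). If your network connection is poor, downloading the model parameters may take a long time or even fail. In that case, download the model to a local directory first and load it from there.
+To download the model from the Hugging Face Hub, first [install Git LFS](https://docs.github.com/zh/repositories/working-with-files/managing-large-files/installing-git-large-file-storage), then run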
+```Shell
+git clone https://huggingface.co/THUDM/chatglm2-6b
+```
+If downloading the checkpoint from the Hugging Face Hub is slow, you can download only the model implementation
+```Shell
+GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/THUDM/chatglm2-6b
+```
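+Then download the model parameter files manually from [here](https://cloud.tsinghua.edu.cn/d/674208019e314311ab5c/) and place them in the local `chatglm2-6b` directory, replacing the files there.
+Once the model is downloaded, replace `THUDM/chatglm2-6b` in the code above with the path to your local `chatglm2-6b` folder to load the model locally.
+The model implementation is still changing. To pin the implementation for compatibility, add the `revision="v1.0"` argument to the `from_pretrained` call. `v1.0` is currently the latest version number; for the full list of versions, see the [Change Log](https://huggingface.co/THUDM/chatglm2-6b#change-log).
+### Web Demo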
+![web-demo](resources/web-demo.gif)
+First install Gradio with `pip install gradio`, then run [web_demo.py](web_demo.py) from the repository:
+```shell
+python web_demo.py
+```
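+The program starts a web server and prints its address. Open that address in a browser to use the demo.
+> By default the demo starts with `share=False` and does not create a public link. If you need public access, change this to `share=True`.
+>
+Thanks to [@AdamBear](https://github.com/AdamBear) for implementing the Streamlit-based web demo `web_demo2.py`. To use it, first install the following additional dependencies: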
+```shell
+pip install streamlit streamlit-chat
+```
+Then run it with the following command:
+```shell
+streamlit run web_demo2.py
+```
+In our tests, the Streamlit-based web demo runs more smoothly when the input prompt is long.
+### CLI Demo
+![cli-demo](resources/cli-demo.png)
+Run [cli_demo.py](cli_demo.py) from the repository:
+```shell
+python cli_demo.py
+```
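+The program runs an interactive conversation in the terminal. Type an instruction and press Enter to generate a reply; type `clear` to clear the conversation history and `stop` to exit the program.
+### API Deployment
+First install the additional dependencies with `pip install fastapi uvicorn`, then run [api.py](api.py) from the repository: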
+```shell
+python api.py
+```
+By default the API is served on local port 8000 and is called with the POST method:
+```shell
+curl -X POST "http://127.0.0.1:8000" \
+     -H 'Content-Type: application/json' \
+     -d '{"prompt": "你好", "history": []}'
+```
+The returned value is
+```shell
+{
+  "response":"你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。",
+  "history":[["你好","你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。"]],
+  "status":200,
+  "time":"2023-03-23 21:38:40"
+}
+```
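+Thanks to [@hiyouga]() for implementing a streaming API deployment in the OpenAI format, which can serve as the backend for any ChatGPT-based application, such as [ChatGPT-Next-Web](https://github.com/Yidadaa/ChatGPT-Next-Web). Deploy it by running [openai_api.py](openai_api.py) from the repository: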
+```shell
+python openai_api.py
+```
+Example code for calling the API:
+```python
+import openai
+if __name__ == "__main__":
+    openai.api_base = "http://localhost:8000/v1"
+    openai.api_key = "none"
+    for chunk in openai.ChatCompletion.create(
+        model="chatglm2-6b",
+        messages=[
+            {"role": "user", "content": "你好"}
+        ],
+        stream=True
+    ):
+        if hasattr(chunk.choices[0].delta, "content"):
+            print(chunk.choices[0].delta.content, end="", flush=True)
+```
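+## Low-Cost Deployment
+### Model Quantization
+By default, the model is loaded in FP16 precision, and running the code above requires about 13GB of GPU memory. If your GPU memory is limited, you can try loading the model with quantization, as follows: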
+```python
+# Adjust as needed; only 4-bit and 8-bit quantization are currently supported
+model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(8).cuda()
+```
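+Model quantization causes some performance loss. In our tests, ChatGLM2-6B can still generate natural and fluent text under 4-bit quantization.
+If you do not have enough memory, you can load the quantized model directly: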
+```python
+model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4",trust_remote_code=True).cuda()
+```
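+<!-- The parameter files of the quantized model can also be downloaded manually from [here](https://cloud.tsinghua.edu.cn/d/674208019e314311ab5c/). -->
+### CPU Deployment
+If you do not have a GPU, you can also run inference on the CPU, although it is slower. Use the following (requires about 32GB of memory):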
+```python
+model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).float()
+```
+If you do not have enough memory, you can also use the quantized model:
+```python
+model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4",trust_remote_code=True).float()
+```
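+Running the quantized model on the CPU requires `gcc` and `openmp`. Most Linux distributions have them installed by default. On Windows, select `openmp` when installing [TDM-GCC](https://jmeubank.github.io/tdm-gcc/). The `gcc` versions tested are `TDM-GCC 10.3.0` on Windows and `gcc 11.3.0` on Linux. On macOS, see [Q1](FAQ.md#q1).
+### Mac Deployment
+On Macs with Apple Silicon or an AMD GPU, you can use the MPS backend to run ChatGLM2-6B on the GPU. Follow Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly (the correct version number looks like 2.x.x.dev2023xxxx, not 2.x.x).
+On macOS, only [loading the model locally](README.md#从本地加载模型) is currently supported. Change the model loading in the code to load from a local path and use the mps backend: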
+```python
+model = AutoModel.from_pretrained("your local path", trust_remote_code=True).to('mps')
+```
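+Loading the half-precision ChatGLM2-6B model requires about 13GB of memory. Machines with less memory (such as a MacBook Pro with 16GB) fall back to virtual memory on disk when free memory runs short, which slows inference severely.
+In that case you can use the quantized model chatglm2-6b-int4. Because the GPU quantization kernels are written in CUDA, they cannot be used on macOS, and inference has to run on the CPU.
+To make full use of CPU parallelism, you also need to [install OpenMP separately](FAQ.md#q1).
+### Multi-GPU Deployment
+If you have multiple GPUs but none has enough memory to hold the full model, you can split the model across them. First install accelerate with `pip install accelerate`, then load the model as follows: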
+```python
+from utils import load_model_on_gpus
+model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
+```
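+This deploys the model on two GPUs for inference. Set `num_gpus` to the number of GPUs you want to use. The model is split evenly by default; you can also pass a `device_map` argument to specify the split yourself.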
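To illustrate what an even split and a custom `device_map` might look like, here is a minimal sketch. It is not the code in `utils.py`: the function name `even_layer_split`, the layer count of 28, and the module path `transformer.encoder.layers.{i}` are all assumptions made for illustration, and the real helper may place embeddings and other modules differently.

```python
def even_layer_split(num_layers, num_gpus):
    """Assign layer indices to GPUs in contiguous, nearly equal chunks."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    base, extra = divmod(num_layers, num_gpus)
    mapping = {}
    layer = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional layer each.
        count = base + (1 if gpu < extra else 0)
        for _ in range(count):
            mapping[layer] = gpu
            layer += 1
    return mapping


# Hypothetical module naming; check the model implementation for the real paths.
ASSUMED_NUM_LAYERS = 28
device_map = {
    f"transformer.encoder.layers.{index}": gpu
    for index, gpu in even_layer_split(ASSUMED_NUM_LAYERS, 2).items()
}
```

A dictionary of this shape, mapping module names to device indices, is the form that accelerate-style device maps take; a hand-written map can shift layers away from a GPU that also has to hold other modules.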
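+## License
+The code in this repository is open-sourced under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, while use of the ChatGLM2-6B model weights must follow the [Model License](MODEL_LICENSE). The ChatGLM2-6B weights are **fully open** to academic research, and **commercial use is also permitted** after obtaining official written permission. If you find our open-source model useful for your business, we welcome your donation towards the development of the next-generation model ChatGLM3. To apply for a commercial license or to donate, please contact [yiwen.xu@zhipuai.cn](mailto:yiwen.xu@zhipuai.cn).
+## Citation
+If you find our work helpful, please consider citing the following papers. The ChatGLM2-6B paper will be released soon, so stay tuned.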
+```
+@article{zeng2022glm,
+  title={Glm-130b: An open bilingual pre-trained model},
+  author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
+  journal={arXiv preprint arXiv:2210.02414},
+  year={2022}
+}
+```
+```
+@inproceedings{du2022glm,
+  title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
+  author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
+  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+  pages={320--335},
+  year={2022}
+}
+```

README_EN.md ADDED

@@ -0,0 +1,256 @@

+<p align="center">
+🤗 <a href="https://huggingface.co/THUDM/chatglm2-6b" target="_blank">HF Repo</a> • 🐦 <a href="https://twitter.com/thukeg" target="_blank">Twitter</a> • 📃 <a href="https://arxiv.org/abs/2103.10360" target="_blank">[GLM@ACL 22]</a> <a href="https://github.com/THUDM/GLM" target="_blank">[GitHub]</a> • 📃 <a href="https://arxiv.org/abs/2210.02414" target="_blank">[GLM-130B@ICLR 23]</a> <a href="https://github.com/THUDM/GLM-130B" target="_blank">[GitHub]</a> <br>
+</p>
+<p align="center">
+    👋 Join our <a href="https://join.slack.com/t/chatglm/shared_invite/zt-1udqapmrr-ocT1DS_mxWe6dDY8ahRWzg" target="_blank">Slack</a> and <a href="resources/WECHAT.md" target="_blank">WeChat</a>
+</p>
+## Introduction
+ChatGLM**2**-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B). It retains the smooth conversation flow and low deployment threshold of the first-generation model, while introducing the following new features:
+1. **Stronger Performance**: Based on the development experience of the first-generation ChatGLM model, we have fully upgraded the base model of ChatGLM2-6B. ChatGLM2-6B uses the hybrid objective function of [GLM](https://github.com/THUDM/GLM), and has undergone pre-training with 1.4T bilingual tokens and human preference alignment training. The [evaluation results](#evaluation) show that, compared to the first-generation model, ChatGLM2-6B has achieved substantial improvements in performance on datasets like MMLU (+23%), CEval (+33%), GSM8K (+571%), BBH (+60%), showing strong competitiveness among models of the same size.
+2. **Longer Context**: Based on [FlashAttention](https://github.com/HazyResearch/flash-attention) technique, we have extended the context length of the base model from 2K in ChatGLM-6B to 32K, and trained with a context length of 8K during the dialogue alignment, allowing for more rounds of dialogue. However, the current version of ChatGLM2-6B has limited understanding of single-round ultra-long documents, which we will focus on optimizing in future iterations.
+3. **More Efficient Inference**: Based on [Multi-Query Attention](http://arxiv.org/abs/1911.02150) technique, ChatGLM2-6B has more efficient inference speed and lower GPU memory usage: under the official  implementation, the inference speed has increased by 42% compared to the first generation; under INT4 quantization, the dialogue length supported by 6G GPU memory has increased from 1K to 8K.
+4. **More Open License**: The weights of ChatGLM2-6B are **fully open** to academic research, and with our official written permission, the weights of ChatGLM2-6B are also **permitted for commercial use**. If you find our open-source model useful for your business, we welcome your donation towards the development of the next-generation model ChatGLM3.
+-----
+The open-source ChatGLM2-6B is intended to promote the development of LLMs together with the open-source community. We earnestly request developers and everyone to abide by the [open-source license](MODEL_LICENSE). Do not use the open-source model, code, or any derivatives from the open-source project for any purposes that may harm nations or societies, or for any services that have not undergone safety assessments and legal approval. **At present, our project team has not developed any applications based on ChatGLM2-6B, including web, Android, Apple iOS, and Windows App applications.**
+Although the model strives to ensure the compliance and accuracy of data at each stage of training, due to the smaller scale of the ChatGLM2-6B model, and its susceptibility to probabilistic randomness, the accuracy of output content cannot be guaranteed, and the model can easily be misled. **Our project does not assume any risks or responsibilities arising from data security, public opinion risks, or any instances of the model being misled, abused, disseminated, or improperly used due to the open-source model and code.**
+## Evaluation
+We selected some typical Chinese and English datasets for evaluation. Below are the evaluation results of the ChatGLM2-6B model on [MMLU](https://github.com/hendrycks/test) (English), [C-Eval](https://cevalbenchmark.com/static/leaderboard.html) (Chinese), [GSM8K](https://github.com/openai/grade-school-math) (Mathematics), [BBH](https://github.com/suzgunmirac/BIG-Bench-Hard) (English).
+### MMLU
+| Model | Average | STEM | Social Sciences | Humanities | Others |
+| ----- | ----- | ---- | ----- | ----- | ----- |
+| ChatGLM-6B | 40.63 | 33.89 | 44.84 | 39.02 | 45.71 |
+| ChatGLM2-6B (base) | 47.86 | 41.20 | 54.44 | 43.66 | 54.46 |
+| ChatGLM2-6B | 45.46 | 40.06 | 51.61 | 41.23 | 51.24 |
+> Chat-aligned version is evaluated under zero-shot CoT (Chain-of-Thought), and Base version is evaluated under few-shot answer-only
+### C-Eval
+| Model | Average | STEM | Social Sciences | Humanities | Others |
+| ----- | ---- | ---- | ----- | ----- | ----- |
+| ChatGLM-6B | 38.9 | 33.3 | 48.3 | 41.3 | 38.0 |
+| ChatGLM2-6B (base) | 51.7 | 48.6 | 60.5 | 51.3 | 49.8 |
+| ChatGLM2-6B | 50.1 | 46.4	| 60.4 | 50.6 | 46.9 |
+> Chat-aligned version is evaluated under zero-shot CoT (Chain-of-Thought), and Base version is evaluated under few-shot answer-only
+### GSM8K
+| Model | Accuracy | Accuracy (Chinese)* |
+| ----- | ----- | ----- |
+| ChatGLM-6B | 4.82 | 5.85 |
+| ChatGLM2-6B (base) | 32.37 | 28.95 |
+| ChatGLM2-6B | 28.05 | 20.45 |
+> All model versions are evaluated under few-shot CoT, and CoT prompts are from http://arxiv.org/abs/2201.11903
+> \* We translate a 500-query subset of GSM8K and its corresponding CoT prompts using machine translation API and subsequent human proofreading.
+### BBH
+| Model | Accuracy |
+| ----- | ----- |
+| ChatGLM-6B | 18.73 |
+| ChatGLM2-6B (base) | 33.68 |
+| ChatGLM2-6B | 30.00 |
+> All model versions are evaluated under few-shot CoT, and CoT prompts are from https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/cot-prompts
+## Inference Efficiency
+ChatGLM2-6B employs [Multi-Query Attention](http://arxiv.org/abs/1911.02150) to improve inference speed. Here is a comparison of the average speed for generating 2000 tokens.
+| Model | Inference Speed (tokens/s) |
+| ----  | -----  |
+| ChatGLM-6B  | 31.49 |
+| ChatGLM2-6B | 44.62 |
+> Under our official implementation, batch size = 1, max length = 2048, bf16 precision, tested on an A100-SXM4-80G with PyTorch 2.0.1
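The 42% speedup quoted in the introduction follows from the two figures in this table. A short worked check, where the helper name `relative_speedup` is hypothetical:

```python
def relative_speedup(baseline, improved):
    """Percentage gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0


chatglm_6b_speed = 31.49   # from the table above
chatglm2_6b_speed = 44.62  # from the table above

print(f"{relative_speedup(chatglm_6b_speed, chatglm2_6b_speed):.1f}%")  # prints 41.7%
```

Rounded to the nearest whole percent, this gives the 42% stated earlier.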
+Multi-Query Attention also reduces the GPU memory usage of the KV Cache during inference. Additionally, ChatGLM2-6B uses Causal Mask for dialogue training, which allows the reuse of the KV Cache from previous rounds in continuous dialogues, further optimizing GPU memory usage. Therefore, when performing INT4 quantization inference with a 6GB GPU, while the first-generation ChatGLM-6B can only generate a maximum of 1119 tokens before running out of memory, ChatGLM2-6B can generate at least 8192 tokens.
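The saving from sharing key/value heads can be worked through numerically. This sketch assumes a configuration of 28 layers, 32 query heads, 2 shared key/value groups, and a head dimension of 128, with 2-byte (FP16/BF16) cache entries. None of these values is stated in this README, so treat them as illustrative rather than as the model's confirmed settings.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key/value cache for one sequence."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed configuration (illustrative, not taken from this README).
NUM_LAYERS = 28
NUM_QUERY_HEADS = 32
NUM_KV_GROUPS = 2
HEAD_DIM = 128
SEQ_LEN = 8192

# Standard multi-head attention caches one K/V pair per query head.
mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# With shared key/value heads, only the K/V groups are cached.
mqa_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_GROUPS, HEAD_DIM, SEQ_LEN)

print(f"per-head K/V cache: {mha_bytes / 2**30:.2f} GiB")
print(f"shared K/V cache:   {mqa_bytes / 2**20:.0f} MiB")
print(f"reduction factor:   {mha_bytes // mqa_bytes}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to key/value groups, which is why a long generation fits in far less memory.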
+| **Quantization** | **Encoding 2048 Tokens** | **Decoding 8192 Tokens** |
+| -------------- | --------------------- | --------------- |
+| FP16 / BF16 | 13.1 GB             | 12.8 GB             |
+| INT8           | 8.2 GB              | 8.1 GB              |
+| INT4           | 5.5 GB              | 5.1 GB              |
+> ChatGLM2-6B takes advantage of `torch.nn.functional.scaled_dot_product_attention` introduced in PyTorch 2.0 for efficient Attention computation. If the PyTorch version is lower, it will fallback to the naive Attention implementation, which may result in higher GPU memory usage than shown in the table above.
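As a rough cross-check of the table, the memory needed for the weights alone can be estimated from the parameter count and the bytes per parameter. This is a back-of-the-envelope sketch that assumes about 6.2 billion parameters (an assumption, not a figure from this README) and that every parameter is stored at the quantized width.

```python
BYTES_PER_PARAM = {"FP16 / BF16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights only, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


ASSUMED_NUM_PARAMS = 6.2e9  # assumption for illustration

for level, size in BYTES_PER_PARAM.items():
    print(f"{level}: {weight_memory_gb(ASSUMED_NUM_PARAMS, size):.1f} GB")
```

These estimates are lower bounds. The measured figures in the table are higher, presumably because they also cover activations, the KV cache, and any parts of the model that are not quantized.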
+We also tested the impact of quantization on model performance. The results show that the impact of quantization on model performance is within an acceptable range.
+| Quantization | Accuracy (MMLU) | Accuracy (C-Eval dev) |
+| ----- | ----- |-----------------------|
+| BF16 | 45.47 | 53.57                 |
+| INT4 | 43.13 | 50.30                 |
+## ChatGLM2-6B Examples
+Compared to the first-generation model, ChatGLM2-6B has made improvements in multiple dimensions. Below are some comparison examples. More possibilities with ChatGLM2-6B are waiting for you to explore and discover!
+<details><summary><b>Mathematics and Logic</b></summary>
+![](resources/math.png)
+</details>
+<details><summary><b>Knowledge Reasoning</b></summary>
+![](resources/knowledge.png)
+</details>
+<details><summary><b>Long Document Understanding</b></summary>
+![](resources/long-context.png)
+</details>
+## Getting Started
+### Environment Setup
+Install dependencies with pip: `pip install -r requirements.txt`. The recommended version of the `transformers` library is `4.30.2`, and `torch` 2.0 or higher is recommended for the best inference performance.
+We provide a web page demo and a command line demo. You need to download this repository to use them:
+```shell
+git clone https://github.com/THUDM/ChatGLM2-6B
+cd ChatGLM2-6B
+```
+### Usage
+Generate dialogue with the following code:
+```python
+>>> from transformers import AutoTokenizer, AutoModel
+>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+>>> model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, device='cuda').eval()
+>>> response, history = model.chat(tokenizer, "你好", history=[])
+>>> print(response)
+你好👋!我是人工智能助手 ChatGLM2-6B,很高兴见到你,欢迎问我任何问题。
+>>> response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
+>>> print(response)
+晚上睡不着可能会让你感到焦虑或不舒服,但以下是一些可以帮助你入睡的方法:
+1. 制定规律的睡眠时间表:保持规律的睡眠时间表可以帮助你建立健康的睡眠习惯,使你更容易入睡。尽量在每天的相同时间上床,并在同一时间起床。
+2. 创造一个舒适的睡眠环境:确保睡眠环境舒适,安静,黑暗且温度适宜。可以使用舒适的床上用品,并保持房间通风。
+3. 放松身心:在睡前做些放松的活动,例如泡个热水澡,听些轻柔的音乐,阅读一些有趣的书籍等,有助于缓解紧张和焦虑,使你更容易入睡。
+4. 避免饮用含有咖啡因的饮料:咖啡因是一种刺激性物质,会影响你的睡眠质量。尽量避免在睡前饮用含有咖啡因的饮料,例如咖啡,茶和可乐。
+5. 避免在床上做与睡眠无关的事情:在床上做些与睡眠无关的事情,例如看电影,玩游戏或工作等,可能会干扰你的睡眠。
+6. 尝试呼吸技巧:深呼吸是一种放松技巧,可以帮助你缓解紧张和焦虑,使你更容易入睡。试着慢慢吸气,保持几秒钟,然后缓慢呼气。
+如果这些方法无法帮助你入睡,你可以考虑咨询医生或睡眠专家,寻求进一步的建议。
+```
+The implementation of the model is still changing. To pin the model implementation and ensure compatibility, add the `revision="v1.0"` parameter to the `from_pretrained` call. `v1.0` is currently the latest version number. For a complete list of versions, see [Change Log](https://huggingface.co/THUDM/chatglm2-6b#change-log).
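Pinning the revision could look like the following. This is a minimal sketch: `PINNED_KWARGS` and `load_pinned` are hypothetical names introduced here, and the revision is applied to both the tokenizer and the model so that they stay in step.

```python
# Keyword arguments shared by the tokenizer and model loaders.
PINNED_KWARGS = {"trust_remote_code": True, "revision": "v1.0"}


def load_pinned(model_id="THUDM/chatglm2-6b"):
    """Load the tokenizer and model at the pinned revision."""
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, **PINNED_KWARGS)
    model = AutoModel.from_pretrained(model_id, **PINNED_KWARGS)
    return tokenizer, model.eval()
```

Calling `load_pinned()` downloads the weights on first use, so it needs network access or a populated local cache.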
+### Web Demo
+![web-demo](resources/web-demo.gif)
+Install Gradio with `pip install gradio`, and run [web_demo.py](web_demo.py):
+```shell
+python web_demo.py
+```
+The program runs a web server and outputs the URL. Open the URL in the browser to use the web demo.
+### CLI Demo
+![cli-demo](resources/cli-demo.png)
+Run [cli_demo.py](cli_demo.py) in the repo:
+```shell
+python cli_demo.py
+```
+The command runs an interactive program in the shell. Type your instruction in the shell and hit enter to generate the response. Type `clear` to clear the dialogue history and `stop` to terminate the program.
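The `clear` and `stop` commands amount to a small dispatch step in front of the model call. The sketch below shows that control flow only; it is not the code in `cli_demo.py`. The names `handle_line` and `run_session` are hypothetical, and `reply_fn` stands in for the model call.

```python
def handle_line(line, history):
    """Decide what one line of user input means.

    Returns (action, history), where action is "exit", "cleared", or "chat".
    """
    command = line.strip()
    if command == "stop":
        return "exit", history
    if command == "clear":
        return "cleared", []
    return "chat", history


def run_session(lines, reply_fn):
    """Feed lines through the dispatcher; reply_fn stands in for the model."""
    history = []
    for line in lines:
        action, history = handle_line(line, history)
        if action == "exit":
            break
        if action == "chat":
            response = reply_fn(line, history)
            history = history + [(line, response)]
    return history
```

Keeping the command handling separate from the model call makes the loop testable without loading the model.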
+## API Deployment
+First install the additional dependencies with `pip install fastapi uvicorn`. Then run [api.py](api.py) in the repo:
+```shell
+python api.py
+```
+By default the API runs on port `8000` of the local machine. You can call it via
+```shell
+curl -X POST "http://127.0.0.1:8000" \
+     -H 'Content-Type: application/json' \
+     -d '{"prompt": "你好", "history": []}'
+```
+The returned value is
+```json
+{
+  "response":"你好👋！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。",
+  "history":[["你好","你好👋！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。"]],
+  "status":200,
+  "time":"2023-03-23 21:38:40"
+}
+```
+## Deployment
+### Quantization
+By default, the model parameters are loaded in FP16 precision, which requires about 13GB of GPU memory. If your GPU memory is limited, you can load the model with quantization:
+```python
+# Change according to your hardware. Only 4-bit and 8-bit quantization are supported for now.
+model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(8).cuda()
+```
+Quantization causes some performance loss on benchmark datasets, but in our testing ChatGLM2-6B can still generate natural, fluent text under 4-bit quantization.
+### CPU Deployment
+If your computer has no GPU, you can also run inference on the CPU, but it is slower and requires about 32GB of memory:
+```python
+model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).float()
+```
+### Inference on Mac
+For Macs (and MacBooks) with Apple Silicon, you can use the MPS backend to run ChatGLM2-6B on the GPU. First, follow Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly (the correct version number should be 2.1.0.dev2023xxxx, not 2.0.0).
+Currently you must [load the model locally](README_en.md#load-the-model-locally) on macOS. Change the code to load the model from your local path, and use the `mps` backend:
+```python
+model = AutoModel.from_pretrained("your local path", trust_remote_code=True).to('mps')
+```
+Loading the FP16 ChatGLM2-6B model requires about 13GB of memory. Machines with less memory (such as a MacBook Pro with 16GB of memory) fall back to virtual memory on disk when free memory runs out, which slows inference down severely.
+## License
+The code of this repository is licensed under [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0). The use of the ChatGLM2-6B model weights is subject to the [Model License](MODEL_LICENSE). ChatGLM2-6B weights are **completely open** for academic research, and **commercial use** is also allowed after **obtaining official written permission**. If you find our open source model useful for your business, we welcome your donations towards the development of the next generation model, ChatGLM3. For related matters, please contact [yiwen.xu@zhipuai.cn](mailto:yiwen.xu@zhipuai.cn).
+## Citation
+If you find our work useful, please consider citing the following papers. The technical report for ChatGLM2-6B will be out soon.
+```
+@article{zeng2022glm,
+  title={Glm-130b: An open bilingual pre-trained model},
+  author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
+  journal={arXiv preprint arXiv:2210.02414},
+  year={2022}
+}
+```
+```
+@inproceedings{du2022glm,
+  title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
+  author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
+  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+  pages={320--335},
+  year={2022}
+}
+```
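
Note on the memory figures in README_EN.md above: the Quantization and CPU Deployment sections quote about 13GB for FP16 and about 32GB for FP32. These follow roughly from parameter count times bytes per parameter. Below is a minimal sketch of that arithmetic. It assumes about 6.2 billion parameters, a figure not stated in this commit, and the helper name `weight_memory_gb` is hypothetical. It counts weights only; activations, the KV cache and framework overhead come on top, which is presumably why the README figures are higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumed parameter count, for illustration only (not taken from this commit).
ASSUMED_PARAMS = 6.2e9

if __name__ == "__main__":
    for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        gb = weight_memory_gb(ASSUMED_PARAMS, bits)
        print(f"{label}: ~{gb:.1f} GB of weights")
```

Under that assumption, halving the bits per parameter halves the weight footprint, which is the reason `quantize(8)` or `quantize(4)` helps on smaller GPUs.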

api.py ADDED Viewed

	@@ -0,0 +1,60 @@

+from fastapi import FastAPI, Request
+from transformers import AutoTokenizer, AutoModel
+import uvicorn, json, datetime
+import torch
+DEVICE = "cuda"
+DEVICE_ID = "0"
+CUDA_DEVICE = f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE
+def torch_gc():
+    if torch.cuda.is_available():
+        with torch.cuda.device(CUDA_DEVICE):
+            torch.cuda.empty_cache()
+            torch.cuda.ipc_collect()
+app = FastAPI()
+@app.post("/")
+async def create_item(request: Request):
+    global model, tokenizer
+    json_post_raw = await request.json()
+    json_post = json.dumps(json_post_raw)
+    json_post_list = json.loads(json_post)
+    prompt = json_post_list.get('prompt')
+    history = json_post_list.get('history')
+    max_length = json_post_list.get('max_length')
+    top_p = json_post_list.get('top_p')
+    temperature = json_post_list.get('temperature')
+    response, history = model.chat(tokenizer,
+                                   prompt,
+                                   history=history,
+                                   max_length=max_length if max_length else 2048,
+                                   top_p=top_p if top_p else 0.7,
+                                   temperature=temperature if temperature else 0.95)
+    now = datetime.datetime.now()
+    time = now.strftime("%Y-%m-%d %H:%M:%S")
+    answer = {
+        "response": response,
+        "history": history,
+        "status": 200,
+        "time": time
+    }
+    log = "[" + time + '] prompt:"' + prompt + '", response:"' + repr(response) + '"'
+    print(log)
+    torch_gc()
+    return answer
+if __name__ == '__main__':
+    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+    model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
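+    # Multi-GPU support: replace the two lines above with the three lines below (this also needs
+    # `from utils import load_model_on_gpus`), and set num_gpus to the number of GPUs you actually have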
+    # model_path = "THUDM/chatglm2-6b"
+    # tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+    # model = load_model_on_gpus(model_path, num_gpus=2)
+    model.eval()
+    uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
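
Note on api.py: the `curl` call shown in README_EN.md can also be made from Python. The sketch below uses only the standard library and sends the JSON fields that `create_item` reads (`prompt`, `history`, and optionally `max_length`, `top_p`, `temperature`). The helper names `build_chat_request` and `chat` are hypothetical, and the default URL assumes the server runs locally on port 8000 as in the README.

```python
import json
import urllib.request


def build_chat_request(prompt, history=None, url="http://127.0.0.1:8000", **sampling):
    """Build a POST request carrying the JSON fields that api.py reads."""
    payload = {"prompt": prompt, "history": history or []}
    # Optional fields read by api.py: max_length, top_p, temperature.
    payload.update({k: v for k, v in sampling.items() if v is not None})
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}, method="POST"
    )


def chat(prompt, history=None, **sampling):
    """Send the request and return (response, history) from the JSON reply."""
    req = build_chat_request(prompt, history, **sampling)
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"], body["history"]
```

Passing the returned `history` back into the next `chat(...)` call continues the conversation. Because api.py falls back to its defaults with `x if x else default`, a value of `0` for `top_p` or `temperature` is treated as not set.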

cli_demo.py ADDED Viewed

	@@ -0,0 +1,60 @@

+import os
+import platform
+import signal
+from transformers import AutoTokenizer, AutoModel
+import readline
+tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
+# Multi-GPU support: replace the line above with the two lines below, and set num_gpus to the number of GPUs you actually have
+# from utils import load_model_on_gpus
+# model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
+model = model.eval()
+os_name = platform.system()
+clear_command = 'cls' if os_name == 'Windows' else 'clear'
+stop_stream = False
+def build_prompt(history):
+    prompt = "欢迎使用 ChatGLM2-6B 模型，输入内容即可进行对话，clear 清空对话历史，stop 终止程序"
+    for query, response in history:
+        prompt += f"\n\n用户：{query}"
+        prompt += f"\n\nChatGLM2-6B：{response}"
+    return prompt
+def signal_handler(signal, frame):
+    global stop_stream
+    stop_stream = True
+def main():
+    past_key_values, history = None, []
+    global stop_stream
+    print("欢迎使用 ChatGLM2-6B 模型，输入内容即可进行对话，clear 清空对话历史，stop 终止程序")
+    while True:
+        query = input("\n用户：")
+        if query.strip() == "stop":
+            break
+        if query.strip() == "clear":
+            past_key_values, history = None, []
+            os.system(clear_command)
+            print("欢迎使用 ChatGLM2-6B 模型，输入内容即可进行对话，clear 清空对话历史，stop 终止程序")
+            continue
+        print("\nChatGLM：", end="")
+        current_length = 0
+        for response, history, past_key_values in model.stream_chat(tokenizer, query, history=history,
+                                                                    past_key_values=past_key_values,
+                                                                    return_past_key_values=True):
+            if stop_stream:
+                stop_stream = False
+                break
+            else:
+                print(response[current_length:], end="", flush=True)
+                current_length = len(response)
+        print("")
+if __name__ == "__main__":
+    main()
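
Note on cli_demo.py: the inner loop relies on `stream_chat` yielding the full response so far on every step, and prints only the part that has not been shown yet via `response[current_length:]`. A minimal sketch of that pattern, with a plain list standing in for the model stream; `iter_deltas` is a hypothetical helper, and unlike the loop above it skips steps that add no new text:

```python
from typing import Iterable, Iterator


def iter_deltas(cumulative: Iterable[str]) -> Iterator[str]:
    """Turn a stream of growing strings into the newly added suffixes."""
    printed = 0  # number of characters already emitted
    for text in cumulative:
        if len(text) > printed:
            yield text[printed:]
            printed = len(text)


if __name__ == "__main__":
    # Stand-in for model.stream_chat(...): each item is the whole response so far.
    fake_stream = ["He", "Hello", "Hello", "Hello!"]
    for piece in iter_deltas(fake_stream):
        print(piece, end="", flush=True)
    print()
```

The same slicing idea appears in `predict` in openai_api.py, where each suffix becomes the `content` of one streamed chunk.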

evaluation/README.md ADDED Viewed

	@@ -0,0 +1,10 @@

+First download the preprocessed C-Eval dataset from [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/e84444333b6d434ea7b0) and extract it into the `evaluation` directory. Then run
+```shell
+cd evaluation
+python evaluate_ceval.py
+```
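+This script runs prediction on the C-Eval validation set and prints the accuracy. To get results on the test set, change `./CEval/val/**/*.jsonl` in the code to `./CEval/test/**/*.jsonl`, save the results in the format required by C-Eval, and submit them on the [official website](https://cevalbenchmark.com/).
+The reported results were obtained with an internal parallel evaluation framework, so your results may fluctuate slightly.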

evaluation/evaluate_ceval.py ADDED Viewed

	@@ -0,0 +1,60 @@

+import os
+import glob
+import re
+import json
+import torch
+import torch.utils.data
+from transformers import AutoTokenizer, AutoModel
+from tqdm import tqdm
+tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).bfloat16().cuda()
+choices = ["A", "B", "C", "D"]
+choice_tokens = [tokenizer.encode(choice, add_special_tokens=False)[0] for choice in choices]
+def build_prompt(text):
+    return "[Round {}]\n\n问：{}\n\n答：".format(1, text)
+extraction_prompt = '综上所述，ABCD中正确的选项是：'
+accuracy_dict, count_dict = {}, {}
+with torch.no_grad():
+    for entry in glob.glob("./CEval/val/**/*.jsonl", recursive=True):
+        dataset = []
+        with open(entry, encoding='utf-8') as file:
+            for line in file:
+                dataset.append(json.loads(line))
+        correct = 0
+        dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
+        for batch in tqdm(dataloader):
+            texts = batch["inputs_pretokenized"]
+            queries = [build_prompt(query) for query in texts]
+            inputs = tokenizer(queries, padding=True, return_tensors="pt", truncation=True, max_length=2048).to('cuda')
+            outputs = model.generate(**inputs, do_sample=False, max_new_tokens=512)
+            intermediate_outputs = []
+            for idx in range(len(outputs)):
+                output = outputs.tolist()[idx][len(inputs["input_ids"][idx]):]
+                response = tokenizer.decode(output)
+                intermediate_outputs.append(response)
+            answer_texts = [text + intermediate + "\n" + extraction_prompt for text, intermediate in
+                            zip(texts, intermediate_outputs)]
+            input_tokens = [build_prompt(answer_text) for answer_text in answer_texts]
+            inputs = tokenizer(input_tokens, padding=True, return_tensors="pt", truncation=True, max_length=2048).to('cuda')
+            outputs = model(**inputs, return_last_logit=True)
+            logits = outputs.logits[:, -1]
+            logits = logits[:, choice_tokens]
+            preds = logits.argmax(dim=-1)
+            correct += (preds.cpu() == batch["label"]).sum().item()
+        accuracy = correct / len(dataset)
+        print(entry, accuracy)
+        accuracy_dict[entry] = accuracy
+        count_dict[entry] = len(dataset)
+acc_total, count_total = 0.0, 0
+for key in accuracy_dict:
+    acc_total += accuracy_dict[key] * count_dict[key]
+    count_total += count_dict[key]
+print(acc_total / count_total)
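
Note on evaluate_ceval.py: the final number is not the plain mean of the per-file accuracies. Each file's accuracy is weighted by its number of examples, which equals total correct answers divided by total examples. A small sketch of that aggregation with made-up numbers; `overall_accuracy` is a hypothetical helper name:

```python
def overall_accuracy(accuracy_by_file, count_by_file):
    """Example-weighted accuracy across files, as in the last lines of the script."""
    total = sum(count_by_file.values())
    if total == 0:
        raise ValueError("no examples were evaluated")
    correct = sum(accuracy_by_file[k] * count_by_file[k] for k in accuracy_by_file)
    return correct / total


if __name__ == "__main__":
    # Made-up values: a small subject with low accuracy, a large one with high accuracy.
    acc = {"subject_a.jsonl": 0.5, "subject_b.jsonl": 1.0}
    cnt = {"subject_a.jsonl": 2, "subject_b.jsonl": 6}
    print(overall_accuracy(acc, cnt))
```

With these values the weighted result is 0.875, whereas the unweighted mean of the two accuracies would be 0.75, so larger subjects dominate the reported score.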

openai_api.py ADDED Viewed

	@@ -0,0 +1,174 @@

+# coding=utf-8
+# Implements API for ChatGLM2-6B in OpenAI's format. (https://platform.openai.com/docs/api-reference/chat)
+# Usage: python openai_api.py
+# Visit http://localhost:8000/docs for documents.
+import time
+import torch
+import uvicorn
+from pydantic import BaseModel, Field
+from fastapi import FastAPI, HTTPException
+from fastapi.middleware.cors import CORSMiddleware
+from contextlib import asynccontextmanager
+from starlette.responses import StreamingResponse
+from typing import Any, Dict, List, Literal, Optional, Union
+from transformers import AutoTokenizer, AutoModel
+@asynccontextmanager
+async def lifespan(app: FastAPI): # collects GPU memory
+    yield
+    if torch.cuda.is_available():
+        torch.cuda.empty_cache()
+        torch.cuda.ipc_collect()
+app = FastAPI(lifespan=lifespan)
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+class ModelCard(BaseModel):
+    id: str
+    object: str = "model"
+    created: int = Field(default_factory=lambda: int(time.time()))
+    owned_by: str = "owner"
+    root: Optional[str] = None
+    parent: Optional[str] = None
+    permission: Optional[list] = None
+class ModelList(BaseModel):
+    object: str = "list"
+    data: List[ModelCard] = []
+class ChatMessage(BaseModel):
+    role: Literal["user", "assistant", "system"]
+    content: str
+class DeltaMessage(BaseModel):
+    role: Optional[Literal["user", "assistant", "system"]] = None
+    content: Optional[str] = None
+class ChatCompletionRequest(BaseModel):
+    model: str
+    messages: List[ChatMessage]
+    temperature: Optional[float] = None
+    top_p: Optional[float] = None
+    max_length: Optional[int] = None
+    stream: Optional[bool] = False
+class ChatCompletionResponseChoice(BaseModel):
+    index: int
+    message: ChatMessage
+    finish_reason: Literal["stop", "length"]
+class ChatCompletionResponseStreamChoice(BaseModel):
+    index: int
+    delta: DeltaMessage
+    finish_reason: Optional[Literal["stop", "length"]]
+class ChatCompletionResponse(BaseModel):
+    model: str
+    object: Literal["chat.completion", "chat.completion.chunk"]
+    choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
+    created: Optional[int] = Field(default_factory=lambda: int(time.time()))
+@app.get("/v1/models", response_model=ModelList)
+async def list_models():
+    global model_args
+    model_card = ModelCard(id="gpt-3.5-turbo")
+    return ModelList(data=[model_card])
+@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
+async def create_chat_completion(request: ChatCompletionRequest):
+    global model, tokenizer
+    if request.messages[-1].role != "user":
+        raise HTTPException(status_code=400, detail="Invalid request")
+    query = request.messages[-1].content
+    prev_messages = request.messages[:-1]
+    if len(prev_messages) > 0 and prev_messages[0].role == "system":
+        query = prev_messages.pop(0).content + query
+    history = []
+    if len(prev_messages) % 2 == 0:
+        for i in range(0, len(prev_messages), 2):
+            if prev_messages[i].role == "user" and prev_messages[i+1].role == "assistant":
+                history.append([prev_messages[i].content, prev_messages[i+1].content])
+    if request.stream:
+        generate = predict(query, history, request.model)
+        return StreamingResponse(generate, media_type="text/event-stream")
+    response, _ = model.chat(tokenizer, query, history=history)
+    choice_data = ChatCompletionResponseChoice(
+        index=0,
+        message=ChatMessage(role="assistant", content=response),
+        finish_reason="stop"
+    )
+    return ChatCompletionResponse(model=request.model, choices=[choice_data], object="chat.completion")
+async def predict(query: str, history: List[List[str]], model_id: str):
+    global model, tokenizer
+    choice_data = ChatCompletionResponseStreamChoice(
+        index=0,
+        delta=DeltaMessage(role="assistant"),
+        finish_reason=None
+    )
+    chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk")
+    yield "data: {}\n\n".format(chunk.json(exclude_unset=True, ensure_ascii=False))
+    current_length = 0
+    for new_response, _ in model.stream_chat(tokenizer, query, history):
+        if len(new_response) == current_length:
+            continue
+        new_text = new_response[current_length:]
+        current_length = len(new_response)
+        choice_data = ChatCompletionResponseStreamChoice(
+            index=0,
+            delta=DeltaMessage(content=new_text),
+            finish_reason=None
+        )
+        chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk")
+        yield "data: {}\n\n".format(chunk.json(exclude_unset=True, ensure_ascii=False))
+    choice_data = ChatCompletionResponseStreamChoice(
+        index=0,
+        delta=DeltaMessage(),
+        finish_reason="stop"
+    )
+    chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk")
+    yield "data: {}\n\n".format(chunk.json(exclude_unset=True, ensure_ascii=False))
+if __name__ == "__main__":
+    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+    model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
+    # Multi-GPU support: replace the line above with the two lines below, and set num_gpus to the number of GPUs you actually have
+    # from utils import load_model_on_gpus
+    # model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
+    model.eval()
+    uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
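
Note on openai_api.py: `create_chat_completion` converts an OpenAI-style message list into the `(query, history)` form that `model.chat` expects. The last message must come from the user, a leading system message is prepended to the query, and the remaining messages are paired into user/assistant turns. The sketch below mirrors that logic on plain dicts so it can be tried without the server. `split_messages` is a hypothetical name, and it raises `ValueError` where the endpoint returns HTTP 400.

```python
def split_messages(messages):
    """Return (query, history) from a list of {'role': ..., 'content': ...} dicts."""
    if not messages or messages[-1]["role"] != "user":
        raise ValueError("the last message must come from the user")
    query = messages[-1]["content"]
    prev = list(messages[:-1])
    # A leading system message is folded into the query text.
    if prev and prev[0]["role"] == "system":
        query = prev.pop(0)["content"] + query
    history = []
    # Only an even number of remaining messages can form complete turns.
    if len(prev) % 2 == 0:
        for i in range(0, len(prev), 2):
            if prev[i]["role"] == "user" and prev[i + 1]["role"] == "assistant":
                history.append([prev[i]["content"], prev[i + 1]["content"]])
    return query, history


if __name__ == "__main__":
    demo = [
        {"role": "system", "content": "Be brief. "},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"},
        {"role": "user", "content": "Bye"},
    ]
    print(split_messages(demo))
```

One consequence of this design is that an odd number of earlier messages yields an empty history without raising an error, so clients should send complete user/assistant pairs.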

requirements.txt ADDED Viewed

	@@ -0,0 +1,8 @@

+protobuf
+transformers==4.30.2
+cpm_kernels
+torch>=2.0
+gradio
+mdtex2html
+sentencepiece
+accelerate

resources/WECHAT.md ADDED Viewed

	@@ -0,0 +1,7 @@

+<div align="center">
+<img src=wechat.jpg width="60%"/>
+<p> 扫码关注公众号，加入「ChatGLM交流群」 </p>
+<p> Scan the QR code to follow the official account and join the "ChatGLM Discussion Group" </p>
+</div>

resources/cli-demo.png ADDED Viewed

resources/knowledge.png ADDED Viewed

resources/long-context.png ADDED Viewed

Git LFS Details

SHA256: 9df24161083739a775aa47abeb53a95ab066ad498192d061b0a4941fcc74f35c
Pointer size: 132 Bytes
Size of remote file: 1.11 MB

resources/math.png ADDED Viewed

resources/web-demo.gif ADDED Viewed

Git LFS Details

SHA256: ba8ff042bbd879cbb4dd3795081b2e4e3713d3a4d2d5d7d61a027c389324cbbc
Pointer size: 132 Bytes
Size of remote file: 2.28 MB

resources/web-demo.png ADDED Viewed

resources/wechat.jpg ADDED Viewed

utils.py ADDED Viewed

	@@ -0,0 +1,59 @@

+import os
+from typing import Dict, Tuple, Union, Optional
+from torch.nn import Module
+from transformers import AutoModel
+def auto_configure_device_map(num_gpus: int) -> Dict[str, int]:
+    # transformer.word_embeddings counts as 1 layer
+    # transformer.final_layernorm and lm_head together count as 1 layer
+    # transformer.layers counts as 28 layers
+    # In total, 30 layers are distributed across num_gpus GPUs
+    num_trans_layers = 28
+    per_gpu_layers = 30 / num_gpus
+    # bugfix: on Linux, the weight and input passed to torch.embedding can end up on different devices, causing a RuntimeError
+    # On Windows, model.device is set to transformer.word_embeddings.device
+    # On Linux, model.device is set to lm_head.device
+    # When chat or stream_chat is called, input_ids is moved to model.device
+    # If transformer.word_embeddings.device differs from model.device, a RuntimeError is raised
+    # So transformer.word_embeddings, transformer.final_layernorm and lm_head are all placed on the first GPU
+    # This file is adapted from https://github.com/THUDM/ChatGLM-6B/blob/main/utils.py
+    # Only minor changes were made here to support ChatGLM2
+    device_map = {
+        'transformer.embedding.word_embeddings': 0,
+        'transformer.encoder.final_layernorm': 0,
+        'transformer.output_layer': 0,
+        'transformer.rotary_pos_emb': 0,
+        'lm_head': 0
+    }
+    used = 2
+    gpu_target = 0
+    for i in range(num_trans_layers):
+        if used >= per_gpu_layers:
+            gpu_target += 1
+            used = 0
+        assert gpu_target < num_gpus
+        device_map[f'transformer.encoder.layers.{i}'] = gpu_target
+        used += 1
+    return device_map
+def load_model_on_gpus(checkpoint_path: Union[str, os.PathLike], num_gpus: int = 2,
+                       device_map: Optional[Dict[str, int]] = None, **kwargs) -> Module:
+    if num_gpus < 2 and device_map is None:
+        model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half().cuda()
+    else:
+        from accelerate import dispatch_model
+        model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half()
+        if device_map is None:
+            device_map = auto_configure_device_map(num_gpus)
+        model = dispatch_model(model, device_map=device_map)
+    return model
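
Note on utils.py: `auto_configure_device_map` treats the model as 30 slots (28 transformer layers plus 2 slots for the embedding and output parts pinned to GPU 0) and fills the GPUs in order. The sketch below reproduces only the allocation loop, without torch or accelerate, to show how many transformer layers each GPU receives. `plan_layers` and `layers_per_gpu` are hypothetical helper names.

```python
from collections import Counter


def plan_layers(num_gpus, num_trans_layers=28, extra_slots=2):
    """Return the GPU index chosen for each transformer layer, in order."""
    per_gpu = (num_trans_layers + extra_slots) / num_gpus
    used, gpu = extra_slots, 0  # GPU 0 starts with the 2 pinned slots in use
    plan = []
    for _ in range(num_trans_layers):
        if used >= per_gpu:
            gpu += 1
            used = 0
        plan.append(gpu)
        used += 1
    return plan


def layers_per_gpu(num_gpus):
    """Count how many transformer layers land on each GPU."""
    return dict(Counter(plan_layers(num_gpus)))


if __name__ == "__main__":
    for n in (2, 4):
        print(n, layers_per_gpu(n))
```

Under this scheme the first GPU receives fewer transformer layers than the others because it also hosts the embedding and output parts, so the split is even by slot count, not by layer count.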

web_demo.py ADDED Viewed

	@@ -0,0 +1,108 @@

+from transformers import AutoModel, AutoTokenizer
+import gradio as gr
+import mdtex2html
+from utils import load_model_on_gpus
+tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
+# Multi-GPU support: replace the line above with the two lines below, and set num_gpus to the number of GPUs you actually have
+# from utils import load_model_on_gpus
+# model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
+model = model.eval()
+"""Override Chatbot.postprocess"""
+def postprocess(self, y):
+    if y is None:
+        return []
+    for i, (message, response) in enumerate(y):
+        y[i] = (
+            None if message is None else mdtex2html.convert((message)),
+            None if response is None else mdtex2html.convert(response),
+        )
+    return y
+gr.Chatbot.postprocess = postprocess
+def parse_text(text):
+    """copy from https://github.com/GaiZhenbiao/ChuanhuChatGPT/"""
+    lines = text.split("\n")
+    lines = [line for line in lines if line != ""]
+    count = 0
+    for i, line in enumerate(lines):
+        if "```" in line:
+            count += 1
+            items = line.split('`')
+            if count % 2 == 1:
+                lines[i] = f'<pre><code class="language-{items[-1]}">'
+            else:
+                lines[i] = f'<br></code></pre>'
+        else:
+            if i > 0:
+                if count % 2 == 1:
+                    line = line.replace("`", "\\`")
+                    line = line.replace("<", "&lt;")
+                    line = line.replace(">", "&gt;")
+                    line = line.replace(" ", "&nbsp;")
+                    line = line.replace("*", "&ast;")
+                    line = line.replace("_", "&lowbar;")
+                    line = line.replace("-", "&#45;")
+                    line = line.replace(".", "&#46;")
+                    line = line.replace("!", "&#33;")
+                    line = line.replace("(", "&#40;")
+                    line = line.replace(")", "&#41;")
+                    line = line.replace("$", "&#36;")
+                lines[i] = "<br>"+line
+    text = "".join(lines)
+    return text
+def predict(input, chatbot, max_length, top_p, temperature, history, past_key_values):
+    chatbot.append((parse_text(input), ""))
+    for response, history, past_key_values in model.stream_chat(tokenizer, input, history, past_key_values=past_key_values,
+                                                                return_past_key_values=True,
+                                                                max_length=max_length, top_p=top_p,
+                                                                temperature=temperature):
+        chatbot[-1] = (parse_text(input), parse_text(response))
+        yield chatbot, history, past_key_values
+def reset_user_input():
+    return gr.update(value='')
+def reset_state():
+    return [], [], None
+with gr.Blocks() as demo:
+    gr.HTML("""<h1 align="center">ChatGLM2-6B</h1>""")
+    chatbot = gr.Chatbot()
+    with gr.Row():
+        with gr.Column(scale=4):
+            with gr.Column(scale=12):
+                user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=10).style(
+                    container=False)
+            with gr.Column(min_width=32, scale=1):
+                submitBtn = gr.Button("Submit", variant="primary")
+        with gr.Column(scale=1):
+            emptyBtn = gr.Button("Clear History")
+            max_length = gr.Slider(0, 32768, value=8192, step=1.0, label="Maximum length", interactive=True)
+            top_p = gr.Slider(0, 1, value=0.8, step=0.01, label="Top P", interactive=True)
+            temperature = gr.Slider(0, 1, value=0.95, step=0.01, label="Temperature", interactive=True)
+    history = gr.State([])
+    past_key_values = gr.State(None)
+    submitBtn.click(predict, [user_input, chatbot, max_length, top_p, temperature, history, past_key_values],
+                    [chatbot, history, past_key_values], show_progress=True)
+    submitBtn.click(reset_user_input, [], [user_input])
+    emptyBtn.click(reset_state, outputs=[chatbot, history, past_key_values], show_progress=True)
+demo.queue().launch(share=False, inbrowser=True)
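
Note on web_demo.py: `parse_text` tracks whether it is inside a fenced code block by counting fence lines, opens `<pre><code>` on odd counts and closes it on even counts, and escapes characters inside the block so they are not interpreted as Markdown or HTML. The sketch below shows the same toggle idea in a simplified form. It is not the function above: it swaps the character-by-character replacements for the standard library's `html.escape`, and `render_fences` is a hypothetical name.

```python
import html

FENCE = "`" * 3  # the Markdown code-fence marker


def render_fences(text):
    """Convert fenced code blocks to <pre><code> and escape their contents."""
    out, in_code = [], False
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if not in_code:
                lang = stripped[len(FENCE):].strip()
                out.append(f'<pre><code class="language-{lang}">')
            else:
                out.append("</code></pre>")
            in_code = not in_code  # toggle on every fence line
        elif in_code:
            out.append(html.escape(line))
        else:
            out.append(line)
    return "\n".join(out)
```

A boolean flag replaces the `count % 2` test; both encode whether an odd number of fence lines has been seen so far.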

web_demo2.py ADDED Viewed

	@@ -0,0 +1,75 @@

+from transformers import AutoModel, AutoTokenizer
+import streamlit as st
+from streamlit_chat import message
+st.set_page_config(
+    page_title="ChatGLM2-6b 演示",
+    page_icon=":robot:",
+    layout='wide'
+)
+@st.cache_resource
+def get_model():
+    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
+    model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()
+    # Multi-GPU support: replace the line above with the two lines below, and set num_gpus to the number of GPUs you actually have
+    # from utils import load_model_on_gpus
+    # model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
+    model = model.eval()
+    return tokenizer, model
+MAX_TURNS = 20
+MAX_BOXES = MAX_TURNS * 2
+def predict(input, max_length, top_p, temperature, history=None):
+    tokenizer, model = get_model()
+    if history is None:
+        history = []
+    with container:
+        if len(history) > 0:
+            if len(history)>MAX_BOXES:
+                history = history[-MAX_TURNS:]
+            for i, (query, response) in enumerate(history):
+                message(query, avatar_style="big-smile", key=str(i) + "_user")
+                message(response, avatar_style="bottts", key=str(i))
+        message(input, avatar_style="big-smile", key=str(len(history)) + "_user")
+        st.write("AI正在回复:")
+        with st.empty():
+            for response, history in model.stream_chat(tokenizer, input, history, max_length=max_length, top_p=top_p,
+                                               temperature=temperature):
+                query, response = history[-1]
+                st.write(response)
+    return history
+container = st.container()
+# create a prompt text for the text generation
+prompt_text = st.text_area(label="用户命令输入",
+            height = 100,
+            placeholder="请在这儿输入您的命令")
+max_length = st.sidebar.slider(
+    'max_length', 0, 32768, 8192, step=1
+)
+top_p = st.sidebar.slider(
+    'top_p', 0.0, 1.0, 0.8, step=0.01
+)
+temperature = st.sidebar.slider(
+    'temperature', 0.0, 1.0, 0.95, step=0.01
+)
+if 'state' not in st.session_state:
+    st.session_state['state'] = []
+if st.button("发送", key="predict"):
+    with st.spinner("AI正在思考，请稍等........"):
+        # text generation
+        st.session_state["state"] = predict(prompt_text, max_length, top_p, temperature, st.session_state["state"])
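
Note on web_demo2.py: `predict` bounds the conversation history. Once it holds more than `MAX_BOXES` entries (twice `MAX_TURNS`), only the most recent `MAX_TURNS` entries are kept. A minimal sketch of that rule in isolation, assuming each entry is one `(query, response)` pair as in the loop above; `trim_history` is a hypothetical helper name:

```python
def trim_history(history, max_turns=20):
    """Keep only the latest max_turns entries once the list exceeds 2 * max_turns."""
    if len(history) > max_turns * 2:
        return history[-max_turns:]
    return history


if __name__ == "__main__":
    turns = [(f"q{i}", f"a{i}") for i in range(41)]
    print(len(trim_history(turns)))  # trimmed to the 20 most recent turns
```

Because the threshold is twice the retained size, the history can grow to 40 turns before it is cut back to 20, so trimming happens only occasionally and not on every request.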