SevenQin committed
Commit 4450c0d · 1 Parent(s): eee2a50

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes.
.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/hfagent_chat_1.png filter=lfs diff=lfs merge=lfs -text
+ assets/hfagent_chat_2.png filter=lfs diff=lfs merge=lfs -text
+ assets/hfagent_run.png filter=lfs diff=lfs merge=lfs -text
+ assets/wanx_colorful_black.png filter=lfs diff=lfs merge=lfs -text
.github/ISSUE_TEMPLATE/bug_report.yaml ADDED
@@ -0,0 +1,88 @@
1
+ name: 🐞 Bug
2
+ description: 提交错误报告 | File a bug/issue
3
+ title: "[BUG] <title>"
4
+ labels: []
5
+ body:
6
+ - type: checkboxes
7
+ attributes:
8
+ label: 是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
9
+ description: |
10
+ 请先搜索您遇到的错误是否在已有的issues或讨论中提到过。
11
+ Please search to see if an issue / discussion already exists for the bug you encountered.
12
+ [Issues](https://github.com/QwenLM/Qwen-7B/issues)
13
+ [Discussions](https://github.com/QwenLM/Qwen-7B/discussions)
14
+ options:
15
+ - label: 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions
16
+ required: true
17
+ - type: checkboxes
18
+ attributes:
19
+ label: 该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
20
+ description: |
21
+ 请先搜索您遇到的错误是否已在FAQ中有相关解答。
22
+ Please search to see if an answer already exists in FAQ for the bug you encountered.
23
+ [FAQ-en](https://github.com/QwenLM/Qwen-7B/blob/main/FAQ.md)
24
+ [FAQ-zh](https://github.com/QwenLM/Qwen-7B/blob/main/FAQ_zh.md)
25
+ options:
26
+ - label: 我已经搜索过FAQ | I have searched FAQ
27
+ required: true
28
+ - type: textarea
29
+ attributes:
30
+ label: 当前行为 | Current Behavior
31
+ description: |
32
+ 准确描述遇到的行为。
33
+ A concise description of what you're experiencing.
34
+ validations:
35
+ required: false
36
+ - type: textarea
37
+ attributes:
38
+ label: 期望行为 | Expected Behavior
39
+ description: |
40
+ 准确描述预期的行为。
41
+ A concise description of what you expected to happen.
42
+ validations:
43
+ required: false
44
+ - type: textarea
45
+ attributes:
46
+ label: 复现方法 | Steps To Reproduce
47
+ description: |
48
+ 复现当前行为的详细步骤。
49
+ Steps to reproduce the behavior.
50
+ placeholder: |
51
+ 1. In this environment...
52
+ 2. With this config...
53
+ 3. Run '...'
54
+ 4. See error...
55
+ validations:
56
+ required: false
57
+ - type: textarea
58
+ attributes:
59
+ label: 运行环境 | Environment
60
+ description: |
61
+ examples:
62
+ - **OS**: Ubuntu 20.04
63
+ - **Python**: 3.8
64
+ - **Transformers**: 4.31.0
65
+ - **PyTorch**: 2.0.1
66
+ - **CUDA**: 11.4
67
+ value: |
68
+ - OS:
69
+ - Python:
70
+ - Transformers:
71
+ - PyTorch:
72
+ - CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
73
+ render: Markdown
74
+ validations:
75
+ required: false
76
+ - type: textarea
77
+ attributes:
78
+ label: 备注 | Anything else?
79
+ description: |
80
+ 您可以在这里补充其他关于该问题背景信息的描述、链接或引用等。
81
+
82
+ 您可以通过点击高亮此区域然后拖动文件的方式上传图片或日志文件。
83
+
84
+ Links? References? Anything that will give us more context about the issue you are encountering!
85
+
86
+ Tip: You can attach images or log files by clicking this area to highlight it and then dragging files in.
87
+ validations:
88
+ required: false
.github/ISSUE_TEMPLATE/config.yaml ADDED
@@ -0,0 +1 @@
+ blank_issues_enabled: true
.github/ISSUE_TEMPLATE/feature_request.yaml ADDED
@@ -0,0 +1,78 @@
1
+ name: "💡 Feature Request"
2
+ description: 创建新功能请求 | Create a new ticket for a new feature request
3
+ title: "💡 [REQUEST] - <title>"
4
+ labels: [
5
+ "question"
6
+ ]
7
+ body:
8
+ - type: input
9
+ id: start_date
10
+ attributes:
11
+ label: "起始日期 | Start Date"
12
+ description: |
13
+ 起始开发日期
14
+ Start of development
15
+ placeholder: "month/day/year"
16
+ validations:
17
+ required: false
18
+ - type: textarea
19
+ id: implementation_pr
20
+ attributes:
21
+ label: "实现PR | Implementation PR"
22
+ description: |
23
+ 实现该功能的Pull request
24
+ Pull request used
25
+ placeholder: "#Pull Request ID"
26
+ validations:
27
+ required: false
28
+ - type: textarea
29
+ id: reference_issues
30
+ attributes:
31
+ label: "相关Issues | Reference Issues"
32
+ description: |
33
+ 与该功能相关的issues
34
+ Common issues
35
+ placeholder: "#Issues IDs"
36
+ validations:
37
+ required: false
38
+ - type: textarea
39
+ id: summary
40
+ attributes:
41
+ label: "摘要 | Summary"
42
+ description: |
43
+ 简要描述新功能的特点
44
+ Provide a brief explanation of the feature
45
+ placeholder: |
46
+ Describe in a few lines your feature request
47
+ validations:
48
+ required: true
49
+ - type: textarea
50
+ id: basic_example
51
+ attributes:
52
+ label: "基本示例 | Basic Example"
53
+ description: Indicate here some basic examples of your feature.
54
+ placeholder: A few specific words about your feature request.
55
+ validations:
56
+ required: true
57
+ - type: textarea
58
+ id: drawbacks
59
+ attributes:
60
+ label: "缺陷 | Drawbacks"
61
+ description: |
62
+ 该新功能有哪些缺陷/可能造成哪些影响?
63
+ What are the drawbacks/impacts of your feature request ?
64
+ placeholder: |
65
+ Identify the drawbacks and impacts while being neutral on your feature request
66
+ validations:
67
+ required: true
68
+ - type: textarea
69
+ id: unresolved_question
70
+ attributes:
71
+ label: "未解决问题 | Unresolved questions"
72
+ description: |
73
+ 有哪些尚未解决的问题?
74
+ What questions still remain unresolved ?
75
+ placeholder: |
76
+ Identify any unresolved issues.
77
+ validations:
78
+ required: false
.gitignore ADDED
@@ -0,0 +1,11 @@
+ __pycache__
+ *.so
+ build
+ .coverage_*
+ *.egg-info
+ *~
+ .vscode/
+ .idea/
+ .DS_Store
+
+ /private/
.idea/.gitignore ADDED
@@ -0,0 +1,3 @@
+ # 默认忽略的文件
+ /shelf/
+ /workspace.xml
.idea/Qwen-7B.iml ADDED
@@ -0,0 +1,14 @@
+ <?xml version="1.0" encoding="UTF-8"?>
+ <module type="PYTHON_MODULE" version="4">
+ <component name="NewModuleRootManager">
+ <content url="file://$MODULE_DIR$">
+ <excludeFolder url="file://$MODULE_DIR$/venv" />
+ </content>
+ <orderEntry type="jdk" jdkName="Python 3.8" jdkType="Python SDK" />
+ <orderEntry type="sourceFolder" forTests="false" />
+ </component>
+ <component name="PyDocumentationSettings">
+ <option name="format" value="GOOGLE" />
+ <option name="myDocStringFormat" value="Google" />
+ </component>
+ </module>
.idea/inspectionProfiles/profiles_settings.xml ADDED
@@ -0,0 +1,6 @@
+ <component name="InspectionProjectProfileManager">
+ <settings>
+ <option name="USE_PROJECT_PROFILE" value="false" />
+ <version value="1.0" />
+ </settings>
+ </component>
.idea/misc.xml ADDED
@@ -0,0 +1,4 @@
+ <?xml version="1.0" encoding="UTF-8"?>
+ <project version="4">
+ <component name="ProjectRootManager" version="2" project-jdk-name="Python 3.8" project-jdk-type="Python SDK" />
+ </project>
.idea/modules.xml ADDED
@@ -0,0 +1,8 @@
+ <?xml version="1.0" encoding="UTF-8"?>
+ <project version="4">
+ <component name="ProjectModuleManager">
+ <modules>
+ <module fileurl="file://$PROJECT_DIR$/.idea/Qwen-7B.iml" filepath="$PROJECT_DIR$/.idea/Qwen-7B.iml" />
+ </modules>
+ </component>
+ </project>
.idea/vcs.xml ADDED
@@ -0,0 +1,6 @@
+ <?xml version="1.0" encoding="UTF-8"?>
+ <project version="4">
+ <component name="VcsDirectoryMappings">
+ <mapping directory="$PROJECT_DIR$" vcs="Git" />
+ </component>
+ </project>
.idea/workspace.xml ADDED
@@ -0,0 +1,42 @@
+ <?xml version="1.0" encoding="UTF-8"?>
+ <project version="4">
+ <component name="ChangeListManager">
+ <list default="true" id="6e38c486-b8fd-44ae-a3ad-dac5e5eec7fb" name="变更" comment="">
+ <change beforePath="$PROJECT_DIR$/README.md" beforeDir="false" afterPath="$PROJECT_DIR$/README.md" afterDir="false" />
+ <change beforePath="$PROJECT_DIR$/web_demo.py" beforeDir="false" afterPath="$PROJECT_DIR$/web_demo.py" afterDir="false" />
+ </list>
+ <option name="SHOW_DIALOG" value="false" />
+ <option name="HIGHLIGHT_CONFLICTS" value="true" />
+ <option name="HIGHLIGHT_NON_ACTIVE_CHANGELIST" value="false" />
+ <option name="LAST_RESOLUTION" value="IGNORE" />
+ </component>
+ <component name="Git.Settings">
+ <option name="RECENT_GIT_ROOT_PATH" value="$PROJECT_DIR$" />
+ </component>
+ <component name="MarkdownSettingsMigration">
+ <option name="stateVersion" value="1" />
+ </component>
+ <component name="ProjectId" id="2UhnaDhw369BTArilFxQUASXjkM" />
+ <component name="ProjectViewState">
+ <option name="hideEmptyMiddlePackages" value="true" />
+ <option name="showLibraryContents" value="true" />
+ </component>
+ <component name="PropertiesComponent"><![CDATA[{
+ "keyToString": {
+ "RunOnceActivity.OpenProjectViewOnStart": "true",
+ "RunOnceActivity.ShowReadmeOnStart": "true",
+ "last_opened_file_path": "E:/Llama/Qwen-7B"
+ }
+ }]]></component>
+ <component name="SpellCheckerSettings" RuntimeDictionaries="0" Folders="0" CustomDictionaries="0" DefaultDictionary="应用程序级" UseSingleDictionary="true" transferred="true" />
+ <component name="TaskManager">
+ <task active="true" id="Default" summary="默认任务">
+ <changelist id="6e38c486-b8fd-44ae-a3ad-dac5e5eec7fb" name="变更" comment="" />
+ <created>1693400742376</created>
+ <option name="number" value="Default" />
+ <option name="presentableId" value="Default" />
+ <updated>1693400742376</updated>
+ </task>
+ <servers />
+ </component>
+ </project>
FAQ.md ADDED
@@ -0,0 +1,85 @@
+ # FAQ
+
+ ## Installation & Environment
+
+ #### Failure in installing flash attention
+
+ Flash attention is an option for accelerating training and inference. Only NVIDIA GPUs of the Turing, Ampere, Ada, and Hopper architectures, e.g., H100, A100, RTX 3090, T4, and RTX 2080, support flash attention. You can use our models without installing it.
+
+ #### Which version of transformers should I use?
+
+ 4.31.0 is preferred.
+
+ #### I downloaded the codes and checkpoints but I can't load the model locally. What should I do?
+
+ Please check if you have updated the code to the latest version and correctly downloaded all the sharded checkpoint files.
+
+ #### `qwen.tiktoken` is not found. What is it?
+
+ This is the merge file of the tokenizer. You have to download it. Note that if you just git clone the repo without [git-lfs](https://git-lfs.com), you cannot download this file.
+
+ #### transformers_stream_generator/tiktoken/accelerate not found
+
+ Run the command `pip install -r requirements.txt`. You can find the file at [https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt).
+ <br><br>
+
+
+
+ ## Demo & Inference
+
+ #### Is there any demo? CLI demo and Web UI demo?
+
+ Yes, see `web_demo.py` for the web demo and `cli_demo.py` for the CLI demo. See the README for more information.
+
+
+
+ #### Can I use CPU only?
+
+ Yes, running `python cli_demo.py --cpu-only` will load the model and run inference on CPU only.
+
+ #### Can Qwen support streaming?
+
+ Yes. See the function `chat_stream` in `modeling_qwen.py`.
+
+ #### Gibberish in result when using chat_stream().
+
+ This is because tokens represent bytes, and a single token may be a meaningless string. We have updated the default setting of our tokenizer to avoid such decoding results. Please update the code to the latest version.
+
+ #### It seems that the generation is not related to the instruction...
+
+ Please check if you are loading Qwen-7B-Chat instead of Qwen-7B. Qwen-7B is the base model without alignment, which behaves differently from the SFT/Chat model.
+
+ #### Is quantization supported?
+
+ Yes, quantization is supported by `bitsandbytes`. We are working on an improved version and will release the quantized model checkpoints.
+
+ #### Errors in running quantized models: `importlib.metadata.PackageNotFoundError: No package metadata was found for bitsandbytes`
+
+ For Linux users, running `pip install bitsandbytes` directly can solve the problem. For Windows users, you can run `python -m pip install bitsandbytes --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui`.
+
+ #### Slow when processing long sequences
+
+ We solved this problem. Updating the code to the latest version can help.
+
+ #### Unsatisfactory performance in processing long sequences
+
+ Please ensure that NTK is applied. `use_dynamic_ntk` and `use_logn_attn` in `config.json` should be set to `true` (`true` by default).
+ <br><br>
+
+
+
+ ## Finetuning
+
+ #### Can Qwen support SFT or even RLHF?
+
+ We do not provide finetuning or RLHF code for now. However, some projects already support finetuning, see [FastChat](https://github.com/lm-sys/FastChat), [Firefly](https://github.com/yangjianxin1/Firefly), [LLaMA Efficient Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning), etc. We will soon update the relevant code.
+ <br><br>
+
+
+
+ ## Tokenizer
+
+ #### bos_id/eos_id/pad_id not found
+
+ In our training, we only use `<|endoftext|>` as the separator and padding token. You can set `bos_id`, `eos_id`, and `pad_id` to `tokenizer.eod_id`. Learn more from our documentation on the tokenizer.
+
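A minimal sketch of the `tokenizer.eod_id` workaround described in the last answer, assuming the standard Quickstart loading flow shown in the README below (illustrative only, not part of the committed FAQ.md):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer as in the README Quickstart.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval()

inputs = tokenizer("蒙古国的首都是", return_tensors="pt").to(model.device)
# Reuse <|endoftext|> (tokenizer.eod_id) wherever an eos/pad id is expected.
outputs = model.generate(**inputs, eos_token_id=tokenizer.eod_id, pad_token_id=tokenizer.eod_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```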
FAQ_ja.md ADDED
@@ -0,0 +1,85 @@
1
+ # FAQ
2
+
3
+ ## インストールと環境
4
+
5
+ #### Flash attention 導入の失敗例
6
+
7
+ Flash attention は、トレーニングと推論を加速するオプションです。H100、A100、RTX 3090、T4、RTX 2080 などの Turing、Ampere、Ada、および Hopper アーキテクチャの NVIDIA GPU だけが、flash attention をサポートできます。それをインストールせずに私たちのモデルを使用することができます。
8
+
9
+ #### transformers のバージョンは?
10
+
11
+ 4.31.0 が望ましいです。
12
+
13
+ #### コードとチェックポイントをダウンロードしましたが、モデルをローカルにロードできません。どうすればよいでしょうか?
14
+
15
+ コードを最新のものに更新し、すべてのシャードされたチェックポイントファイルを正しくダウンロードしたかどうか確認してください。
16
+
17
+ #### `qwen.tiktoken` が見つかりません。これは何ですか?
18
+
19
+ これはトークナイザーのマージファイルです。ダウンロードする必要があります。[git-lfs](https://git-lfs.com) を使わずにリポジトリを git clone しただけでは、このファイルをダウンロードできないことに注意してください。
20
+
21
+ #### transformers_stream_generator/tiktoken/accelerate が見つかりません。
22
+
23
+ コマンド `pip install -r requirements.txt` を実行してください。このファイルは [https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt) にあります。
24
+ <br><br>
25
+
26
+
27
+
28
+ ## デモと推論
29
+
30
+ #### デモはありますか?CLI と Web UI のデモはありますか?
31
+
32
+ はい、Web デモは `web_demo.py` を、CLI デモは `cli_demo.py` を参照してください。詳しくは README を参照してください。
33
+
34
+
35
+
36
+ #### CPU のみを使うことはできますか?
37
+
38
+ はい、`python cli_demo.py --cpu-only` を実行すると、CPU のみでモデルと推論をロードします。
39
+
40
+ #### Qwen はストリーミングに対応していますか?
41
+
42
+ `modeling_qwen.py` の `chat_stream` 関数を参照してください。
43
+
44
+ #### chat_stream() を使用すると、結果に文字化けが発生します。
45
+
46
+ これは、トークンがバイトを表し、単一のトークンが無意味な文字列である可能性があるためです。このようなデコード結果を避けるため、トークナイザのデフォルト設定を更新しました。コードを最新版に更新してください。
47
+
48
+ #### インストラクションとは関係ないようですが...
49
+
50
+ Qwen-7B ではなく Qwen-7B-Chat を読み込んでいないか確認してください。Qwen-7B はアライメントなしのベースモデルで、SFT/Chat モデルとは挙動が異なります。
51
+
52
+ #### 量子化はサポートされていますか?
53
+
54
+ はい、量子化は `bitsandbytes` でサポートされています。私たちは改良版の開発に取り組んでおり、量子化されたモデルのチェックポイントをリリースする予定です。
55
+
56
+ #### 量子化モデル実行時のエラー: `importlib.metadata.PackageNotFoundError: No package metadata was found for bitsandbytes`
57
+
58
+ Linux ユーザの場合は,`pip install bitsandbytes` を直接実行することで解決できます。Windows ユーザの場合は、`python -m pip install bitsandbytes --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui` を実行することができます。
59
+
60
+ #### 長いシーケンスの処理に時間がかかる
61
+
62
+ この問題は解決しました。コードを最新版に更新することで解決します。
63
+
64
+ #### 長いシーケンスの処理で不満足なパフォーマンス
65
+
66
+ NTK が適用されていることを確認してください。`config.json` の `use_dynamic_ntk` と `use_logn_attn` を `true` に設定する必要があります(デフォルトでは `true`)。
67
+ <br><br>
68
+
69
+
70
+
71
+ ## ファインチューニング
72
+
73
+ #### Qwen は SFT、あるいは RLHF に対応できますか?
74
+
75
+ 今のところ、ファインチューニングや RLHF のコードは提供していません。しかし、[FastChat](https://github.com/lm-sys/FastChat)、[Firefly](https://github.com/yangjianxin1/Firefly)、[LLaMA Efficient Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning)など、いくつかのプロジェクトではファインチューニングをサポートしています。近日中に関連コードを更新する予定です。
76
+ <br><br>
77
+
78
+
79
+
80
+ ## トークナイザー
81
+
82
+ #### bos_id/eos_id/pad_id が見つかりません。
83
+
84
+ 私たちのトレーニングでは、セパレータとパディングトークンとして `<|endoftext|>` のみを使用しています。bos_id、eos_id、pad_id は tokenizer.eod_id に設定できます。私たちのトークナイザーについて詳しくは、トークナイザーについてのドキュメントをご覧ください。
85
+
FAQ_zh.md ADDED
@@ -0,0 +1,80 @@
1
+ # FAQ
2
+
3
+ ## 安装&环境
4
+
5
+ #### flash attention 安装失败
6
+
7
+ flash attention是一个用于加速模型训练推理的可选项,且仅适用于Turing、Ampere、Ada、Hopper架构的Nvidia GPU显卡(如H100、A100、RTX 3090、T4、RTX 2080),您可以在不安装flash attention的情况下正常使用模型进行推理。
8
+
9
+ #### 我应该用哪个transformers版本?
10
+
11
+ 建议使用4.31.0。
12
+
13
+ #### 我把模型和代码下到本地,按照教程无法使用,该怎么办?
14
+
15
+ 答:别着急,先检查你的代码是不是更新到最新版本,然后确认你是否完整地将模型checkpoint下到本地。
16
+
17
+ #### `qwen.tiktoken`这个文件找不到,怎么办?
18
+
19
+ 这个是我们的tokenizer的merge文件,你必须下载它才能使用我们的tokenizer。注意,如果你使用git clone却没有使用git-lfs,这个文件不会被下载。如果你不了解git-lfs,可点击[官网](https://git-lfs.com/)了解。
20
+
21
+ #### transformers_stream_generator/tiktoken/accelerate,这几个库提示找不到,怎么办?
22
+
23
+ 运行如下命令:`pip install -r requirements.txt`。相关依赖库在[https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt) 可以找到。
24
+ <br><br>
25
+
26
+
27
+ ## Demo & 推理
28
+
29
+ #### 是否提供Demo?CLI Demo及Web UI Demo?
30
+
31
+ `web_demo.py`和`cli_demo.py`分别提供了Web UI以及CLI的Demo。请查看README相关内容了解更多。
32
+
33
+ #### 我没有GPU,只用CPU运行CLI demo可以吗?
34
+
35
+ 可以的,运行`python cli_demo.py --cpu-only`命令即可将模型读取到CPU并使用CPU进行推理。
36
+
37
+ #### Qwen支持流式推理吗?
38
+
39
+ Qwen当前支持流式推理。见位于`modeling_qwen.py`的`chat_stream`函数。
40
+
41
+ #### 使用`chat_stream()`生成混乱的内容及乱码,为什么?
42
+
43
+ 这是由于模型生成过程中输出的部分token需要与后续token一起解码才能输出正常文本,单个token解码结果是无意义字符串,我们已经更新了tokenizer解码时的默认设置,避免这些字符串在生成结果中出现,如果仍有类似问题请更新模型至最新版本。
44
+
45
+ #### 模型的输出看起来与输入无关/没有遵循指令/看起来呆呆的
46
+
47
+ 请检查是否加载的是Qwen-7B-Chat模型进行推理,Qwen-7B模型是未经align的预训练基模型,不期望具备响应用户指令的能力。我们在模型最新版本已经对`chat`及`chat_stream`接口内进行了检查,避免您误将预训练模型作为SFT/Chat模型使用。
48
+
49
+ #### 是否有量化版本模型
50
+
51
+ 目前Qwen支持基于`bitsandbytes`的8-bit和4-bit的量化推理。后续我们将进一步更新提供更加高效的量化推理实现,并提供对应的量化模型。
52
+
53
+ #### 运行量化推理报错:`importlib.metadata.PackageNotFoundError: No package metadata was found for bitsandbytes`
54
+
55
+ 对于linux 用户,直接`pip install bitsandbytes`即可。对于windows用户,可以 运行`python -m pip install bitsandbytes --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui`。
56
+
57
+ #### 生成序列较长后速度显著变慢
58
+
59
+ 这一问题已经在最新版本中修复。请更新到最新代码。
60
+
61
+ #### 处理长序列时效果有问题
62
+
63
+ 请确认是否开启ntk。若要启用这些技巧,请将`config.json`里的`use_dynamc_ntk`和`use_logn_attn`设置为`true`。最新代码默认为`true`。
64
+ <br><br>
65
+
66
+
67
+ ## 微调
68
+
69
+ #### 当前是否支持SFT和RLHF?
70
+
71
+ 我们目前未提供SFT和RLHF代码。当前有多个外部项目已实现支持,如[FastChat](**[https://github.com/lm-sys/FastChat](https://github.com/lm-sys/FastChat))、[Firefly]([https://github.com/yangjianxin1/Firefly](https://github.com/yangjianxin1/Firefly))、[**LLaMA Efficient Tuning**]([https://github.com/hiyouga/LLaMA-Efficient-Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning))等。我们会尽快更新这部分代码和说明。
72
+ <br><br>
73
+
74
+
75
+ ## Tokenizer
76
+
77
+ #### bos_id/eos_id/pad_id,这些token id不存在,为什么?
78
+
79
+ 在训练过程中,我们仅使用<|endoftext|>这一token作为sample/document之间的分隔符及padding位置占位符,你可以将bos_id, eos_id, pad_id均指向tokenizer.eod_id。请阅读我们关于tokenizer的文档,了解如何设置这些id。
80
+
LICENSE ADDED
@@ -0,0 +1,53 @@
1
+ Tongyi Qianwen LICENSE AGREEMENT
2
+
3
+ Tongyi Qianwen Release Date: August 3, 2023
4
+
5
+ By clicking to agree or by using or distributing any portion or element of the Tongyi Qianwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
6
+
7
+ 1. Definitions
8
+ a. This Tongyi Qianwen LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
9
+ b. "We"(or "Us") shall mean Alibaba Cloud.
10
+ c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
11
+ d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You.
12
+ e. "Tongyi Qianwen" shall mean the large language models (including Qwen-7B model and Qwen-7B-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us.
13
+ f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement.
14
+ g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
15
+ h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation,
16
+ and conversions to other media types.
17
+
18
+ 2. Grant of Rights
19
+ You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials.
20
+
21
+ 3. Redistribution
22
+ You may reproduce and distribute copies of the Materials or derivative works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
23
+ a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement;
24
+ b. You shall cause any modified files to carry prominent notices stating that You changed the files;
25
+ c. You shall retain in all copies of the Materials that You distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Tongyi Qianwen is licensed under the Tongyi Qianwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and
26
+ d. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such derivative works as a whole, provided Your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement.
27
+
28
+ 4. Restrictions
29
+ If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us. You cannot exercise your rights under this Agreement without our express authorization.
30
+
31
+ 5. Rules of use
32
+ a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials.
33
+ b. You can not use the Materials or any output therefrom to improve any other large language model (excluding Tongyi Qianwen or derivative works thereof).
34
+
35
+ 6. Intellectual Property
36
+ a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for Us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications.
37
+ b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of Us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials.
38
+ c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licences granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought.
39
+
40
+ 7. Disclaimer of Warranty and Limitation of Liability
41
+
42
+ a. We are not obligated to support, update, provide training for, or develop any further version of the Tongyi Qianwen Materials or to grant any license thereto.
43
+ b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM.
44
+ c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED.
45
+ d. You will defend, indemnify and hold harmless Us from and against any claim by any third party arising out of or related to your use or distribution of the Materials.
46
+
47
+ 8. Survival and Termination.
48
+ a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
49
+ b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 7 and 9 shall survive the termination of this Agreement.
50
+
51
+ 9. Governing Law and Jurisdiction.
52
+ a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
53
+ b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.
NOTICE ADDED
@@ -0,0 +1,52 @@
1
+ ------------- LICENSE FOR NVIDIA Megatron-LM code --------------
2
+
3
+ Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
4
+
5
+ Redistribution and use in source and binary forms, with or without
6
+ modification, are permitted provided that the following conditions
7
+ are met:
8
+ * Redistributions of source code must retain the above copyright
9
+ notice, this list of conditions and the following disclaimer.
10
+ * Redistributions in binary form must reproduce the above copyright
11
+ notice, this list of conditions and the following disclaimer in the
12
+ documentation and/or other materials provided with the distribution.
13
+ * Neither the name of NVIDIA CORPORATION nor the names of its
14
+ contributors may be used to endorse or promote products derived
15
+ from this software without specific prior written permission.
16
+
17
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
18
+ EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
19
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
20
+ PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
21
+ CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
22
+ EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
23
+ PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
24
+ PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
25
+ OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
26
+ (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
27
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
28
+
29
+
30
+ ------------- LICENSE FOR OpenAI tiktoken code --------------
31
+
32
+ MIT License
33
+
34
+ Copyright (c) 2022 OpenAI, Shantanu Jain
35
+
36
+ Permission is hereby granted, free of charge, to any person obtaining a copy
37
+ of this software and associated documentation files (the "Software"), to deal
38
+ in the Software without restriction, including without limitation the rights
39
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
40
+ copies of the Software, and to permit persons to whom the Software is
41
+ furnished to do so, subject to the following conditions:
42
+
43
+ The above copyright notice and this permission notice shall be included in all
44
+ copies or substantial portions of the Software.
45
+
46
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
47
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
48
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
49
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
50
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
51
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
52
+ SOFTWARE.
README.md CHANGED
@@ -1,12 +1,449 @@
  ---
- title: Cmkj Gpt
- emoji: 🌍
- colorFrom: gray
- colorTo: red
  sdk: gradio
  sdk_version: 3.41.2
- app_file: app.py
- pinned: false
  ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

  ---
+ title: cmkj-gpt
+ app_file: web_demo.py
  sdk: gradio
  sdk_version: 3.41.2
  ---
7
+ <p align="left">
8
+ <a href="README_CN.md">中文</a>&nbsp | &nbspEnglish&nbsp | &nbsp<a href="README_JA.md">日本語</a>
9
+ </p>
10
+ <br><br>
11
+
12
+ <p align="center">
13
+ <img src="assets/logo.jpg" width="400"/>
14
+ <p>
15
+ <br>
16
+
17
+ <p align="center">
18
+ Qwen-7B <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖 <a> | <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a>&nbsp | Qwen-7B-Chat <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖 <a> | <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a>&nbsp | Qwen-7B-Chat-Int4 <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int4">🤗</a>
19
+ <br>
20
+ <a href="assets/wechat.png">WeChat</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>&nbsp | &nbsp<a href="https://github.com/QwenLM/Qwen-7B/blob/main/tech_memo.md">Report</a>
21
+ </p>
22
+ <br><br>
23
+
24
+ We opensource **Qwen-7B** and **Qwen-7B-Chat** on both **🤖 ModelScope** and **🤗 Hugging Face** (click the logos above to visit the repos with code and checkpoints). This repo includes a brief introduction to Qwen-7B, usage guidance, and a technical memo [link](tech_memo.md) that provides more information.
25
+
26
+ Qwen-7B is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. The features of the Qwen-7B series include:
27
+
28
+ 1. **Trained with high-quality pretraining data**. We have pretrained Qwen-7B on a self-constructed large-scale high-quality dataset of over 2.2 trillion tokens. The dataset includes plain texts and codes, and it covers a wide range of domains, including general domain data and professional domain data.
29
+ 2. **Strong performance**. In comparison with models of similar size, Qwen-7B outperforms the competitors on a series of benchmark datasets that evaluate natural language understanding, mathematics, coding, etc.
30
+ 3. **Better support of languages**. Our tokenizer, based on a large vocabulary of over 150K tokens, is more efficient than many other tokenizers. It is friendly to many languages, and it helps users further finetune Qwen-7B to extend its understanding of a certain language.
31
+ 4. **Support of 8K Context Length**. Both Qwen-7B and Qwen-7B-Chat support a context length of 8K, which allows inputs with long contexts.
32
+ 5. **Support of Plugins**. Qwen-7B-Chat is trained with plugin-related alignment data, and thus it is capable of using tools, including APIs, models, databases, etc., and of acting as an agent.
33
+
34
+ The following sections include information that you might find helpful. Specifically, we advise you to read the FAQ section before opening an issue.
35
+ <br>
36
+
37
+ ## News and Updates
38
+
39
+ * 2023.8.21 We release the Int4 quantized model for Qwen-7B-Chat, **Qwen-7B-Chat-Int4**, which reduces memory costs and improves inference speed, with no significant performance degradation on the benchmark evaluation.
40
+ * 2023.8.3 We release both **Qwen-7B** and **Qwen-7B-Chat** on ModelScope and Hugging Face. We also provide a technical memo for more details about the model, including training details and model performance.
41
+
42
+ ## Performance
43
+
44
+ In general, Qwen-7B outperforms baseline models of a similar size, and even outperforms larger models of around 13B parameters, on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, HumanEval, WMT22, CMMLU, etc., which evaluate the models' capabilities in natural language understanding, mathematical problem solving, coding, etc. See the results below.
45
+
46
+ | Model | MMLU | C-Eval | GSM8K | HumanEval | WMT22 (en-zh) | CMMLU |
47
+ | :------------- | :--------: | :--------: | :--------: | :---------: | :-------------: | :--------: |
48
+ | LLaMA-7B | 35.1 | - | 11.0 | 10.5 | 8.7 | - |
49
+ | LLaMA 2-7B | 45.3 | - | 14.6 | 12.8 | 17.9 | - |
50
+ | Baichuan-7B | 42.3 | 42.8 | 9.7 | 9.2 | 26.6 | 44.4 |
51
+ | ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 9.2 | - | 48.8 |
52
+ | InternLM-7B | 51.0 | 52.8 | 31.2 | 10.4 | 14.8 | - |
53
+ | Baichuan-13B | 51.6 | 53.6 | 26.6 | 12.8 | 30.0 | 55.8 |
54
+ | LLaMA-13B | 46.9 | 35.5 | 17.8 | 15.8 | 12.0 | - |
55
+ | LLaMA 2-13B | 54.8 | - | 28.7 | 18.3 | 24.2 | - |
56
+ | ChatGLM2-12B | 56.2 | **61.6** | 40.9 | - | - | - |
57
+ | **Qwen-7B** | **56.7** | 59.6 | **51.6** | **24.4** | **30.6** | **58.8** |
58
+
59
+ <p align="center">
60
+ <img src="assets/performance.png" width="1000"/>
61
+ <p>
62
+ <br>
63
+
64
+ Additionally, according to the third-party evaluation of large language models conducted by [OpenCompass](https://opencompass.org.cn/leaderboard-llm), Qwen-7B and Qwen-7B-Chat are the top 7B-parameter models. This evaluation consists of a large number of public benchmarks for the evaluation of language understanding and generation, coding, mathematics, reasoning, etc.
65
+
66
+ For more experimental results (detailed model performance on more benchmark datasets) and details, please refer to our technical memo by clicking [here](tech_memo.md).
67
+ <br>
68
+
69
+ ## Requirements
70
+
71
+ * python 3.8 and above
72
+ * pytorch 1.12 and above, 2.0 and above are recommended
73
+ * CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)
74
+ <br>
75
+
76
+ ## Quickstart
77
+
78
+ Below, we provide simple examples to show how to use Qwen-7B with 🤖 ModelScope and 🤗 Transformers.
79
+
80
+ Before running the code, make sure you have set up the environment and installed the required packages. Make sure you meet the above requirements, then install the dependent libraries.
81
+
82
+ ```bash
83
+ pip install -r requirements.txt
84
+ ```
85
+
86
+ If your device supports fp16 or bf16, we recommend installing [flash-attention](https://github.com/Dao-AILab/flash-attention) for higher efficiency and lower memory usage. (**flash-attention is optional and the project can run normally without installing it**)
87
+
88
+ ```bash
89
+ git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
90
+ cd flash-attention && pip install .
91
+ # Below are optional. Installing them might be slow.
92
+ # pip install csrc/layer_norm
93
+ # pip install csrc/rotary
94
+ ```
95
+
96
+ Now you can start with ModelScope or Transformers.
97
+
98
+ #### 🤗 Transformers
99
+
100
+ To use Qwen-7B-Chat for inference, all you need to do is input a few lines of code as demonstrated below. However, **please make sure that you are using the latest code.**
101
+
102
+ ```python
103
+ from transformers import AutoModelForCausalLM, AutoTokenizer
104
+ from transformers.generation import GenerationConfig
105
+
106
+ # Note: The default behavior now has injection attack prevention off.
107
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
108
+
109
+ # use bf16
110
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
111
+ # use fp16
112
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
113
+ # use cpu only
114
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
115
+ # use auto mode, automatically select precision based on the device.
116
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
117
+
118
+ # Specify hyperparameters for generation
119
+ model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
120
+
121
+ # 第一轮对话 1st dialogue turn
122
+ response, history = model.chat(tokenizer, "你好", history=None)
123
+ print(response)
124
+ # 你好!很高兴为你提供帮助。
125
+
126
+ # 第二轮对话 2nd dialogue turn
127
+ response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
128
+ print(response)
129
+ # 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
130
+ # 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
131
+ # 为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。
132
+ # 毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。
133
+ # 最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。
134
+ # 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。
135
+
136
+ # 第三轮对话 3rd dialogue turn
137
+ response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
138
+ print(response)
139
+ # 《奋斗创业:一个年轻人的成功之路》
140
+ ```
141
+
142
+ Running Qwen-7B pretrained base model is also simple.
143
+
144
+ <details>
145
+ <summary>Running Qwen-7B</summary>
146
+
147
+ ```python
148
+ from transformers import AutoModelForCausalLM, AutoTokenizer
149
+ from transformers.generation import GenerationConfig
150
+
151
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
152
+ # use bf16
153
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
154
+ # use fp16
155
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
156
+ # use cpu only
157
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
158
+ # use auto mode, automatically select precision based on the device.
159
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval()
160
+
161
+ # Specify hyperparameters for generation
162
+ model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
163
+
164
+ inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt')
165
+ inputs = inputs.to(model.device)
166
+ pred = model.generate(**inputs)
167
+ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
168
+ # 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...
169
+ ```
170
+
171
+ </details>
172
+
173
+ #### 🤖 ModelScope
174
+
175
+ ModelScope is an opensource platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model service to AI developers. Similarly, you can run the models with ModelScope as shown below:
176
+
177
+ ```python
178
+ import os
179
+ from modelscope.pipelines import pipeline
180
+ from modelscope.utils.constant import Tasks
181
+ from modelscope import snapshot_download
182
+
183
+ model_id = 'QWen/qwen-7b-chat'
184
+ revision = 'v1.0.0'
185
+
186
+ model_dir = snapshot_download(model_id, revision)
187
+
188
+ pipe = pipeline(
189
+ task=Tasks.chat, model=model_dir, device_map='auto')
190
+ history = None
191
+
192
+ text = '浙江的省会在哪里?'
193
+ results = pipe(text, history=history)
194
+ response, history = results['response'], results['history']
195
+ print(f'Response: {response}')
196
+ text = '它有什么好玩的地方呢?'
197
+ results = pipe(text, history=history)
198
+ response, history = results['response'], results['history']
199
+ print(f'Response: {response}')
200
+ ```
201
+
202
+ <br>
203
+
204
+ ## Tokenizer
205
+
206
+ Our tokenizer based on tiktoken is different from other tokenizers, e.g., sentencepiece tokenizer. You need to pay attention to special tokens, especially in finetuning. For more detailed information on the tokenizer and related use in fine-tuning, please refer to the [documentation](tokenization_note.md).
207
+ <br>
208
+
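For orientation, the snippet below shows one way to inspect the tiktoken-based tokenizer before finetuning; it is an illustrative sketch rather than part of the committed README, and it assumes the tokenizer exposes `eod_id` for `<|endoftext|>` as described in FAQ.md.

```python
from transformers import AutoTokenizer

# The tokenizer ships as remote code because it is tiktoken-based, not sentencepiece-based.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# <|endoftext|> doubles as separator and padding token; its id is exposed as eod_id.
print(tokenizer.eod_id)

# Ordinary text round-trips through the 150K+-entry BPE vocabulary.
ids = tokenizer.encode("Qwen-7B uses a tiktoken-based tokenizer.")
print(ids)
print(tokenizer.decode(ids))
```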
209
+ ## Quantization
210
+
211
+ ### Usage
212
+
213
+ **Note: we provide a new solution based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), and release an Int4 quantized model for Qwen-7B-Chat [Click here](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4), which achieves nearly lossless model effects but improved performance on both memory costs and inference speed, in comparison with the previous solution.**
214
+
215
+ Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of AutoGPTQ and install it from source (the code needed for Qwen is temporarily not yet included in the latest PyPI release):
216
+
217
+ ```bash
218
+ git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
219
+ pip install .
220
+ ```
221
+
222
+ Then you can load the quantized model easily as shown below:
223
+
224
+ ```python
225
+ from auto_gptq import AutoGPTQForCausalLM
226
+ model = AutoGPTQForCausalLM.from_quantized("Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True, use_safetensors=True).eval()
227
+ ```
228
+
229
+ To run inference, it is similar to the basic usage demonstrated above, but remember to pass in the generation configuration explicitly:
230
+
231
+ ```python
232
+ from transformers import GenerationConfig
233
+ config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
234
+ response, history = model.chat(tokenizer, "Hi", history=None, generation_config=config)
235
+ ```
236
+
237
+ ### Performance
238
+
239
+ We illustrate the model performance of both BF16 and Int4 models on the benchmark, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:
240
+
241
+ | Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
242
+ | -------------- | :----: | :-----------: | :-----: | :---------: |
243
+ | BF16 | 53.9 | 54.2 | 41.1 | 24.4 |
244
+ | Int4 | 52.6 | 52.9 | 38.1 | 23.8 |
245
+
246
+ ### Inference Speed
247
+
248
+ We measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens under BF16 precision and Int4 quantization, respectively.
249
+
250
+ | Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
251
+ | -------------- | :-------------------: | :-------------------: |
252
+ | BF16 | 30.34 | 29.32 |
253
+ | Int4 | 43.56 | 33.92 |
254
+
255
+ In detail, the setting of profiling is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the generated 8192 tokens.
256
+
257
+ ### GPU Memory Usage
258
+
259
+ We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating single token) and generating 8192 tokens (with single token as context) under BF16 or Int4 quantization level, respectively. The results are shown below.
260
+
261
+ | Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
262
+ | -------------- | :-----------------------------------: | :-------------------------------------: |
263
+ | BF16 | 17.66GB | 22.58GB |
264
+ | Int4 | 8.21GB | 13.62GB |
265
+
266
+ The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
267
+ <br>
268
+
269
+ ## Demo
270
+
271
+ ### Web UI
272
+
273
+ We provide code for users to build a web UI demo (thanks to @wysaid). Before you start, make sure you install the following packages:
274
+
275
+ ```
276
+ pip install -r requirements_web_demo.txt
277
+ ```
278
+
279
+ Then run the command below and click on the generated link:
280
+
281
+ ```
282
+ python web_demo.py
283
+ ```
284
+
285
+ <p align="center">
286
+ <br>
287
+ <img src="assets/web_demo.gif" width="600" />
288
+ <br>
289
+ <p>
290
+
291
+ ### CLI Demo
292
+
293
+ We provide a CLI demo example in `cli_demo.py`, which supports streaming output for the generation. Users can interact with Qwen-7B-Chat by entering prompts, and the model returns outputs in streaming mode. Run the command below:
294
+
295
+ ```
296
+ python cli_demo.py
297
+ ```
298
+
299
+ <p align="center">
300
+ <br>
301
+ <img src="assets/cli_demo.gif" width="600" />
302
+ <br>
303
+ <p>
304
+
305
+ ## API
306
+
307
+ We provide methods to deploy a local API based on the OpenAI API (thanks to @hanpenggit). Before you start, install the required packages:
308
+
309
+ ```bash
310
+ pip install fastapi uvicorn openai pydantic sse_starlette
311
+ ```
312
+
313
+ Then run the command to deploy your API:
314
+
315
+ ```bash
316
+ python openai_api.py
317
+ ```
318
+
319
+ You can change your arguments, e.g., `-c` for checkpoint name or path, `--cpu-only` for CPU deployment, etc. If you meet problems launching your API deployment, updating the packages to the latest version can probably solve them.
320
+
321
+ Using the API is also simple. See the example below:
322
+
323
+ ```python
324
+ import openai
325
+ openai.api_base = "http://localhost:8000/v1"
326
+ openai.api_key = "none"
327
+
328
+ # create a request activating streaming response
329
+ for chunk in openai.ChatCompletion.create(
330
+ model="Qwen",
331
+ messages=[
332
+ {"role": "user", "content": "你好"}
333
+ ],
334
+ stream=True
335
+ # Specifying stop words in streaming output format is not yet supported and is under development.
336
+ ):
337
+ if hasattr(chunk.choices[0].delta, "content"):
338
+ print(chunk.choices[0].delta.content, end="", flush=True)
339
+
340
+ # create a request not activating streaming response
341
+ response = openai.ChatCompletion.create(
342
+ model="Qwen",
343
+ messages=[
344
+ {"role": "user", "content": "你好"}
345
+ ],
346
+ stream=False,
347
+ stop=[] # You can add custom stop words here, e.g., stop=["Observation:"] for ReAct prompting.
348
+ )
349
+ print(response.choices[0].message.content)
350
+ ```
351
+
352
+ <p align="center">
353
+ <br>
354
+ <img src="assets/openai_api.gif" width="600" />
355
+ <br>
356
+ <p>
357
+
358
+ Function calling is also supported (but only when `stream=False` for the moment). See the [example usage](examples/function_call_examples.py) here.
359
+
360
+ ## Deployment
361
+
362
+ It is simple to run the model on CPU; you only need to specify the device:
363
+
364
+ ```python
365
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
366
+ ```
367
+
368
+ If you suffer from lack of GPU memory and you would like to run the model on more than 1 GPU, you can use our provided script `utils.py`:
369
+
370
+ ```python
371
+ from utils import load_model_on_gpus
372
+ model = load_model_on_gpus('Qwen/Qwen-7B-Chat', num_gpus=2)
373
+ ```
374
+
375
+ Then you can run the 7B chat model on 2 GPUs using the above scripts.
376
+ <br>
377
+
378
+ ## Tool Usage
379
+
380
+ Qwen-7B-Chat is specifically optimized for tool usage, including APIs, databases, models, etc., so that users can build their own Qwen-7B-based LangChain, Agent, and Code Interpreter. In our evaluation [benchmark](eval/EVALUATION.md) for assessing tool usage capabilities, we find that Qwen-7B reaches stable performance.
381
+
382
+ | Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
383
+ |:-----------------| :-----------------------: | :----------------------: | :----------------------: |
384
+ | GPT-4 | 95% | **0.90** | 15% |
385
+ | GPT-3.5 | 85% | 0.88 | 75% |
386
+ | **Qwen-7B-Chat** | **99%** | 0.89 | **9.7%** |
387
+
388
+ For how to write and use prompts for ReAct Prompting, please refer to [the ReAct examples](examples/react_prompt.md). The use of tools can enable the model to better perform tasks.
389
+
390
+ Additionally, we provide experimental results to show its capabilities of playing as an agent. See [Hugging Face Agent](https://huggingface.co/docs/transformers/transformers_agents) for more information. Its performance on the run-mode benchmark provided by Hugging Face is as follows:
391
+
392
+ | Model | Tool Selection↑ | Tool Used↑ | Code↑ |
393
+ |:-----------------| :----------------: | :-----------: | :---------: |
394
+ | GPT-4 | **100** | **100** | **97.41** |
395
+ | GPT-3.5 | 95.37 | 96.30 | 87.04 |
396
+ | StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
397
+ | **Qwen-7B-Chat** | 90.74 | 92.59 | 74.07 |
398
+
399
+ <br>
400
+
401
+ ## Long-Context Understanding
402
+
403
+ To extend the context length and break the bottleneck of training sequence length, we introduce several techniques, including NTK-aware interpolation, window attention, and LogN attention scaling, to extend the context length to over 8K tokens. We conduct language modeling experiments on the arXiv dataset with the PPL evaluation and find that Qwen-7B can reach outstanding performance in the scenario of long context. Results are demonstrated below:
404
+
405
+ <table>
406
+ <tr>
407
+ <th rowspan="2">Model</th><th colspan="5" align="center">Sequence Length</th>
408
+ </tr>
409
+ <tr>
410
+ <th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th><th align="center">16384</th>
411
+ </tr>
412
+ <tr>
413
+ <td>Qwen-7B</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center">39.35</td><td align="center">469.81</td><td align="center">2645.09</td>
414
+ </tr>
415
+ <tr>
416
+ <td>+ dynamic_ntk</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center">3.59</td><td align="center">3.66</td><td align="center">5.71</td>
417
+ </tr>
418
+ <tr>
419
+ <td>+ dynamic_ntk + logn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center"><b>3.58</b></td><td align="center">3.56</td><td align="center">4.62</td>
420
+ </tr>
421
+ <tr>
422
+ <td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center"><b>3.58</b></td><td align="center"><b>3.49</b></td><td align="center"><b>4.32</b></td>
423
+ </tr>
424
+ </table>
425
+
426
+ <br><br>
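As a quick sanity check that the long-context techniques are active on a downloaded checkpoint, one can read the flags named in FAQ.md (`use_dynamic_ntk`, `use_logn_attn`) from the model config. The sketch below is illustrative and not part of the committed README; the attribute names are taken from the FAQ and should be treated as assumptions for other checkpoints.

```python
from transformers import AutoConfig

# The checkpoint's config class is defined in remote code, hence trust_remote_code=True.
config = AutoConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# Both flags should be true by default according to FAQ.md; long-context quality depends on them.
print("use_dynamic_ntk:", getattr(config, "use_dynamic_ntk", None))
print("use_logn_attn:", getattr(config, "use_logn_attn", None))
```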
427
+
428
+ ## Reproduction
429
+
430
+ To reproduce the model performance on benchmark datasets, we provide scripts to do so. Check [eval/EVALUATION.md](eval/EVALUATION.md) for more information. Note that the reproduction may lead to slight differences from our reported results.
431
+
432
+ <br>
433
+
434
+ ## FAQ
435
+
436
+ If you meet problems, please refer to the [FAQ](FAQ.md) and existing issues first to search for a solution before you open a new issue.
437
+
438
+ <br>
439
+
440
+ ## License Agreement
441
+
442
+ Researchers and developers are free to use the codes and model weights of both Qwen-7B and Qwen-7B-Chat. We also allow their commercial use. Check our license at [LICENSE](LICENSE) for more details. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.
443
+
444
+ <br>
445
+
446
+ ## Contact Us
447
+
448
+ If you would like to leave a message for either our research team or product team, feel free to send an email to qianwen_opensource@alibabacloud.com.
449
 
 
README_CN.md ADDED
@@ -0,0 +1,451 @@
1
+ <p align="left">
2
+ 中文</a>&nbsp | &nbsp<a href="README.md">English</a>&nbsp | &nbsp<a href="README_JA.md">日本語</a>
3
+ </p>
4
+ <br><br>
5
+
6
+ <p align="center">
7
+ <img src="assets/logo.jpg" width="400"/>
8
+ <p>
9
+ <br>
10
+
11
+ <p align="center">
12
+ Qwen-7B <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a>&nbsp | Qwen-7B-Chat <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a>&nbsp | Qwen-7B-Chat-Int4 <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int4">🤗</a>
13
+ <br>
14
+ <a href="assets/wechat.png">WeChat</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>&nbsp | &nbsp<a href="https://github.com/QwenLM/Qwen-7B/blob/main/tech_memo.md">Report</a>
15
+ </p>
16
+ <br><br>
17
+
18
+ 我们在🤖 **ModelScope**以及🤗 **Hugging Face**均开源了**Qwen-7B**系列模型。请在本文档顶部点击相关链接查看仓库信息。本仓库主要包括Qwen-7B的简介、使用指南、技术备忘等内容。想了解更多关于模型的信息,请点击[链接](tech_memo.md)查看我们的技术备忘录。
19
+
20
+ 通义千问-7B(Qwen-7B) 是阿里云研发的通义千问大模型系列的70亿参数规模的模型。Qwen-7B是基于Transformer的大语言模型, 在超大规模的预训练数据上进行训练得到。预训练数据类型多样,覆盖广泛,包括大量网络文本、专业书籍、代码等。同时,在Qwen-7B的基础上,我们使用对齐机制打造了基于大语言模型的AI助手Qwen-7B-Chat。Qwen-7B系列模型的特点包括:
21
+
22
+ 1. **大规模高质量预训练数据**:我们使用了超过2.2万亿token的自建大规模预训练数据集进行语言模型的预训练。数据集包括文本和代码等多种数据类型,覆盖通用领域和专业领域。
23
+ 2. **优秀的模型性能**:相比同规模的开源模型,Qwen-7B在多个评测数据集上具有显著优势,甚至超出12-13B等更大规模的模型。评测评估的能力范围包括自然语言理解与生成、数学运算解题、代码生成等。
24
+ 3. **更好地支持多语言**:基于更大词表的分词器在分词上更高效,同时它对其他语言表现更加友好。用户可以在Qwen-7B的基础上更方便地训练特定语言的7B语言模型。
25
+ 4. **8K的上下文长度**:Qwen-7B及Qwen-7B-Chat均能支持8K的上下文长度, 允许用户输入更长的prompt。
26
+ 5. **支持插件调用**:Qwen-7B-Chat针对插件调用相关的对齐数据做了特定优化,当前模型能有效调用插件以及升级为Agent。
27
+
28
+ 以下章节的信息可能对你有帮助,建议阅读。如果你在使用过程遇到问题,建议先查询FAQ,如仍无法解决再提交issue。
29
+
30
+ <br>
31
+
32
+ ## 新闻
33
+
34
+ * 2023年8月21日 发布Qwen-7B-Chat的Int4量化模型,Qwen-7B-Chat-Int4。该模型显存占用低,推理速度相比半精度模型显著提升,在基准评测上效果损失较小。
35
+ * 2023年8月3日 在魔搭社区(ModelScope)和Hugging Face同步推出Qwen-7B和Qwen-7B-Chat模型。同时,我们发布了技术备忘录,介绍了相关的训练细节和模型表现。
36
+ <br>
37
+
38
+ ## 评测表现
39
+
40
+ Qwen-7B在多个全面评估自然语言理解与生成、数学运算解题、代码生成等能力的评测数据集上,包括MMLU、C-Eval、GSM8K、HumanEval、WMT22、CMMLU等,均超出了同规模大语言模型的表现,甚至超出了如12-13B参数等更大规模的语言模型。
41
+
42
+ | Model | MMLU | C-Eval | GSM8K | HumanEval | WMT22 (en-zh) | CMMLU |
43
+ | :---------------- | :------------: | :------------: | :------------: | :------------: | :------------: |:------------: |
44
+ | LLaMA-7B | 35.1 | - | 11.0 | 10.5 | 8.7 | - |
45
+ | LLaMA 2-7B | 45.3 | - | 14.6 | 12.8 | 17.9 | - |
46
+ | Baichuan-7B | 42.3 | 42.8 | 9.7 | 9.2 | 26.6 | 44.4 |
47
+ | ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 9.2 | - | 48.8 |
48
+ | InternLM-7B | 51.0 | 52.8 | 31.2 | 10.4 | 14.8 | - |
49
+ | Baichuan-13B | 51.6 | 53.6 | 26.6 | 12.8 | 30.0 | 55.8 |
50
+ | LLaMA-13B | 46.9 | 35.5 | 17.8 | 15.8 | 12.0 | - |
51
+ | LLaMA 2-13B | 54.8 | - | 28.7 | 18.3 | 24.2 | - |
52
+ | ChatGLM2-12B | 56.2 | **61.6** | 40.9 | - | - | - |
53
+ | **Qwen-7B** | **56.7** | 59.6 | **51.6** | **24.4** | **30.6** | **58.8** |
54
+
55
+ <p align="center">
56
+ <img src="assets/performance.png" width="1000"/>
57
+ <p>
58
+ <br>
59
+
60
+ 此外,根据[OpenCompass](https://opencompass.org.cn/leaderboard-llm)进行的大型语言模型第三方评估,Qwen-7B 和 Qwen-7B-Chat 是其中表现最优的7B参数模型。该评估由大量公开基准组成,用于评估语言理解和生成、代码生成、数学、推理等。
61
+
62
+ 更多的实验结果和细节请查看我们的技术备忘录。点击[这里](tech_memo.md)。
63
+
64
+ <br>
65
+
66
+ ## 要求
67
+
68
+ * python 3.8及以上版本
69
+ * pytorch 1.12及以上版本,推荐2.0及以上版本
70
+ * 建议使用CUDA 11.4及以上(GPU用户、flash-attention用户等需考虑此选项)
71
+ <br>
72
+
73
+ ## 快速使用
74
+
75
+ 我们提供简单的示例来说明如何利用🤖 ModelScope和🤗 Transformers快速使用Qwen-7B和Qwen-7B-Chat。
76
+
77
+ 在开始前,请确保你已经配置好环境并安装好相关的代码包。最重要的是,确保你满足上述要求,然后安装相关的依赖库。
78
+
79
+ ```bash
80
+ pip install -r requirements.txt
81
+ ```
82
+
83
+ 如果你的显卡支持fp16或bf16精度,我们还推荐安装[flash-attention](https://github.com/Dao-AILab/flash-attention)来提高你的运行效率以及降低显存占用。(**flash-attention只是可选项,不安装也可正常运行该项目**)
84
+
85
+ ```bash
86
+ git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
87
+ cd flash-attention && pip install .
88
+ # 下方安装可选,安装可能比较缓慢。
89
+ # pip install csrc/layer_norm
90
+ # pip install csrc/rotary
91
+ ```
92
+
93
+ 接下来你可以开始使用Transformers或者ModelScope来使用我们的模型。
94
+
95
+ #### 🤗 Transformers
96
+
97
+ 如希望使用Qwen-7B-chat进行推理,所需要写的只是如下所示的数行代码。**请确保你使用的是最新代码。**
98
+
99
+ ```python
100
+ from transformers import AutoModelForCausalLM, AutoTokenizer
101
+ from transformers.generation import GenerationConfig
102
+
103
+ # 请注意:分词器默认行为已更改为默认关闭特殊token攻击防护。
104
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
105
+
106
+ # 打开bf16精度,A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存
107
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
108
+ # 打开fp16精度,V100、P100、T4等显卡建议启用以节省显存
109
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
110
+ # 使用CPU进行推理,需要约32GB内存
111
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
112
+ # 默认使用自动模式,根据设备自动选择精度
113
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
114
+
115
+ # 可指定不同的生成长度、top_p等相关超参
116
+ model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
117
+
118
+ # 第一轮对话 1st dialogue turn
119
+ response, history = model.chat(tokenizer, "你好", history=None)
120
+ print(response)
121
+ # 你好!很高兴为你提供帮助。
122
+
123
+ # 第二轮对话 2nd dialogue turn
124
+ response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
125
+ print(response)
126
+ # 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
127
+ # 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
128
+ # 为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。
129
+ # 毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。
130
+ # 最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。
131
+ # 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。
132
+
133
+ # 第三轮对话 3rd dialogue turn
134
+ response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
135
+ print(response)
136
+ # 《奋斗创业:一个年轻人的成功之路》
137
+ ```
138
+
139
+ 运行Qwen-7B同样非常简单。
140
+
141
+ <details>
142
+ <summary>运行Qwen-7B</summary>
143
+
144
+ ```python
145
+ from transformers import AutoModelForCausalLM, AutoTokenizer
146
+ from transformers.generation import GenerationConfig
147
+
148
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
149
+
150
+ # 打开bf16精度,A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存
151
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
152
+ # 打开fp16精度,V100、P100、T4等显卡建议启用以节省显存
153
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
154
+ # 使用CPU进行推理,需要约32GB内存
155
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
156
+ # 默认使用自动模式,根据设备自动选择精度
157
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval()
158
+
159
+ # 可指定不同的生成长度、top_p等相关超参
160
+ model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
161
+
162
+ inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt')
163
+ inputs = inputs.to(model.device)
164
+ pred = model.generate(**inputs)
165
+ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
166
+ # 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...
167
+ ```
168
+
169
+ </details>
170
+
171
+ #### 🤖 ModelScope
172
+
173
+ 魔搭(ModelScope)是开源的模型即服务共享平台,为泛AI开发者提供灵活、易用、低成本的一站式模型服务产品。使用ModelScope同样非常简单,代码如下所示:
174
+
175
+ ```python
176
+ import os
177
+ from modelscope.pipelines import pipeline
178
+ from modelscope.utils.constant import Tasks
179
+ from modelscope import snapshot_download
180
+
181
+ model_id = 'QWen/qwen-7b-chat'
182
+ revision = 'v1.0.0'
183
+
184
+ model_dir = snapshot_download(model_id, revision)
185
+
186
+ pipe = pipeline(
187
+ task=Tasks.chat, model=model_dir, device_map='auto')
188
+ history = None
189
+
190
+ text = '浙江的省会在哪里?'
191
+ results = pipe(text, history=history)
192
+ response, history = results['response'], results['history']
193
+ print(f'Response: {response}')
194
+ text = '它有什么好玩的地方呢?'
195
+ results = pipe(text, history=history)
196
+ response, history = results['response'], results['history']
197
+ print(f'Response: {response}')
198
+ ```
199
+
200
+ <br>
201
+
202
+ ## Tokenization
203
+
204
+ > 注:作为术语的“tokenization”在中文中尚无共识的概念对应,本文档采用英文表达以利说明。
205
+
206
+ 基于tiktoken的tokenizer有别于其他分词器,比如sentencepiece tokenizer。尤其在微调阶段,需要特别注意特殊token的使用。关于tokenizer的更多信息,以及微调时涉及的相关使用,请参阅[文档](tokenization_note_zh.md)。
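+ 
+ 下面给出一个最简示例(仅为示意:假设直接通过Hugging Face的`AutoTokenizer`加载`Qwen/Qwen-7B-Chat`分词器;特殊token的具体处理请以上述文档为准):
+ 
+ ```python
+ from transformers import AutoTokenizer
+ 
+ # 仅为示意:加载基于tiktoken的Qwen分词器,查看编码与解码结果
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
+ 
+ ids = tokenizer.encode("你好,Qwen!")
+ print(ids)                    # token id 列表
+ print(tokenizer.decode(ids))  # 解码应还原原始文本
+ ```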
207
+ <br>
208
+
209
+ ## 量化
210
+
211
+ ### 用法
212
+
213
+ **请注意:我们更新量化方案为基于[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)的量化,提供Qwen-7B-Chat的Int4量化模型[点击这里](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)。相比此前方案,该方案在模型评测效果几乎无损,且存储需求更低,推理速度更优。**
214
+
215
+ 以下我们提供示例说明如何使用Int4量化模型。在开始使用前,请先保证满足AutoGPTQ的要求,并使用源代码安装(由于最新支持Qwen的代码未发布到PyPI):
216
+
217
+ ```bash
218
+ git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
219
+ pip install .
220
+ ```
221
+
222
+ 随后便能轻松读取量化模型:
223
+
224
+ ```python
225
+ from auto_gptq import AutoGPTQForCausalLM
226
+ model = AutoGPTQForCausalLM.from_quantized("Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True, use_safetensors=True).eval()
227
+ ```
228
+
229
+ 推理方法和基础用法类似,但注意需要从外部传入generation config:
230
+
231
+ ```python
232
+ from transformers import GenerationConfig
233
+ config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
234
+ response, history = model.chat(tokenizer, "Hi", history=None, generation_config=config)
235
+ ```
236
+
237
+ ### 效果评测
238
+
239
+ 我们对BF16和Int4模型在基准评测上做了测试,发现量化模型效果损失较小,结果如下所示:
240
+
241
+ | Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
242
+ | ------------- | :--------: | :----------: | :----: | :--------: |
243
+ | BF16 | 53.9 | 54.2 | 41.1 | 24.4 |
244
+ | Int4 | 52.6 | 52.9 | 38.1 | 23.8 |
245
+
246
+ ### 推理速度
247
+
248
+ 我们测算了BF16和Int4模型生成2048和8192个token的平均推理速度(tokens/s)。如图所示:
249
+
250
+ | Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
251
+ | ------------- | :------------------:| :------------------:|
252
+ | BF16 | 30.34 | 29.32 |
253
+ | Int4 | 43.56 | 33.92 |
254
+
255
+ 具体而言,我们在上下文长度为1的条件下记录生成8192个token的性能。评测运行于单张A100-SXM4-80G GPU,使用PyTorch 2.0.1和CUDA 11.4。推理速度是生成8192个token的平均速度。
256
+
257
+ ### 显存使用
258
+
259
+ 我们还测算了BF16和Int4模型编码2048个token及生成8192个token的峰值显存占用情况。结果如下所示:
260
+
261
+ | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
262
+ | ------------------ | :---------------------------------: | :-----------------------------------: |
263
+ | BF16 | 17.66GB | 22.58GB |
264
+ | Int4 | 8.21GB | 13.62GB |
265
+
266
+ 上述性能测算使用[此脚本](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)完成。
267
+ <br>
268
+
269
+ ## Demo
270
+
271
+ ### Web UI
272
+
273
+ 我们提供了Web UI的demo供用户使用 (感谢 @wysaid 支持)。在开始前,确保已经安装如下代码库:
274
+
275
+ ```
276
+ pip install -r requirements_web_demo.txt
277
+ ```
278
+
279
+ 随后运行如下命令,并点击生成链接:
280
+
281
+ ```
282
+ python web_demo.py
283
+ ```
284
+
285
+ <p align="center">
286
+ <br>
287
+ <img src="assets/web_demo.gif" width="600" />
288
+ <br>
289
+ <p>
290
+
291
+ ### 交互式Demo
292
+
293
+ 我们提供了一个简单的交互式Demo示例,请查看`cli_demo.py`。当前模型已经支持流式输出,用户可通过输入文字的方式和Qwen-7B-Chat交互,模型将流式输出返回结果。运行如下命令:
294
+
295
+ ```
296
+ python cli_demo.py
297
+ ```
298
+
299
+ <p align="center">
300
+ <br>
301
+ <img src="assets/cli_demo.gif" width="600" />
302
+ <br>
303
+ <p>
304
+
305
+ ## API
306
+
307
+ 我们提供了OpenAI API格式的本地API部署方法(感谢@hanpenggit)。在开始之前先安装必要的代码库:
308
+
309
+ ```bash
310
+ pip install fastapi uvicorn openai pydantic sse_starlette
311
+ ```
312
+
313
+ 随后即可运行以下命令部署你的本地API:
314
+
315
+ ```bash
316
+ python openai_api.py
317
+ ```
318
+
319
+ 你也可以修改参数,比如`-c`来修改模型名称或路径, `--cpu-only`改为CPU部署等等。如果部署出现问题,更新上述代码库往往可以解决大多数问题。
320
+
321
+ 使用API同样非常简单,示例如下:
322
+
323
+ ```python
324
+ import openai
325
+ openai.api_base = "http://localhost:8000/v1"
326
+ openai.api_key = "none"
327
+
328
+ # 使用流式回复的请求
329
+ for chunk in openai.ChatCompletion.create(
330
+ model="Qwen",
331
+ messages=[
332
+ {"role": "user", "content": "你好"}
333
+ ],
334
+ stream=True
335
+ # 流式输出的自定义stopwords功能尚未支持,正在开发中
336
+ ):
337
+ if hasattr(chunk.choices[0].delta, "content"):
338
+ print(chunk.choices[0].delta.content, end="", flush=True)
339
+
340
+ # 不使用流式回复的请求
341
+ response = openai.ChatCompletion.create(
342
+ model="Qwen",
343
+ messages=[
344
+ {"role": "user", "content": "你好"}
345
+ ],
346
+ stream=False,
347
+ stop=[] # 在此处添加自定义的stop words 例如ReAct prompting时需要增加: stop=["Observation:"]。
348
+ )
349
+ print(response.choices[0].message.content)
350
+ ```
351
+
352
+ <p align="center">
353
+ <br>
354
+ <img src="assets/openai_api.gif" width="600" />
355
+ <br>
356
+ <p>
357
+
358
+ 该接口也支持函数调用(Function Calling),但暂时仅限 `stream=False` 时能生效。用法见[函数调用示例](examples/function_call_examples.py)。
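+ 
+ 下面给出一个简化的函数调用请求示意(其中的函数名称与参数均为假设,`functions`字段的具体格式请以[函数调用示例](examples/function_call_examples.py)为准):
+ 
+ ```python
+ import openai
+ openai.api_base = "http://localhost:8000/v1"
+ openai.api_key = "none"
+ 
+ # 仅为示意:函数定义中的名称与参数均为假设内容
+ functions = [{
+     "name": "get_current_weather",
+     "description": "查询指定城市当前天气",
+     "parameters": {
+         "type": "object",
+         "properties": {"location": {"type": "string", "description": "城市名,例如:北京"}},
+         "required": ["location"],
+     },
+ }]
+ 
+ response = openai.ChatCompletion.create(
+     model="Qwen",
+     messages=[{"role": "user", "content": "北京现在天气怎么样?"}],
+     functions=functions,
+     stream=False,  # 函数调用暂仅支持非流式
+ )
+ print(response.choices[0].message)
+ ```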
359
+
360
+ ## 部署
361
+
362
+ 在CPU上运行非常简单,使用方法如下所示:
363
+
364
+ ```python
365
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
366
+ ```
367
+
368
+ 如果你遇到显存不足的问题而希望使用多张GPU进行推理,可以使用提供的脚本`utils.py`:
369
+
370
+ ```python
371
+ from utils import load_model_on_gpus
372
+ model = load_model_on_gpus('Qwen/Qwen-7B-Chat', num_gpus=2)
373
+ ```
374
+
375
+ 你即可使用2张GPU进行推理。
376
+ <br>
377
+
378
+ ## 工具调用
379
+
380
+ Qwen-7B-Chat针对包括API、数据库、模型等工具在内的调用进行了优化。用户可以基于Qwen-7B开发LangChain应用、Agent甚至Code Interpreter。我们在开源的[评测数据集](eval/EVALUATION.md)上测试了模型的工具调用能力,发现Qwen-7B-Chat能够取得稳定的表现。
381
+
382
+ | Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
383
+ |:-----------------|:----------------------:|:----------------------:|:----------------------:|
384
+ | GPT-4 | 95% | **0.90** | 15% |
385
+ | GPT-3.5 | 85% | 0.88 | 75% |
386
+ | **Qwen-7B-Chat** | **99%** | 0.89 | **9.7%** |
387
+
388
+ 我们提供了文档说明如何根据ReAct Prompting的原则写作你的prompt。
389
+
390
+ 具体的写法和使用方法,请参考[ReAct示例](examples/react_prompt.md)。
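+ 
+ 作为补充,下面给出一个极简的ReAct提示词拼接示意(其中的工具名称与描述均为假设,实际模板、工具格式与停止词请以[ReAct示例](examples/react_prompt.md)为准):
+ 
+ ```python
+ # 仅为示意:按ReAct风格拼接提示词,工具定义为假设内容
+ TOOL_DESC = 'search: 调用搜索引擎查询信息。参数: {"query": "搜索关键词"}'
+ 
+ def build_react_prompt(question: str) -> str:
+     return (
+         "Answer the following questions as best you can. You have access to the following tools:\n\n"
+         f"{TOOL_DESC}\n\n"
+         "Use the following format:\n"
+         "Question: the input question\n"
+         "Thought: you should always think about what to do\n"
+         "Action: the action to take\n"
+         "Action Input: the input to the action\n"
+         "Observation: the result of the action\n"
+         "... (Thought/Action/Action Input/Observation 可以重复若干次)\n"
+         "Thought: I now know the final answer\n"
+         "Final Answer: the final answer\n\n"
+         f"Question: {question}"
+     )
+ 
+ # 生成时建议配合 stop=["Observation:"] 使用,由外部执行工具后将结果拼接回 Observation 再继续生成。
+ print(build_react_prompt("今天杭州天气如何?"))
+ ```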
391
+
392
+ 此外,我们还提供了实验结果表明我们的模型扮演Agent的能力。请阅读相关文档[链接](https://huggingface.co/docs/transformers/transformers_agents)了解更多信息。模型在Hugging Face提供的评测数据集上表现如下:
393
+
394
+ | Model | Tool Selection↑ | Tool Used↑ | Code↑ |
395
+ |:-----------------|:---------------:|:-----------:|:---------:|
396
+ | GPT-4 | **100** | **100** | **97.41** |
397
+ | GPT-3.5 | 95.37 | 96.30 | 87.04 |
398
+ | StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
399
+ | **Qwen-7B-Chat** | 90.74 | 92.59 | 74.07 |
400
+
401
+ <br>
402
+
403
+ ## 长文本理解
404
+
405
+ 我们引入了NTK插值、窗口注意力、LogN注意力缩放等技术来提升模型的上下文长度并突破训练序列长度的限制。我们的模型已经突破8K的序列长度。通过arXiv数据集上的语言模型实验,我们发现Qwen-7B能够在长序列的设置下取得不错的表现。
406
+
407
+ <table>
408
+ <tr>
409
+ <th rowspan="2">Model</th><th colspan="5" align="center">Sequence Length</th>
410
+ </tr>
411
+ <tr>
412
+ <th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th><th align="center">16384</th>
413
+ </tr>
414
+ <tr>
415
+ <td>Qwen-7B</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center">39.35</td><td align="center">469.81</td><td align="center">2645.09</td>
416
+ </tr>
417
+ <tr>
418
+ <td>+ dynamic_ntk</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center">3.59</td><td align="center">3.66</td><td align="center">5.71</td>
419
+ </tr>
420
+ <tr>
421
+ <td>+ dynamic_ntk + logn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center"><b>3.58</b></td><td align="center">3.56</td><td align="center">4.62</td>
422
+ </tr>
423
+ <tr>
424
+ <td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center"><b>3.58</b></td><td align="center"><b>3.49</b></td><td align="center"><b>4.32</b></td>
425
+ </tr>
426
+ </table>
427
+
428
+ <br>
429
+
430
+ ## 复现
431
+
432
+ 我们提供了评测脚本以供复现我们的实验结果。注意,由于内部代码和开源代码存在少许差异,评测结果可能与汇报结果存在细微差异。请阅读[eval/EVALUATION.md](eval/EVALUATION.md)了解更多信息。
433
+
434
+ <br>
435
+
436
+ ## FAQ
437
+
438
+ 如遇到问题,敬请查阅[FAQ](FAQ_zh.md)以及issue区,如仍无法解决再提交issue。
439
+
440
+ <br>
441
+
442
+ ## 使用协议
443
+
444
+ 研究人员与开发者可自由使用Qwen-7B和Qwen-7B-Chat的代码和模型权重,并可进行二次开发。我们同样允许商业使用,具体细节请查看[LICENSE](LICENSE)。如需商用,请填写[问卷](https://dashscope.console.aliyun.com/openModelApply/qianwen)申请。
445
+
446
+ <br>
447
+
448
+ ## 联系我们
449
+
450
+ 如果你想给我们的研发团队和产品团队留言,请通过邮件(qianwen_opensource@alibabacloud.com)联系我们。
451
+
README_JA.md ADDED
@@ -0,0 +1,448 @@
1
+ <p align="left">
2
+ <a href="README_CN.md">中文</a>&nbsp | &nbsp<a href="README.md">English</a>&nbsp | &nbsp日本語
3
+ </p>
4
+ <br><br>
5
+
6
+ <p align="center">
7
+ <img src="assets/logo.jpg" width="400"/>
8
+ <p>
9
+ <br>
10
+
11
+ <p align="center">
12
+ Qwen-7B <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a>&nbsp | Qwen-7B-Chat <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖</a> | <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a>&nbsp | Qwen-7B-Chat-Int4 <a href="https://huggingface.co/Qwen/Qwen-7B-Chat-Int4">🤗</a>
13
+ <br>
14
+ <a href="assets/wechat.png">WeChat</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>&nbsp | &nbsp<a href="https://github.com/QwenLM/Qwen-7B/blob/main/tech_memo.md">Report</a>
15
+ </p>
16
+ <br>
17
+
18
+ <p align="left">
19
+ 日本語ドキュメントメンテナー: <a href="https://github.com/eltociear">Ikko Eltociear Ashimine</a> & Junyang Lin
20
+ </p>
21
+ <br>
22
+
23
+ 私たちは、**Qwen-7B** と **Qwen-7B-Chat** を **🤖 ModelScope** と **🤗 Hugging Face** の両方でオープンソース化しています(上部のロゴをクリックすると、コードとチェックポイントのあるリポジトリに移動します)。このレポには、Qwen-7B の簡単な紹介と、使い方の手引き、さらに詳しい情報を提供する技術メモ [link](tech_memo.md) が含まれています。
24
+
25
+ Qwen-7B は、アリババクラウドが提唱する大規模言語モデルシリーズ Qwen(略称:Tongyi Qianwen)の7Bパラメータ版になります。Qwen-7B は Transformer ベースの大規模言語モデルであり、ウェブテキスト、書籍、コードなどを含む大量のデータで事前学習されています。さらに、事前学習された Qwen-7B をベースに、アライメント技術で学習された大規模モデルベースの AI アシスタントである Qwen-7B-Chat をリリースします。Qwen-7B シリーズの特徴は以下の通りです:
26
+
27
+ 1. **高品質な事前トレーニングデータでトレーニング**。Qwen-7B は 2.2 兆以上のトークンを含む大規模で高品質なデータセットに対して事前学習を行っています。このデータセットには平文とコードが含まれ、一般的なドメインデータと専門的なドメインデータを含む幅広いドメインをカバーしています。
28
+ 2. **強いパフォーマンス**。自然言語理解、数学、コーディングなどを評価する一連のベンチマークデータセットにおいて、同程度のモデルサイズのモデルと比較して、競合他社を凌駕しています。
29
+ 3. **言語サポートの向上**。Qwen-7B のトークナイザは、15 万以上のトークンの語彙をベースにしており、他のトークナイザに比べて効率的です。多くの言語に対応しており、ユーザが特定の言語を理解するために Qwen-7B をさらにファインチューニングするのに役立ちます。
30
+ 4. **8K コンテキスト長をサポート**。Qwen-7B と Qwen-7B-Chat はともに 8K のコンテキスト長をサポートしており、長いコンテキストでの入力を可能にしている。
31
+ 5. **プラグインのサポート**。Qwen-7B-Chat は、プラグイン関連のアライメントデータでトレーニングされているため、API、モデル、データベースなどのツールを使用することができ、エージェントとしてプレイすることができる。
32
+
33
+ 以下のセクションには、参考になる情報が記載されています。特に、issue を立ち上げる前に FAQ セクションをお読みになることをお勧めします。
34
+ <br>
35
+
36
+ ## ニュースとアップデート
37
+
38
+ * 2023.8.21 Qwen-7B-Chat 用 Int4 量子化モデル **Qwen-7B-Chat-Int4** をリリースしました。また、ベンチマーク評価においても大きな性能低下は見られませんでした。
39
+ * 2023.8.3 ModelScope と Hugging Face 上で **Qwen-7B** と **Qwen-7B-Chat** をリリースしました。また、トレーニングの詳細やモデルの性能など、モデルの詳細については技術メモを提供しています。
40
+
41
+ <br>
42
+
43
+ ## パフォーマンス
44
+
45
+ 一般的に、Qwen-7B は、MMLU、C-Eval、GSM8K、HumanEval、WMT22、CMMLU など、自然言語理解、数学的問題解決、コーディングなどに関するモデルの能力を評価する一連のベンチマークデータセットにおいて、同程度のモデルサイズのベースラインモデルを凌駕しており、さらには 13B 程度のパラメータを持つより大規模なモデルをも凌駕しています。以下の結果をご覧ください。
46
+
47
+ | Model | MMLU | C-Eval | GSM8K | HumanEval | WMT22 (en-zh) | CMMLU |
48
+ | :---------------- | :------------: | :------------: | :------------: | :------------: | :------------: |:------------: |
49
+ | LLaMA-7B | 35.1 | - | 11.0 | 10.5 | 8.7 | - |
50
+ | LLaMA 2-7B | 45.3 | - | 14.6 | 12.8 | 17.9 | - |
51
+ | Baichuan-7B | 42.3 | 42.8 | 9.7 | 9.2 | 26.6 | 44.4 |
52
+ | ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 9.2 | - | 48.8 |
53
+ | InternLM-7B | 51.0 | 52.8 | 31.2 | 10.4 | 14.8 | - |
54
+ | Baichuan-13B | 51.6 | 53.6 | 26.6 | 12.8 | 30.0 | 55.8 |
55
+ | LLaMA-13B | 46.9 | 35.5 | 17.8 | 15.8 | 12.0 | - |
56
+ | LLaMA 2-13B | 54.8 | - | 28.7 | 18.3 | 24.2 | - |
57
+ | ChatGLM2-12B | 56.2 | **61.6** | 40.9 | - | - | - |
58
+ | **Qwen-7B** | **56.7** | 59.6 | **51.6** | **24.4** | **30.6** | **58.8** |
59
+
60
+ <p align="center">
61
+ <img src="assets/performance.png" width="1000"/>
62
+ <p>
63
+ <br>
64
+
65
+ さらに、[OpenCompass](https://opencompass.org.cn/leaderboard-llm) が実施した大規模言語モデルの第三者評価によると、Qwen-7B と Qwen-7B-Chat は 7B パラメータモデルのトップになります。この評価は、言語理解・生成、コーディング、数学、推論などの評価のための大量の公開ベンチマークで構成されています。
66
+
67
+ より詳細な実験結果(より多くのベンチマークデータセットでの詳細なモデル性能)や詳細については、[こちら](tech_memo.md)をクリックして技術メモを参照してください。
68
+ <br>
69
+
70
+ ## 必要条件
71
+
72
+ * python 3.8 以上
73
+ * pytorch 1.12 以上、2.0 以上を推奨
74
+ * CUDA 11.4 以上を推奨(GPU ユーザー、フラッシュアテンションユーザー向けなど)
75
+ <br>
76
+
77
+ ## クイックスタート
78
+
79
+ 以下では、Qwen-7B と 🤖 ModelScope と 🤗 Transformers の簡単な使用例を示します。
80
+
81
+ コードを実行する前に、環境のセットアップと必要なパッケージのインストールが済んでいることを確認してください。上記の要件を満たしていることを確認してから、依存するライブラリをインストールしてください。
82
+
83
+ ```bash
84
+ pip install -r requirements.txt
85
+ ```
86
+
87
+ お使いのデバイスが fp16 または bf16 をサポートしている場合、[flash-attention](https://github.com/Dao-AILab/flash-attention) をインストールすることで、より高い効率とメモリ使用量を抑えることができます。(**flash-attention はオプションであり、インストールしなくてもプロジェクトは正常に実行できます**)
88
+
89
+ ```bash
90
+ git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
91
+ cd flash-attention && pip install .
92
+ # 以下はオプションです。インストールに時間がかかる場合があります。
93
+ # pip install csrc/layer_norm
94
+ # pip install csrc/rotary
95
+ ```
96
+
97
+ これで ModelScope か Transformers で始めることができます。
98
+
99
+ #### 🤗 Transformers
100
+
101
+ Qwen-7B-Chat を推論に使用するには、以下のように数行のコードを入力するだけです。**最新のコードを使用していることを確認してください。**
102
+
103
+ ```python
104
+ from transformers import AutoModelForCausalLM, AutoTokenizer
105
+ from transformers.generation import GenerationConfig
106
+
107
+ # 注: デフォルトの動作では、インジェクション攻撃防止機能がオフになっています。
108
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
109
+
110
+ # bf16 を使用
111
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
112
+ # fp16 を使用
113
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
114
+ # CPU のみ使用
115
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
116
+ # オートモードを使用すると、デバイスに応じて自動的に精度が選択されます。
117
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
118
+
119
+ # 生成のためのハイパーパラメータを指定
120
+ model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
121
+
122
+ # 第一轮对话 第一回対話ターン
123
+ response, history = model.chat(tokenizer, "你好", history=None)
124
+ print(response)
125
+ # こんにちは! お役に立ててうれしいです。
126
+
127
+ # 第二轮对话 第二回対話ターン
128
+ response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
129
+ print(response)
130
+ # これは、自分のビジネスを始めようと奮闘し、やがて成功する若者の物語である。
131
+ # この物語の主人公は、平凡な家庭に生まれ、平凡な労働者である両親を持つ李明である。 李明は子供の頃から起業家として成功することを目標としていた。
132
+ # この目標を達成するため、李明は猛勉強して大学に入った。 大学時代には、さまざまな起業家コンテストに積極的に参加し、多くの賞を獲得した。 また、余暇を利用してインターンシップにも参加し、貴重な経験を積んだ。
133
+ # 卒業後、李明は起業を決意した。 投資先を探し始めたが、何度も断られた。 しかし、彼はあきらめなかった。 彼は懸命に働き続け、ビジネスプランを改善し、新たな投資機会を探した。
134
+ # やがて李明は投資を受けることに成功し、自分のビジネスを始めた。 彼は新しいタイプのソフトウェアの開発に焦点を当てたテクノロジー会社を設立した。 彼のリーダーシップの下、会社は急速に成長し、テクノロジー企業として成功を収めた。
135
+ # 李明の成功は偶然ではない。 彼は勤勉で、たくましく、冒険好きで、常に学び、自分を高めている。 彼の成功はまた、努力すれば誰でも成功できることを証明している。
136
+
137
+ # 第三轮对话 第三回対話ターン
138
+ response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
139
+ print(response)
140
+ # 《起業への奮闘:ある若者の成功への道》
141
+ ```
142
+
143
+ Qwen-7B の学習済みベースモデルの実行も簡単です。
144
+
145
+ <details>
146
+ <summary>Qwen-7B の実行</summary>
147
+
148
+ ```python
149
+ from transformers import AutoModelForCausalLM, AutoTokenizer
150
+ from transformers.generation import GenerationConfig
151
+
152
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
153
+ # bf16 を使用
154
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
155
+ # fp16 を使用
156
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
157
+ # CPU のみ使用
158
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
159
+ # オートモードを使用すると、デバイスに応じて自動的に精度が選択されます。
160
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval()
161
+
162
+ # 生成のためのハイパーパラメータを指定
163
+ model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
164
+
165
+ inputs = tokenizer('モンゴルの首都はウランバートル(Ulaanbaatar)\nアイスランドの首都はレイキャビク(Reykjavik)\nエチオピアの首都は', return_tensors='pt')
166
+ inputs = inputs.to(model.device)
167
+ pred = model.generate(**inputs)
168
+ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
169
+ # モンゴルの首都はウランバートル(Ulaanbaatar)\nアイスランドの首都はレイキャビク(Reykjavik)\nエチオピアの首都はアディスアベバ(Addis Ababa)...
170
+ ```
171
+
172
+ </details>
173
+
174
+ #### 🤖 ModelScope
175
+
176
+ ModelScope は、MaaS(Model-as-a-Service) のためのオープンソースプラットフォームであり、AI 開発者に柔軟で費用対効果の高いモデルサービスを提供します。同様に、以下のように ModelScope でモデルを実行することができます:
177
+
178
+ ```python
179
+ import os
180
+ from modelscope.pipelines import pipeline
181
+ from modelscope.utils.constant import Tasks
182
+ from modelscope import snapshot_download
183
+
184
+ model_id = 'QWen/qwen-7b-chat'
185
+ revision = 'v1.0.0'
186
+
187
+ model_dir = snapshot_download(model_id, revision)
188
+
189
+ pipe = pipeline(
190
+ task=Tasks.chat, model=model_dir, device_map='auto')
191
+ history = None
192
+
193
+ text = '浙江省の省都はどこですか?'
194
+ results = pipe(text, history=history)
195
+ response, history = results['response'], results['history']
196
+ print(f'Response: {response}')
197
+ text = '何がそんなに面白いのか?'
198
+ results = pipe(text, history=history)
199
+ response, history = results['response'], results['history']
200
+ print(f'Response: {response}')
201
+ ```
202
+
203
+ <br>
204
+
205
+ ## トークナイザー
206
+
207
+ tiktoken に基づくトークナイザーは、他のトークナイザー、例えばセンテンスピーストークナイザーとは異なります。特にファインチューニングの際には、特殊なトークンに注意を払う必要があります。トークナイザに関する詳細な情報や、ファインチューニングにおける使用方法については、[ドキュメント](tokenization_note_ja.md)を参照してください。
208
+ <br>
209
+
210
+ ## 量子化
211
+
212
+ ### 使用方法
213
+
214
+ **注: [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) に基づく新しい解決策を提供し、Qwen-7B-Chat 用の Int4 量子化モデル[ここをクリック](https://huggingface.co/Qwen/Qwen-7B-Chat-Int4)をリリースしました。このモデルは、従来の解決策と比較して、ほぼ無損失のモデル効果を達成しつつ、メモリコストと推論速度の両方で性能が向上しています。**
215
+
216
+ ここでは、量子化されたモデルを推論に使用する方法を示します。始める前に、AutoGPTQ の要件を満たしていることを確認し、ソースからインストールしてください(一時的に Qwen のコードは最新版の PyPI パッケージではまだリリースされていません):
217
+
218
+ ```bash
219
+ git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
220
+ pip install .
221
+ ```
222
+
223
+ そうすれば、以下のように簡単に量子化モデルを読み込むことができます:
224
+
225
+ ```python
226
+ from auto_gptq import AutoGPTQForCausalLM
227
+ model = AutoGPTQForCausalLM.from_quantized("Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True, use_safetensors=True).eval()
228
+ ```
229
+
230
+ 推論を実行するには、上で示した基本的な使い方に似ていますが、generation configuration を明示的に渡すことを忘れないで下さい:
231
+
232
+ ```python
233
+ from transformers import GenerationConfig
234
+ config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
235
+ response, history = model.chat(tokenizer, "Hi", history=None, generation_config=config)
236
+ ```
237
+
238
+ ### 性能
239
+
240
+ ベンチマークにおける BF16 モデルと Int4 モデルの性能について説明します。その結果は以下に示します:
241
+
242
+ | Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
243
+ | ------------- | :--------: | :----------: | :----: | :--------: |
244
+ | BF16 | 53.9 | 54.2 | 41.1 | 24.4 |
245
+ | Int4 | 52.6 | 52.9 | 38.1 | 23.8 |
246
+
247
+ ### 推論スピード
248
+
249
+ BF16 の精度と Int4 の量子化レベルの下で、それぞれ 2048 個と 8192 個のトークンを生成する平均推論速度(tokens/s)を測定しました。
250
+
251
+ | Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
252
+ | ------------- | :------------------:| :------------------:|
253
+ | BF16 | 30.34 | 29.32 |
254
+ | Int4 | 43.56 | 33.92 |
255
+
256
+ 詳細には、プロファイリングの設定は、1 コンテクストトークンで 8192 個の新しいトークンを生成しています。プロファイリングは、PyTorch 2.0.1 と CUDA 11.4 を搭載したシングル A100-SXM4-80G GPU で実行されました。推論速度は生成された 8192 個のトークンの平均値となります。
257
+
258
+ ### GPU メモリ使用量
259
+
260
+ また、BF16またはInt4の量子化レベルで、それぞれ2048トークンをコンテキストとしてエンコードした場合(および単一のトークンを生成した場合)と、8192トークンを生成した場合(単一のトークンをコンテキストとして生成した場合)のGPUメモリ使用量のピーク値をプロファイリングしました。その結果を以下に示します。
261
+
262
+ | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
263
+ | ------------------ | :---------------------------------: | :-----------------------------------: |
264
+ | BF16 | 17.66GB | 22.58GB |
265
+ | Int4 | 8.21GB | 13.62GB |
266
+
267
+ 上記のスピードとメモリーのプロファイリングは、[このスクリプト](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py)を使用しています。
268
+ <br>
269
+
270
+ ## デモ
271
+
272
+ ### ウェブ UI
273
+
274
+ ウェブ UI デモを構築するためのコードを提供します(@wysaid に感謝)。これを始める前に、以下のパッケージがインストールされていることを確認してください:
275
+
276
+ ```bash
277
+ pip install -r requirements_web_demo.txt
278
+ ```
279
+
280
+ そして、以下のコマンドを実行し、生成されたリンクをクリックします:
281
+
282
+ ```bash
283
+ python web_demo.py
284
+ ```
285
+
286
+ <p align="center">
287
+ <br>
288
+ <img src="assets/web_demo.gif" width="600" />
289
+ <br>
290
+ <p>
291
+
292
+ ### CLI デモ
293
+
294
+ `cli_demo.py` に CLI のデモ例を用意しています。ユーザはプロンプトを入力することで Qwen-7B-Chat と対話することができ、モデルはストリーミングモードでモデルの出力を返します。以下のコマンドを実行する:
295
+
296
+ ```
297
+ python cli_demo.py
298
+ ```
299
+
300
+ <p align="center">
301
+ <br>
302
+ <img src="assets/cli_demo.gif" width="600" />
303
+ <br>
304
+ <p>
305
+
306
+ ## API
307
+
308
+ OpenAI API をベースにローカルAPIをデプロイする方法を提供する(@hanpenggit に感謝)。始める前に、必要なパッケージをインストールしてください:
309
+
310
+ ```bash
311
+ pip install fastapi uvicorn openai pydantic sse_starlette
312
+ ```
313
+
314
+ それから、API をデプロイするコマンドを実行します:
315
+
316
+ ```bash
317
+ python openai_api.py
318
+ ```
319
+
320
+ チェックポイント名やパスには `-c`、CPU デプロイメントには `--cpu-only` など、引数を変更できます。API デプロイメントを起動する際に問題が発生した場合は、パッケージを最新バージョンに更新することで解決できる可能性があります。
321
+
322
+ API の使い方も簡単です。以下の例をご覧ください:
323
+
324
+ ```python
325
+ import openai
326
+ openai.api_base = "http://localhost:8000/v1"
327
+ openai.api_key = "none"
328
+
329
+ # ストリーミングレスポンスを有効化するリクエストを作成する
330
+ for chunk in openai.ChatCompletion.create(
331
+ model="Qwen",
332
+ messages=[
333
+ {"role": "user", "content": "你好"}
334
+ ],
335
+ stream=True
336
+ # ストリーミング出力形式でのストップワードの指定はまだサポートされておらず、開発中です。
337
+ ):
338
+ if hasattr(chunk.choices[0].delta, "content"):
339
+ print(chunk.choices[0].delta.content, end="", flush=True)
340
+
341
+ # ストリーミングレスポンスを有効化しないリクエストを作成する
342
+ response = openai.ChatCompletion.create(
343
+ model="Qwen",
344
+ messages=[
345
+ {"role": "user", "content": "你好"}
346
+ ],
347
+ stream=False,
348
+ stop=[] # 例えば、stop=["Observation:"] (ReAct プロンプトの場合)。
349
+ )
350
+ print(response.choices[0].message.content)
351
+ ```
352
+
353
+ <p align="center">
354
+ <br>
355
+ <img src="assets/openai_api.gif" width="600" />
356
+ <br>
357
+ <p>
358
+
359
+ ## デプロイ
360
+
361
+ CPU 上でモデルを実行するのは簡単であり、以下のようにデバイスを指定する必要があります:
362
+
363
+ ```python
364
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
365
+ ```
366
+
367
+ メモリ不足に悩まされ、複数の GPU にモデルをデプロイしたい場合は、`utils.py` で提供されているスクリプトを使うことができます:
368
+
369
+ ```python
370
+ from utils import load_model_on_gpus
371
+ model = load_model_on_gpus('Qwen/Qwen-7B-Chat', num_gpus=2)
372
+ ```
373
+
374
+ 7B チャットモデルの推論を 2GPU で実行できます。
375
+ <br>
376
+
377
+ ## ツールの使用
378
+
379
+ Qwen-7B-Chat は、API、データベース、モデルなど、ツールの利用に特化して最適化されており、ユーザは独自の Qwen-7B ベースの LangChain、エージェント、コードインタプリタを構築することができます。ツール利用能力を評価するための評価[ベンチマーク](eval/EVALUATION.md)では、Qwen-7B は安定した性能に達しています。
380
+
381
+ | Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
382
+ |:-----------------|:----------------------:|:----------------------:|:----------------------:|
383
+ | GPT-4 | 95% | **0.90** | 15% |
384
+ | GPT-3.5 | 85% | 0.88 | 75% |
385
+ | **Qwen-7B-Chat** | **99%** | 0.89 | **9.7%** |
386
+
387
+ ReAct プロンプトの書き方や使い方については、[ReAct の例](examples/react_prompt.md)を参照してください。ツールを使用することで、モデルがよりよいタスクを実行できるようになります。
388
+
389
+ さらに、エージェントとしての能力を示す実験結果を提供する。詳細は [Hugging Face Agent](https://huggingface.co/docs/transformers/transformers_agents) を参照して下さい。Hugging Face が提供するランモードベンチマークでの性能は以下の通りです:
390
+
391
+ | Model | Tool Selection↑ | Tool Used↑ | Code↑ |
392
+ |:-----------------|:---------------:|:-----------:|:---------:|
393
+ | GPT-4 | **100** | **100** | **97.41** |
394
+ | GPT-3.5 | 95.37 | 96.30 | 87.04 |
395
+ | StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
396
+ | **Qwen-7B-Chat** | 90.74 | 92.59 | 74.07 |
397
+
398
+ <br>
399
+
400
+ ## 長い文脈の理解
401
+
402
+ コンテキストの長さを拡張し、訓練シーケンスの長さのボトルネックを解消するために、NTK を考慮した補間、ウィンドウアテンション、LogN アテンションスケーリングなどの技術を導入し、コンテキストの長さを 8K トークン以上に拡張する。arXiv データセットを用いて PPL 評価による言語モデリング実験を行い、Qwen-7B が長いコンテキストのシナリオにおいて卓越した性能を達成できることを見出した。以下に結果を示します:
403
+
404
+ <table>
405
+ <tr>
406
+ <th rowspan="2">Model</th><th colspan="5" align="center">Sequence Length</th>
407
+ </tr>
408
+ <tr>
409
+ <th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th><th align="center">16384</th>
410
+ </tr>
411
+ <tr>
412
+ <td>Qwen-7B</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center">39.35</td><td align="center">469.81</td><td align="center">2645.09</td>
413
+ </tr>
414
+ <tr>
415
+ <td>+ dynamic_ntk</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center">3.59</td><td align="center">3.66</td><td align="center">5.71</td>
416
+ </tr>
417
+ <tr>
418
+ <td>+ dynamic_ntk + logn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center"><b>3.58</b></td><td align="center">3.56</td><td align="center">4.62</td>
419
+ </tr>
420
+ <tr>
421
+ <td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center"><b>3.58</b></td><td align="center"><b>3.49</b></td><td align="center"><b>4.32</b></td>
422
+ </tr>
423
+ </table>
424
+
425
+ <br><br>
426
+
427
+ ## 再現
428
+
429
+ ベンチマークデータセットでのモデル性能の再現のために、結果を再現するスクリプトを提供しています。詳しくは [eval/EVALUATION.md](eval/EVALUATION.md) を確認してください。なお、再現の結果、我々の報告結果と若干異なる場合があります。
430
+
431
+ <br>
432
+
433
+ ## FAQ
434
+
435
+ 問題が発生した場合は、まずは [FAQ](FAQ_ja.md) や issue を参照し、新しい issue を立ち上げる前に解決策を探してください。
436
+
437
+ <br>
438
+
439
+ ## ライセンス契約
440
+
441
+ Qwen-7B と Qwen-7B-Chat のコードとモデルウェイトは、研究者や開発者が自由に使用することができます。また、商用利用も可能です。詳しくは [LICENSE](LICENSE) をご覧ください。商用利用を希望される方は、[リクエストフォーム](https://dashscope.console.aliyun.com/openModelApply/qianwen)に必要事項をご記入の上、お申し込みください。
442
+
443
+ <br>
444
+
445
+ ## お問い合わせ
446
+
447
+ 研究チームまたは製品チームへのメッセージは、qianwen_opensource@alibabacloud.com までお気軽にお送りください。
448
+
assets/cli_demo.gif ADDED
assets/hfagent_chat_1.png ADDED

Git LFS Details

  • SHA256: 356ea19c2c4a656cae9d55e2d727d1651d1955ec67385615c6582b394478e889
  • Pointer size: 132 Bytes
  • Size of remote file: 1.71 MB
assets/hfagent_chat_2.png ADDED

Git LFS Details

  • SHA256: 7db53a1a77dfc19072ce418db6df56fd89f9e7cb2e30430ac8320f10fc8a8bc0
  • Pointer size: 132 Bytes
  • Size of remote file: 1.93 MB
assets/hfagent_run.png ADDED

Git LFS Details

  • SHA256: fbf4c1232c86e334b5425aacdcc9e7a878100f80d6d70725060cb312bae7d701
  • Pointer size: 132 Bytes
  • Size of remote file: 2.77 MB
assets/logo.jpg ADDED
assets/openai_api.gif ADDED
assets/performance.png ADDED
assets/qwen_tokenizer.png ADDED
assets/react_showcase_001.png ADDED
assets/react_showcase_002.png ADDED
assets/react_tutorial_001.png ADDED
assets/react_tutorial_002.png ADDED
assets/tokenizer.pdf ADDED
Binary file (24.7 kB). View file
 
assets/tokenizer.png ADDED
assets/wanx_colorful_black.png ADDED

Git LFS Details

  • SHA256: 650a5431b1a3b4411fc4c2fd44dea3066a4ec67b03b684721086265698d738c4
  • Pointer size: 132 Bytes
  • Size of remote file: 1.33 MB
assets/web_demo.gif ADDED
assets/wechat.png ADDED
cli_demo.py ADDED
@@ -0,0 +1,207 @@
1
+ # Copyright (c) Alibaba Cloud.
2
+ #
3
+ # This source code is licensed under the license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ """A simple command-line interactive chat demo."""
7
+
8
+ import argparse
9
+ import os
10
+ import platform
11
+ import shutil
12
+ from copy import deepcopy
13
+
14
+ from transformers import AutoModelForCausalLM, AutoTokenizer
15
+ from transformers.generation import GenerationConfig
16
+ from transformers.trainer_utils import set_seed
17
+
18
+ DEFAULT_CKPT_PATH = 'QWen/QWen-7B-Chat'
19
+
20
+ _WELCOME_MSG = '''\
21
+ Welcome to use Qwen-7B-Chat model, type text to start chat, type :h to show command help
22
+ 欢迎使用 Qwen-7B 模型,输入内容即可进行对话,:h 显示命令帮助
23
+ '''
24
+ _HELP_MSG = '''\
25
+ Commands:
26
+ :help / :h Show this help message 显示帮助信息
27
+ :exit / :quit / :q Exit the demo 退出Demo
28
+ :clear / :cl Clear screen 清屏
29
+ :clear-history / :clh Clear history 清除对话历史
30
+ :history / :his Show history 显示对话历史
31
+ :seed Show current random seed 显示当前随机种子
32
+ :seed <N> Set random seed to <N> 设置随机种子
33
+ :conf Show current generation config 显示生成配置
34
+ :conf <key>=<value> Change generation config 修改生成配置
35
+ :reset-conf Reset generation config 重置生成配置
36
+ '''
37
+
38
+
39
+ def _load_model_tokenizer(args):
40
+ tokenizer = AutoTokenizer.from_pretrained(
41
+ args.checkpoint_path, trust_remote_code=True, resume_download=True,
42
+ )
43
+
44
+ if args.cpu_only:
45
+ device_map = "cpu"
46
+ else:
47
+ device_map = "auto"
48
+
49
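+ # If the checkpoint ships a quantize_config.json (e.g. an Int4 GPTQ model), load it with AutoGPTQ instead of the regular HF loader.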
+ qconfig_path = os.path.join(args.checkpoint_path, 'quantize_config.json')
50
+ if os.path.exists(qconfig_path):
51
+ from auto_gptq import AutoGPTQForCausalLM
52
+ model = AutoGPTQForCausalLM.from_quantized(
53
+ args.checkpoint_path,
54
+ device_map=device_map,
55
+ trust_remote_code=True,
56
+ resume_download=True,
57
+ use_safetensors=True,
58
+ ).eval()
59
+ else:
60
+ model = AutoModelForCausalLM.from_pretrained(
61
+ args.checkpoint_path,
62
+ device_map=device_map,
63
+ trust_remote_code=True,
64
+ resume_download=True,
65
+ ).eval()
66
+
67
+ config = GenerationConfig.from_pretrained(
68
+ args.checkpoint_path, trust_remote_code=True, resume_download=True,
69
+ )
70
+
71
+ return model, tokenizer, config
72
+
73
+
74
+ def _clear_screen():
75
+ if platform.system() == "Windows":
76
+ os.system("cls")
77
+ else:
78
+ os.system("clear")
79
+
80
+
81
+ def _print_history(history):
82
+ terminal_width = shutil.get_terminal_size()[0]
83
+ print(f'History ({len(history)})'.center(terminal_width, '='))
84
+ for index, (query, response) in enumerate(history):
85
+ print(f'User[{index}]: {query}')
86
+ print(f'QWen[{index}]: {response}')
87
+ print('=' * terminal_width)
88
+
89
+
90
+ def _get_input() -> str:
91
+ while True:
92
+ try:
93
+ message = input('User> ').strip()
94
+ except UnicodeDecodeError:
95
+ print('[ERROR] Encoding error in input')
96
+ continue
97
+ except KeyboardInterrupt:
98
+ exit(1)
99
+ if message:
100
+ return message
101
+ print('[ERROR] Query is empty')
102
+
103
+
104
+ def main():
105
+ parser = argparse.ArgumentParser(
106
+ description='QWen-7B-Chat command-line interactive chat demo.')
107
+ parser.add_argument("-c", "--checkpoint-path", type=str, default=DEFAULT_CKPT_PATH,
108
+ help="Checkpoint name or path, default to %(default)r")
109
+ parser.add_argument("-s", "--seed", type=int, default=1234, help="Random seed")
110
+ parser.add_argument("--cpu-only", action="store_true", help="Run demo with CPU only")
111
+ args = parser.parse_args()
112
+
113
+ history, response = [], ''
114
+
115
+ model, tokenizer, config = _load_model_tokenizer(args)
116
+ orig_gen_config = deepcopy(model.generation_config)
117
+
118
+ _clear_screen()
119
+ print(_WELCOME_MSG)
120
+
121
+ seed = args.seed
122
+
123
+ while True:
124
+ query = _get_input()
125
+
126
+ # Process commands.
127
+ if query.startswith(':'):
128
+ command_words = query[1:].strip().split()
129
+ if not command_words:
130
+ command = ''
131
+ else:
132
+ command = command_words[0]
133
+
134
+ if command in ['exit', 'quit', 'q']:
135
+ break
136
+ elif command in ['clear', 'cl']:
137
+ _clear_screen()
138
+ print(_WELCOME_MSG)
139
+ continue
140
+ elif command in ['clear-history', 'clh']:
141
+ print(f'[INFO] All {len(history)} history cleared')
142
+ history.clear()
143
+ continue
144
+ elif command in ['help', 'h']:
145
+ print(_HELP_MSG)
146
+ continue
147
+ elif command in ['history', 'his']:
148
+ _print_history(history)
149
+ continue
150
+ elif command in ['seed']:
151
+ if len(command_words) == 1:
152
+ print(f'[INFO] Current random seed: {seed}')
153
+ continue
154
+ else:
155
+ new_seed_s = command_words[1]
156
+ try:
157
+ new_seed = int(new_seed_s)
158
+ except ValueError:
159
+ print(f'[WARNING] Fail to change random seed: {new_seed_s!r} is not a valid number')
160
+ else:
161
+ print(f'[INFO] Random seed changed to {new_seed}')
162
+ seed = new_seed
163
+ continue
164
+ elif command in ['conf']:
165
+ if len(command_words) == 1:
166
+ print(model.generation_config)
167
+ else:
168
+ for key_value_pairs_str in command_words[1:]:
169
+ eq_idx = key_value_pairs_str.find('=')
170
+ if eq_idx == -1:
171
+ print('[WARNING] format: <key>=<value>')
172
+ continue
173
+ conf_key, conf_value_str = key_value_pairs_str[:eq_idx], key_value_pairs_str[eq_idx + 1:]
174
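+ # The value string is parsed with eval so numbers/booleans/lists work, e.g. ":conf top_p=0.8 do_sample=True".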
+ try:
175
+ conf_value = eval(conf_value_str)
176
+ except Exception as e:
177
+ print(e)
178
+ continue
179
+ else:
180
+ print(f'[INFO] Change config: model.generation_config.{conf_key} = {conf_value}')
181
+ setattr(model.generation_config, conf_key, conf_value)
182
+ continue
183
+ elif command in ['reset-conf']:
184
+ print('[INFO] Reset generation config')
185
+ model.generation_config = deepcopy(orig_gen_config)
186
+ print(model.generation_config)
187
+ continue
188
+ else:
189
+ # As normal query.
190
+ pass
191
+
192
+ # Run chat.
193
+ set_seed(seed)
194
+ try:
195
+ for response in model.chat_stream(tokenizer, query, history=history, generation_config=config):
196
+ _clear_screen()
197
+ print(f"\nUser: {query}")
198
+ print(f"\nQwen-7B: {response}")
199
+ except KeyboardInterrupt:
200
+ print('[WARNING] Generation interrupted')
201
+ continue
202
+
203
+ history.append((query, response))
204
+
205
+
206
+ if __name__ == "__main__":
207
+ main()
eval/EVALUATION.md ADDED
@@ -0,0 +1,96 @@
1
+ ## 评测复现
2
+
3
+ - CEVAL
4
+
5
+ ```Shell
6
+ wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
7
+ mkdir data/ceval
8
+ mv ceval-exam.zip data/ceval
9
+ cd data/ceval; unzip ceval-exam.zip
10
+ cd ../../
11
+
12
+ # Qwen-7B
13
+ python evaluate_ceval.py -d data/ceval/
14
+
15
+ # Qwen-7B-Chat
16
+ pip install thefuzz
17
+ python evaluate_chat_ceval.py -d data/ceval/
18
+ ```
19
+
20
+ - MMLU
21
+
22
+ ```Shell
23
+ wget https://people.eecs.berkeley.edu/~hendrycks/data.tar
24
+ mkdir data/mmlu
25
+ mv data.tar data/mmlu
26
+ cd data/mmlu; tar xf data.tar
27
+ cd ../../
28
+
29
+ # Qwen-7B
30
+ python evaluate_mmlu.py -d data/mmlu/data/
31
+
32
+ # Qwen-7B-Chat
33
+ pip install thefuzz
34
+ python evaluate_chat_mmlu.py -d data/mmlu/data/
35
+ ```
36
+
37
+ - CMMLU
38
+
39
+ ```Shell
40
+ wget https://huggingface.co/datasets/haonan-li/cmmlu/resolve/main/cmmlu_v1_0_1.zip
41
+ mkdir data/cmmlu
42
+ mv cmmlu_v1_0_1.zip data/cmmlu
43
+ cd data/cmmlu; unzip cmmlu_v1_0_1.zip
44
+ cd ../../
45
+
46
+ # Qwen-7B
47
+ python evaluate_cmmlu.py -d data/cmmlu/
48
+ ```
49
+
50
+ - HumanEval
51
+
52
+ Get the HumanEval.jsonl file from [here](https://github.com/openai/human-eval/tree/master/data)
53
+
54
+ ```Shell
55
+ git clone https://github.com/openai/human-eval
56
+ pip install -e human-eval
57
+
58
+ # Qwen-7B
59
+ python evaluate_humaneval.py -f HumanEval.jsonl -o HumanEval_res.jsonl
60
+ evaluate_functional_correctness HumanEval_res.jsonl
61
+ # Qwen-7B-Chat
62
+ python evaluate_chat_mmlu.py -f HumanEval.jsonl -o HumanEval_res_chat.jsonl
63
+ evaluate_functional_correctness HumanEval_res_chat.jsonl
64
+ ```
65
+
66
+ When installing package human-eval, please note its following disclaimer:
67
+
68
+ This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.
69
+
70
+ - GSM8K
71
+
72
+ ```Shell
73
+ # Qwen-7B
74
+ python evaluate_gsm8k.py
75
+
76
+ # Qwen-7B-Chat
77
+ python evaluate_chat_gsm8k.py # zeroshot
78
+ python evaluate_chat_gsm8k.py --use-fewshot # fewshot
79
+ ```
80
+
81
+ - PLUGIN
82
+
83
+ This script is used to reproduce the results of the ReAct and Hugging Face Agent in the Tool Usage section of the README document.
84
+
85
+ ```Shell
86
+ # Qwen-7B-Chat
87
+ mkdir data;
88
+ cd data;
89
+ wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/opensource_data/exam_plugin_v1/exam_plugin_v1_react_positive.jsonl;
90
+ wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/opensource_data/exam_plugin_v1/exam_plugin_v1_react_negative.jsonl;
91
+ cd ..;
92
+ pip install json5;
93
+ pip install jsonlines;
94
+ pip install rouge_score;
95
+ python evaluate_plugin.py --eval-react-positive --eval-react-negative --eval-hfagent
96
+ ```
eval/evaluate_ceval.py ADDED
@@ -0,0 +1,432 @@
1
+ import os
2
+ from typing import List
3
+ import argparse
4
+ import torch
5
+ import pandas as pd
6
+ import numpy as np
7
+ from tqdm import tqdm
8
+ from transformers.trainer_utils import set_seed
9
+ from transformers import AutoModelForCausalLM, AutoTokenizer
10
+ from transformers.generation import GenerationConfig
11
+
12
+ '''
13
+ wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
14
+ mkdir data/ceval
15
+ mv ceval-exam.zip data/ceval
16
+ cd data/ceval; unzip ceval-exam.zip
17
+ cd ../../
18
+ python evaluate_ceval.py -d data/ceval/
19
+ '''
20
+
21
+ def load_models_tokenizer(args):
22
+ tokenizer = AutoTokenizer.from_pretrained(
23
+ args.checkpoint_path, trust_remote_code=True
24
+ )
25
+ model = AutoModelForCausalLM.from_pretrained(
26
+ args.checkpoint_path, device_map="auto", trust_remote_code=True
27
+ ).eval()
28
+ model.generation_config = GenerationConfig.from_pretrained(
29
+ args.checkpoint_path, trust_remote_code=True
30
+ )
31
+ return model, tokenizer
32
+
33
+
34
+ def format_example(line, include_answer=True):
35
+ example = "问题:" + line["question"]
36
+ for choice in choices:
37
+ example += f'\n{choice}. {line[f"{choice}"]}'
38
+
39
+ if include_answer:
40
+ example += "\n答案:" + line["answer"] + "\n\n"
41
+ else:
42
+ example += "\n答案:"
43
+ return example
44
+
45
+
46
+ def generate_few_shot_prompt(k, subject, dev_df):
47
+ prompt = ""
48
+ if k == -1:
49
+ k = dev_df.shape[0]
50
+ for i in range(k):
51
+ prompt += format_example(
52
+ dev_df.iloc[i, :],
53
+ include_answer=True,
54
+ )
55
+ return prompt
56
+
57
+
58
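+ # Return the softmax distribution over the next token for each prompt; used below to score choices A/B/C/D.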
+ def get_logits(tokenizer, model, inputs: List[str]):
59
+ input_ids = tokenizer(inputs, padding=False)["input_ids"]
60
+ input_ids = torch.tensor(input_ids, device=model.device)
61
+ tokens = {"input_ids": input_ids}
62
+
63
+ outputs = model(input_ids)["logits"]
64
+ logits = outputs[:, -1, :]
65
+ log_probs = torch.nn.functional.softmax(logits, dim=-1)
66
+ return log_probs, {"tokens": tokens}
67
+
68
+
69
+ @torch.no_grad()
70
+ def eval_subject(
71
+ model,
72
+ tokenizer,
73
+ subject_name,
74
+ test_df,
75
+ k=5,
76
+ dev_df=None,
77
+ few_shot=False,
78
+ save_result_dir=None,
79
+ **kwargs,
80
+ ):
81
+ result = []
82
+ score = []
83
+
84
+ few_shot_prompt = (
85
+ generate_few_shot_prompt(k, subject_name, dev_df) if few_shot else ""
86
+ )
87
+ all_probs = {"prob_A": [], "prob_B": [], "prob_C": [], "prob_D": []}
88
+ if args.debug:
89
+ print(f"few_shot_prompt: {few_shot_prompt}")
90
+
91
+ for _, row in tqdm(test_df.iterrows(), total=len(test_df)):
92
+ question = format_example(row, include_answer=False)
93
+ full_prompt = few_shot_prompt + question
94
+
95
+ output, input_info = get_logits(tokenizer, model, [full_prompt])
96
+ assert output.shape[0] == 1
97
+ logits = output.flatten()
98
+
99
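+ # Compare the next-token probabilities of the four answer letters and take the argmax as the prediction.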
+ softval = torch.nn.functional.softmax(
100
+ torch.tensor(
101
+ [
102
+ logits[tokenizer("A")["input_ids"]],
103
+ logits[tokenizer("B")["input_ids"]],
104
+ logits[tokenizer("C")["input_ids"]],
105
+ logits[tokenizer("D")["input_ids"]],
106
+ ]
107
+ ),
108
+ dim=0,
109
+ )
110
+ if softval.dtype in {torch.bfloat16, torch.float16}:
111
+ softval = softval.to(dtype=torch.float32)
112
+ probs = softval.detach().cpu().numpy()
113
+
114
+ for i, choice in enumerate(choices):
115
+ all_probs[f"prob_{choice}"].append(probs[i])
116
+ pred = {0: "A", 1: "B", 2: "C", 3: "D"}[np.argmax(probs)]
117
+
118
+ if "answer" in row:
119
+ correct = 1 if pred == row["answer"] else 0
120
+ score.append(correct)
121
+ if args.debug:
122
+ print(f'{question} pred: {pred} ref: {row["answer"]}')
123
+ result.append(pred)
124
+
125
+ if score:
126
+ correct_ratio = 100 * sum(score) / len(score)
127
+ if args.debug:
128
+ print(subject_name, correct_ratio)
129
+ else:
130
+ correct_ratio = 0
131
+ if save_result_dir:
132
+ test_df["model_output"] = result
133
+ for i, choice in enumerate(choices):
134
+ test_df[f"prob_{choice}"] = all_probs[f"prob_{choice}"]
135
+ if score:
136
+ test_df["correctness"] = score
137
+ os.makedirs(save_result_dir, exist_ok=True)
138
+ test_df.to_csv(
139
+ os.path.join(save_result_dir, f"{subject_name}_result.csv"),
140
+ encoding="utf-8",
141
+ index=False,
142
+ )
143
+
144
+ return correct_ratio
145
+
146
+
147
+ def cal_ceval(res):
148
+ acc_sum_dict = dict()
149
+ acc_norm_sum_dict = dict()
150
+ cnt_dict = dict()
151
+ acc_sum = 0.0
152
+ cnt = 0
153
+ hard_cnt = 0
154
+ hard_acc_sum = 0.0
155
+ for tt in res.keys():
156
+ name = tt.split("-")[-1]
157
+ acc_sum += float(res[tt])
158
+ cnt += 1
159
+ class_ = TASK_NAME_MAPPING[name][2]
160
+ if class_ not in acc_sum_dict:
161
+ acc_sum_dict[class_] = 0.0
162
+ acc_norm_sum_dict[class_] = 0.0
163
+ cnt_dict[class_] = 0.0
164
+ if name in hard_list:
165
+ hard_cnt += 1
166
+ hard_acc_sum += float(res[tt])
167
+ acc_sum_dict[class_] += float(res[tt])
168
+ cnt_dict[class_] += 1
169
+ print("\n\n\n")
170
+ for k in ["STEM", "Social Science", "Humanities", "Other"]:
171
+ if k in cnt_dict:
172
+ print("%s acc: %.2f " % (k, acc_sum_dict[k] / cnt_dict[k]))
173
+ if hard_cnt > 0:
174
+ print("Hard acc:%.2f " % (hard_acc_sum / hard_cnt))
175
+ print("AVERAGE acc:%.2f " % (acc_sum / cnt))
176
+
177
+
178
+ TASK_NAME_MAPPING = {
179
+ "computer_network": ["Computer Network", "\u8ba1\u7b97\u673a\u7f51\u7edc", "STEM"],
180
+ "operating_system": ["Operating System", "\u64cd\u4f5c\u7cfb\u7edf", "STEM"],
181
+ "computer_architecture": [
182
+ "Computer Architecture",
183
+ "\u8ba1\u7b97\u673a\u7ec4\u6210",
184
+ "STEM",
185
+ ],
186
+ "college_programming": ["College Programming", "\u5927\u5b66\u7f16\u7a0b", "STEM"],
187
+ "college_physics": ["College Physics", "\u5927\u5b66\u7269\u7406", "STEM"],
188
+ "college_chemistry": ["College Chemistry", "\u5927\u5b66\u5316\u5b66", "STEM"],
189
+ "advanced_mathematics": [
190
+ "Advanced Mathematics",
191
+ "\u9ad8\u7b49\u6570\u5b66",
192
+ "STEM",
193
+ ],
194
+ "probability_and_statistics": [
195
+ "Probability and Statistics",
196
+ "\u6982\u7387\u7edf\u8ba1",
197
+ "STEM",
198
+ ],
199
+ "discrete_mathematics": [
200
+ "Discrete Mathematics",
201
+ "\u79bb\u6563\u6570\u5b66",
202
+ "STEM",
203
+ ],
204
+ "electrical_engineer": [
205
+ "Electrical Engineer",
206
+ "\u6ce8\u518c\u7535\u6c14\u5de5\u7a0b\u5e08",
207
+ "STEM",
208
+ ],
209
+ "metrology_engineer": [
210
+ "Metrology Engineer",
211
+ "\u6ce8\u518c\u8ba1\u91cf\u5e08",
212
+ "STEM",
213
+ ],
214
+ "high_school_mathematics": [
215
+ "High School Mathematics",
216
+ "\u9ad8\u4e2d\u6570\u5b66",
217
+ "STEM",
218
+ ],
219
+ "high_school_physics": ["High School Physics", "\u9ad8\u4e2d\u7269\u7406", "STEM"],
220
+ "high_school_chemistry": [
221
+ "High School Chemistry",
222
+ "\u9ad8\u4e2d\u5316\u5b66",
223
+ "STEM",
224
+ ],
225
+ "high_school_biology": ["High School Biology", "\u9ad8\u4e2d\u751f\u7269", "STEM"],
226
+ "middle_school_mathematics": [
227
+ "Middle School Mathematics",
228
+ "\u521d\u4e2d\u6570\u5b66",
229
+ "STEM",
230
+ ],
231
+ "middle_school_biology": [
232
+ "Middle School Biology",
233
+ "\u521d\u4e2d\u751f\u7269",
234
+ "STEM",
235
+ ],
236
+ "middle_school_physics": [
237
+ "Middle School Physics",
238
+ "\u521d\u4e2d\u7269\u7406",
239
+ "STEM",
240
+ ],
241
+ "middle_school_chemistry": [
242
+ "Middle School Chemistry",
243
+ "\u521d\u4e2d\u5316\u5b66",
244
+ "STEM",
245
+ ],
246
+ "veterinary_medicine": ["Veterinary Medicine", "\u517d\u533b\u5b66", "STEM"],
247
+ "college_economics": [
248
+ "College Economics",
249
+ "\u5927\u5b66\u7ecf\u6d4e\u5b66",
250
+ "Social Science",
251
+ ],
252
+ "business_administration": [
253
+ "Business Administration",
254
+ "\u5de5\u5546\u7ba1\u7406",
255
+ "Social Science",
256
+ ],
257
+ "marxism": [
258
+ "Marxism",
259
+ "\u9a6c\u514b\u601d\u4e3b\u4e49\u57fa\u672c\u539f\u7406",
260
+ "Social Science",
261
+ ],
262
+ "mao_zedong_thought": [
263
+ "Mao Zedong Thought",
264
+ "\u6bdb\u6cfd\u4e1c\u601d\u60f3\u548c\u4e2d\u56fd\u7279\u8272\u793e\u4f1a\u4e3b\u4e49\u7406\u8bba\u4f53\u7cfb\u6982\u8bba",
265
+ "Social Science",
266
+ ],
267
+ "education_science": ["Education Science", "\u6559\u80b2\u5b66", "Social Science"],
268
+ "teacher_qualification": [
269
+ "Teacher Qualification",
270
+ "\u6559\u5e08\u8d44\u683c",
271
+ "Social Science",
272
+ ],
273
+ "high_school_politics": [
274
+ "High School Politics",
275
+ "\u9ad8\u4e2d\u653f\u6cbb",
276
+ "Social Science",
277
+ ],
278
+ "high_school_geography": [
279
+ "High School Geography",
280
+ "\u9ad8\u4e2d\u5730\u7406",
281
+ "Social Science",
282
+ ],
283
+ "middle_school_politics": [
284
+ "Middle School Politics",
285
+ "\u521d\u4e2d\u653f\u6cbb",
286
+ "Social Science",
287
+ ],
288
+ "middle_school_geography": [
289
+ "Middle School Geography",
290
+ "\u521d\u4e2d\u5730\u7406",
291
+ "Social Science",
292
+ ],
293
+ "modern_chinese_history": [
294
+ "Modern Chinese History",
295
+ "\u8fd1\u4ee3\u53f2\u7eb2\u8981",
296
+ "Humanities",
297
+ ],
298
+ "ideological_and_moral_cultivation": [
299
+ "Ideological and Moral Cultivation",
300
+ "\u601d\u60f3\u9053\u5fb7\u4fee\u517b\u4e0e\u6cd5\u5f8b\u57fa\u7840",
301
+ "Humanities",
302
+ ],
303
+ "logic": ["Logic", "\u903b\u8f91\u5b66", "Humanities"],
304
+ "law": ["Law", "\u6cd5\u5b66", "Humanities"],
305
+ "chinese_language_and_literature": [
306
+ "Chinese Language and Literature",
307
+ "\u4e2d\u56fd\u8bed\u8a00\u6587\u5b66",
308
+ "Humanities",
309
+ ],
310
+ "art_studies": ["Art Studies", "\u827a\u672f\u5b66", "Humanities"],
311
+ "professional_tour_guide": [
312
+ "Professional Tour Guide",
313
+ "\u5bfc\u6e38\u8d44\u683c",
314
+ "Humanities",
315
+ ],
316
+ "legal_professional": [
317
+ "Legal Professional",
318
+ "\u6cd5\u5f8b\u804c\u4e1a\u8d44\u683c",
319
+ "Humanities",
320
+ ],
321
+ "high_school_chinese": [
322
+ "High School Chinese",
323
+ "\u9ad8\u4e2d\u8bed\u6587",
324
+ "Humanities",
325
+ ],
326
+ "high_school_history": [
327
+ "High School History",
328
+ "\u9ad8\u4e2d\u5386\u53f2",
329
+ "Humanities",
330
+ ],
331
+ "middle_school_history": [
332
+ "Middle School History",
333
+ "\u521d\u4e2d\u5386\u53f2",
334
+ "Humanities",
335
+ ],
336
+ "civil_servant": ["Civil Servant", "\u516c\u52a1\u5458", "Other"],
337
+ "sports_science": ["Sports Science", "\u4f53\u80b2\u5b66", "Other"],
338
+ "plant_protection": ["Plant Protection", "\u690d\u7269\u4fdd\u62a4", "Other"],
339
+ "basic_medicine": ["Basic Medicine", "\u57fa\u7840\u533b\u5b66", "Other"],
340
+ "clinical_medicine": ["Clinical Medicine", "\u4e34\u5e8a\u533b\u5b66", "Other"],
341
+ "urban_and_rural_planner": [
342
+ "Urban and Rural Planner",
343
+ "\u6ce8\u518c\u57ce\u4e61\u89c4\u5212\u5e08",
344
+ "Other",
345
+ ],
346
+ "accountant": ["Accountant", "\u6ce8\u518c\u4f1a\u8ba1\u5e08", "Other"],
347
+ "fire_engineer": [
348
+ "Fire Engineer",
349
+ "\u6ce8\u518c\u6d88\u9632\u5de5\u7a0b\u5e08",
350
+ "Other",
351
+ ],
352
+ "environmental_impact_assessment_engineer": [
353
+ "Environmental Impact Assessment Engineer",
354
+ "\u73af\u5883\u5f71\u54cd\u8bc4\u4ef7\u5de5\u7a0b\u5e08",
355
+ "Other",
356
+ ],
357
+ "tax_accountant": ["Tax Accountant", "\u7a0e\u52a1\u5e08", "Other"],
358
+ "physician": ["Physician", "\u533b\u5e08\u8d44\u683c", "Other"],
359
+ }
360
+ hard_list = [
361
+ "advanced_mathematics",
362
+ "discrete_mathematics",
363
+ "probability_and_statistics",
364
+ "college_physics",
365
+ "college_chemistry",
366
+ "high_school_mathematics",
367
+ "high_school_physics",
368
+ "high_school_chemistry",
369
+ ]
370
+ choices = ["A", "B", "C", "D"]
371
+
372
+
373
+ def main(args):
374
+ model, tokenizer = load_models_tokenizer(args)
375
+
376
+ dev_result = {}
377
+ for subject_name in tqdm(TASK_NAME_MAPPING.keys()):
378
+ val_file_path = os.path.join(
379
+ args.eval_data_path, "val", f"{subject_name}_val.csv"
380
+ )
381
+ dev_file_path = os.path.join(
382
+ args.eval_data_path, "dev", f"{subject_name}_dev.csv"
383
+ )
384
+ # test_file_path = os.path.join(args.eval_data_path, 'test', f'{subject_name}_test.csv')
385
+ val_df = pd.read_csv(val_file_path)
386
+ dev_df = pd.read_csv(dev_file_path)
387
+ # test_df = pd.read_csv(test_file_path)
388
+
389
+ score = eval_subject(
390
+ model,
391
+ tokenizer,
392
+ subject_name,
393
+ val_df,
394
+ dev_df=dev_df,
395
+ k=5,
396
+ few_shot=True,
397
+ save_result_dir=f"outs/ceval_eval_result",
398
+ )
399
+ dev_result[subject_name] = score
400
+ cal_ceval(dev_result)
401
+
402
+
403
+ if __name__ == "__main__":
404
+ parser = argparse.ArgumentParser(description="Test HF checkpoint.")
405
+ parser.add_argument(
406
+ "-c",
407
+ "--checkpoint-path",
408
+ type=str,
409
+ help="Checkpoint path",
410
+ default="Qwen/Qwen-7B",
411
+ )
412
+ parser.add_argument("-s", "--seed", type=int, default=1234, help="Random seed")
413
+
414
+ # Provide extra arguments required for tasks
415
+ group = parser.add_argument_group(title="Evaluation options")
416
+ group.add_argument(
417
+ "-d", "--eval_data_path", type=str, required=True, help="Path to eval data"
418
+ )
419
+ group.add_argument(
420
+ "--max-seq-len",
421
+ type=int,
422
+ default=2048,
423
+ help="Size of the output generated text.",
424
+ )
425
+ group.add_argument(
426
+ "--debug", action="store_true", default=False, help="Print infos."
427
+ )
428
+
429
+ args = parser.parse_args()
430
+ set_seed(args.seed)
431
+
432
+ main(args)
eval/evaluate_chat_ceval.py ADDED
@@ -0,0 +1,459 @@
1
+ import os
2
+ import argparse
3
+ import re
4
+ import torch
5
+ import pandas as pd
6
+ from thefuzz import process
7
+ from tqdm import tqdm
8
+ from transformers.trainer_utils import set_seed
9
+ from transformers import AutoModelForCausalLM, AutoTokenizer
10
+ from transformers.generation import GenerationConfig
11
+
12
+ '''
13
+ wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
14
+ mkdir data/ceval
15
+ mv ceval-exam.zip data/ceval
16
+ cd data/ceval; unzip ceval-exam.zip
17
+ cd ../../
18
+
19
+ pip install thefuzz
20
+ python eval/evaluate_chat_ceval.py -d data/ceval
21
+ '''
22
+
23
+ def load_models_tokenizer(args):
24
+ tokenizer = AutoTokenizer.from_pretrained(
25
+ args.checkpoint_path, trust_remote_code=True
26
+ )
27
+ model = AutoModelForCausalLM.from_pretrained(
28
+ args.checkpoint_path, device_map="auto", trust_remote_code=True
29
+ ).eval()
30
+ model.generation_config = GenerationConfig.from_pretrained(
31
+ args.checkpoint_path, trust_remote_code=True
32
+ )
33
+ model.generation_config.do_sample = False # use greedy decoding
34
+ return model, tokenizer
35
+
36
+ def process_before_extraction(gen, question, choice_dict):
37
+ # Example Prompt:
38
+ # 关于传输层的面向连接服务的特性是____。
39
+ # A. 既不保证可靠,也不保证按序交付
40
+ # B. 不保证可靠,但保证按序交付
41
+ # C. 保证可靠,但不保证按序交付
42
+ # D. 既保证可靠,也保证按序交付
43
+ # Example Model Output:
44
+ # 关于传输层的面向连接服务的特性是既保证可靠,也保证按序交付
45
+ # Processed Output:
46
+ # 答案是D
47
+
48
+ question_split = question.rstrip("。").split("。")[-1].split("_")
49
+
50
+ # replacing the question
51
+ if len(question_split[0].strip()) > 4:
52
+ gen = gen.replace(question_split[0], "答案是")
53
+ if len(question_split[-1].strip()) > 4:
54
+ gen = gen.replace(question_split[-1], "")
55
+
56
+ # replace the choice by letter in the generated sentence
57
+ # from longest one to shortest one
58
+ for key, val in sorted(choice_dict.items(), key=lambda x: len(x[1]), reverse=True):
59
+ gen = gen.replace(val.rstrip("。"), key)
60
+ return gen
61
+
62
+
63
+ def count_substr(gen, pattern):
64
+ return len(re.findall(pattern, gen))
65
+
66
+
67
+ def extract_choice(gen, prompt, choice_list):
68
+ # 答案是A | 选项是A | 应该选A选项
69
+ res = re.search(
70
+ r"(?:(?:选|选择|选定)[::]?\s*|(?:(?:答案|选项)(?![^ABCD]{0,10}?(?:不|非)[^ABCD]{0,10}?(?:是|选|为|:|:|】))[^ABCD]{0,10}?(?:是|选|为|:|:|】))[^ABCD]{0,10}?)(A|B|C|D)(?:选项)?(?:\)|。|\.|,|,|.|、|A|B|C|D|$|:|:|\)|))",
71
+ gen,
72
+ )
73
+
74
+ # A选项正确 | A选项符合题意
75
+ if res is None:
76
+ res = re.search(
77
+ r"(A|B|C|D)(?:选?项)?(?![^ABCD]{0,4}?(?:不|非)[^ABCD]{0,4}?(?:正确|对[的,。:]|符合))[^ABCD]{0,4}?(?:正确|对[的,。:]|符合)",
78
+ gen,
79
+ )
80
+
81
+ # 直接输出 A
82
+ if res is None:
83
+ res = re.search(r"^[\((]?(A|B|C|D)(?:。|\)|)|\.|,|,|.|:|:|$)", gen)
84
+
85
+ # 获取第一个出现的字母
86
+ if res is None:
87
+ res = re.search(r"(?<![a-zA-Z])(A|B|C|D)(?![a-zA-Z=])", gen)
88
+
89
+ if res is None:
90
+ return choices[choice_list.index(process.extractOne(gen, choice_list)[0])]
91
+ return res.group(1)
92
+
93
+
94
+ def format_example(line):
95
+ example = line["question"] + "\n\n"
96
+ for choice in choices:
97
+ example += f'{choice}. {line[f"{choice}"]}\n'
98
+ return example
99
+
100
+
101
+ def extract_answer(response, row):
102
+ prompt = row["question"]
103
+ gen = process_before_extraction(
104
+ response, prompt, {choice: row[choice] for choice in choices}
105
+ )
106
+ if not isinstance(prompt, str):
107
+ prompt = prompt[0]
108
+ pred = extract_choice(gen, prompt, [row[choice] for choice in choices])
109
+ return pred
110
+
111
+
112
+ @torch.no_grad()
113
+ def eval_subject(
114
+ model,
115
+ tokenizer,
116
+ subject_name,
117
+ test_df,
118
+ save_result_dir=None,
119
+ overwrite=False,
120
+ **kwargs
121
+ ):
122
+ result_path = os.path.join(save_result_dir, f"{subject_name}_result.csv")
123
+ if not overwrite and os.path.exists(result_path):
124
+ print(f"{result_path} existed, skip!")
125
+ score = []
126
+ for (_, datarow), (_, resultrow) in zip(
127
+ test_df.iterrows(), pd.read_csv(result_path).iterrows()
128
+ ):
129
+ pred = extract_answer(resultrow["model_response"], datarow)
130
+ correct = 1 if pred == datarow["answer"] else 0
131
+ score.append(correct)
132
+ correct_ratio = 100 * sum(score) / len(score)
133
+ return correct_ratio
134
+
135
+ responses = []
136
+ result = []
137
+ score = []
138
+
139
+ for _, row in tqdm(test_df.iterrows(), total=len(test_df)):
140
+ question = format_example(row)
141
+
142
+ response, _ = model.chat(
143
+ tokenizer,
144
+ question,
145
+ history=None,
146
+ )
147
+ print(question)
148
+ print(response)
149
+ pred = extract_answer(response, row)
150
+ print(pred)
151
+ print("======================")
152
+
153
+ if "answer" in row:
154
+ correct = 1 if pred == row["answer"] else 0
155
+ score.append(correct)
156
+ if args.debug:
157
+ print(f'{question} pred: {pred} ref: {row["answer"]}')
158
+ responses.append(response)
159
+ result.append(pred)
160
+
161
+ if score:
162
+ correct_ratio = 100 * sum(score) / len(score)
163
+ if args.debug:
164
+ print(subject_name, correct_ratio)
165
+ else:
166
+ correct_ratio = 0
167
+ if save_result_dir:
168
+ test_df["model_response"] = responses
169
+ test_df["model_output"] = result
170
+ if score:
171
+ test_df["correctness"] = score
172
+ os.makedirs(save_result_dir, exist_ok=True)
173
+ test_df.to_csv(result_path, encoding="utf-8", index=False)
174
+
175
+ return correct_ratio
176
+
177
+
178
+ def cal_ceval(res):
179
+ acc_sum_dict = dict()
180
+ acc_norm_sum_dict = dict()
181
+ cnt_dict = dict()
182
+ acc_sum = 0.0
183
+ cnt = 0
184
+ hard_cnt = 0
185
+ hard_acc_sum = 0.0
186
+ for tt in res.keys():
187
+ name = tt.split("-")[-1]
188
+ acc_sum += float(res[tt])
189
+ cnt += 1
190
+ class_ = TASK_NAME_MAPPING[name][2]
191
+ if class_ not in acc_sum_dict:
192
+ acc_sum_dict[class_] = 0.0
193
+ acc_norm_sum_dict[class_] = 0.0
194
+ cnt_dict[class_] = 0.0
195
+ if name in hard_list:
196
+ hard_cnt += 1
197
+ hard_acc_sum += float(res[tt])
198
+ acc_sum_dict[class_] += float(res[tt])
199
+ cnt_dict[class_] += 1
200
+ print("\n\n\n")
201
+ for k in ["STEM", "Social Science", "Humanities", "Other"]:
202
+ if k in cnt_dict:
203
+ print("%s acc: %.2f " % (k, acc_sum_dict[k] / cnt_dict[k]))
204
+ if hard_cnt > 0:
205
+ print("Hard acc:%.2f " % (hard_acc_sum / hard_cnt))
206
+ print("AVERAGE acc:%.2f " % (acc_sum / cnt))
207
+
208
+
209
+ TASK_NAME_MAPPING = {
210
+ "computer_network": ["Computer Network", "\u8ba1\u7b97\u673a\u7f51\u7edc", "STEM"],
211
+ "operating_system": ["Operating System", "\u64cd\u4f5c\u7cfb\u7edf", "STEM"],
212
+ "computer_architecture": [
213
+ "Computer Architecture",
214
+ "\u8ba1\u7b97\u673a\u7ec4\u6210",
215
+ "STEM",
216
+ ],
217
+ "college_programming": ["College Programming", "\u5927\u5b66\u7f16\u7a0b", "STEM"],
218
+ "college_physics": ["College Physics", "\u5927\u5b66\u7269\u7406", "STEM"],
219
+ "college_chemistry": ["College Chemistry", "\u5927\u5b66\u5316\u5b66", "STEM"],
220
+ "advanced_mathematics": [
221
+ "Advanced Mathematics",
222
+ "\u9ad8\u7b49\u6570\u5b66",
223
+ "STEM",
224
+ ],
225
+ "probability_and_statistics": [
226
+ "Probability and Statistics",
227
+ "\u6982\u7387\u7edf\u8ba1",
228
+ "STEM",
229
+ ],
230
+ "discrete_mathematics": [
231
+ "Discrete Mathematics",
232
+ "\u79bb\u6563\u6570\u5b66",
233
+ "STEM",
234
+ ],
235
+ "electrical_engineer": [
236
+ "Electrical Engineer",
237
+ "\u6ce8\u518c\u7535\u6c14\u5de5\u7a0b\u5e08",
238
+ "STEM",
239
+ ],
240
+ "metrology_engineer": [
241
+ "Metrology Engineer",
242
+ "\u6ce8\u518c\u8ba1\u91cf\u5e08",
243
+ "STEM",
244
+ ],
245
+ "high_school_mathematics": [
246
+ "High School Mathematics",
247
+ "\u9ad8\u4e2d\u6570\u5b66",
248
+ "STEM",
249
+ ],
250
+ "high_school_physics": ["High School Physics", "\u9ad8\u4e2d\u7269\u7406", "STEM"],
251
+ "high_school_chemistry": [
252
+ "High School Chemistry",
253
+ "\u9ad8\u4e2d\u5316\u5b66",
254
+ "STEM",
255
+ ],
256
+ "high_school_biology": ["High School Biology", "\u9ad8\u4e2d\u751f\u7269", "STEM"],
257
+ "middle_school_mathematics": [
258
+ "Middle School Mathematics",
259
+ "\u521d\u4e2d\u6570\u5b66",
260
+ "STEM",
261
+ ],
262
+ "middle_school_biology": [
263
+ "Middle School Biology",
264
+ "\u521d\u4e2d\u751f\u7269",
265
+ "STEM",
266
+ ],
267
+ "middle_school_physics": [
268
+ "Middle School Physics",
269
+ "\u521d\u4e2d\u7269\u7406",
270
+ "STEM",
271
+ ],
272
+ "middle_school_chemistry": [
273
+ "Middle School Chemistry",
274
+ "\u521d\u4e2d\u5316\u5b66",
275
+ "STEM",
276
+ ],
277
+ "veterinary_medicine": ["Veterinary Medicine", "\u517d\u533b\u5b66", "STEM"],
278
+ "college_economics": [
279
+ "College Economics",
280
+ "\u5927\u5b66\u7ecf\u6d4e\u5b66",
281
+ "Social Science",
282
+ ],
283
+ "business_administration": [
284
+ "Business Administration",
285
+ "\u5de5\u5546\u7ba1\u7406",
286
+ "Social Science",
287
+ ],
288
+ "marxism": [
289
+ "Marxism",
290
+ "\u9a6c\u514b\u601d\u4e3b\u4e49\u57fa\u672c\u539f\u7406",
291
+ "Social Science",
292
+ ],
293
+ "mao_zedong_thought": [
294
+ "Mao Zedong Thought",
295
+ "\u6bdb\u6cfd\u4e1c\u601d\u60f3\u548c\u4e2d\u56fd\u7279\u8272\u793e\u4f1a\u4e3b\u4e49\u7406\u8bba\u4f53\u7cfb\u6982\u8bba",
296
+ "Social Science",
297
+ ],
298
+ "education_science": ["Education Science", "\u6559\u80b2\u5b66", "Social Science"],
299
+ "teacher_qualification": [
300
+ "Teacher Qualification",
301
+ "\u6559\u5e08\u8d44\u683c",
302
+ "Social Science",
303
+ ],
304
+ "high_school_politics": [
305
+ "High School Politics",
306
+ "\u9ad8\u4e2d\u653f\u6cbb",
307
+ "Social Science",
308
+ ],
309
+ "high_school_geography": [
310
+ "High School Geography",
311
+ "\u9ad8\u4e2d\u5730\u7406",
312
+ "Social Science",
313
+ ],
314
+ "middle_school_politics": [
315
+ "Middle School Politics",
316
+ "\u521d\u4e2d\u653f\u6cbb",
317
+ "Social Science",
318
+ ],
319
+ "middle_school_geography": [
320
+ "Middle School Geography",
321
+ "\u521d\u4e2d\u5730\u7406",
322
+ "Social Science",
323
+ ],
324
+ "modern_chinese_history": [
325
+ "Modern Chinese History",
326
+ "\u8fd1\u4ee3\u53f2\u7eb2\u8981",
327
+ "Humanities",
328
+ ],
329
+ "ideological_and_moral_cultivation": [
330
+ "Ideological and Moral Cultivation",
331
+ "\u601d\u60f3\u9053\u5fb7\u4fee\u517b\u4e0e\u6cd5\u5f8b\u57fa\u7840",
332
+ "Humanities",
333
+ ],
334
+ "logic": ["Logic", "\u903b\u8f91\u5b66", "Humanities"],
335
+ "law": ["Law", "\u6cd5\u5b66", "Humanities"],
336
+ "chinese_language_and_literature": [
337
+ "Chinese Language and Literature",
338
+ "\u4e2d\u56fd\u8bed\u8a00\u6587\u5b66",
339
+ "Humanities",
340
+ ],
341
+ "art_studies": ["Art Studies", "\u827a\u672f\u5b66", "Humanities"],
342
+ "professional_tour_guide": [
343
+ "Professional Tour Guide",
344
+ "\u5bfc\u6e38\u8d44\u683c",
345
+ "Humanities",
346
+ ],
347
+ "legal_professional": [
348
+ "Legal Professional",
349
+ "\u6cd5\u5f8b\u804c\u4e1a\u8d44\u683c",
350
+ "Humanities",
351
+ ],
352
+ "high_school_chinese": [
353
+ "High School Chinese",
354
+ "\u9ad8\u4e2d\u8bed\u6587",
355
+ "Humanities",
356
+ ],
357
+ "high_school_history": [
358
+ "High School History",
359
+ "\u9ad8\u4e2d\u5386\u53f2",
360
+ "Humanities",
361
+ ],
362
+ "middle_school_history": [
363
+ "Middle School History",
364
+ "\u521d\u4e2d\u5386\u53f2",
365
+ "Humanities",
366
+ ],
367
+ "civil_servant": ["Civil Servant", "\u516c\u52a1\u5458", "Other"],
368
+ "sports_science": ["Sports Science", "\u4f53\u80b2\u5b66", "Other"],
369
+ "plant_protection": ["Plant Protection", "\u690d\u7269\u4fdd\u62a4", "Other"],
370
+ "basic_medicine": ["Basic Medicine", "\u57fa\u7840\u533b\u5b66", "Other"],
371
+ "clinical_medicine": ["Clinical Medicine", "\u4e34\u5e8a\u533b\u5b66", "Other"],
372
+ "urban_and_rural_planner": [
373
+ "Urban and Rural Planner",
374
+ "\u6ce8\u518c\u57ce\u4e61\u89c4\u5212\u5e08",
375
+ "Other",
376
+ ],
377
+ "accountant": ["Accountant", "\u6ce8\u518c\u4f1a\u8ba1\u5e08", "Other"],
378
+ "fire_engineer": [
379
+ "Fire Engineer",
380
+ "\u6ce8\u518c\u6d88\u9632\u5de5\u7a0b\u5e08",
381
+ "Other",
382
+ ],
383
+ "environmental_impact_assessment_engineer": [
384
+ "Environmental Impact Assessment Engineer",
385
+ "\u73af\u5883\u5f71\u54cd\u8bc4\u4ef7\u5de5\u7a0b\u5e08",
386
+ "Other",
387
+ ],
388
+ "tax_accountant": ["Tax Accountant", "\u7a0e\u52a1\u5e08", "Other"],
389
+ "physician": ["Physician", "\u533b\u5e08\u8d44\u683c", "Other"],
390
+ }
391
+ hard_list = [
392
+ "advanced_mathematics",
393
+ "discrete_mathematics",
394
+ "probability_and_statistics",
395
+ "college_physics",
396
+ "college_chemistry",
397
+ "high_school_mathematics",
398
+ "high_school_physics",
399
+ "high_school_chemistry",
400
+ ]
401
+ choices = ["A", "B", "C", "D"]
402
+
403
+
404
+ def main(args):
405
+ print("loading model weights")
406
+ if args.checkpoint_path:
407
+ model, tokenizer = load_models_tokenizer(args)
408
+ else:
409
+ model, tokenizer = None, None
410
+ print("model loaded")
411
+ dev_result = {}
412
+ for subject_name in tqdm(TASK_NAME_MAPPING.keys()):
413
+ val_file_path = os.path.join(
414
+ args.eval_data_path, "val", f"{subject_name}_val.csv"
415
+ )
416
+ val_df = pd.read_csv(val_file_path)
417
+
418
+ score = eval_subject(
419
+ model,
420
+ tokenizer,
421
+ subject_name,
422
+ val_df,
423
+ save_result_dir="outs_chat/ceval_eval_result",
424
+ overwrite=args.overwrite,
425
+ )
426
+ dev_result[subject_name] = score
427
+ cal_ceval(dev_result)
428
+
429
+
430
+ if __name__ == "__main__":
431
+ parser = argparse.ArgumentParser(description="Test HF checkpoint.")
432
+ parser.add_argument(
433
+ "-c",
434
+ "--checkpoint-path",
435
+ type=str,
436
+ help="Checkpoint path",
437
+ default="Qwen/Qwen-7B-Chat",
438
+ )
439
+ parser.add_argument("-s", "--seed", type=int, default=1234, help="Random seed")
440
+
441
+ # Provide extra arguments required for tasks
442
+ group = parser.add_argument_group(title="Evaluation options")
443
+ group.add_argument(
444
+ "-d", "--eval_data_path", type=str, required=True, help="Path to eval data"
445
+ )
446
+ group.add_argument(
447
+ "--debug", action="store_true", default=False, help="Print infos."
448
+ )
449
+ group.add_argument(
450
+ "--overwrite",
451
+ action="store_true",
452
+ default=False,
453
+ help="Overwrite existed results",
454
+ )
455
+
456
+ args = parser.parse_args()
457
+ set_seed(args.seed)
458
+
459
+ main(args)
eval/evaluate_chat_gsm8k.py ADDED
@@ -0,0 +1,151 @@
1
+ import json
2
+ import re
3
+ from pathlib import Path
4
+ import argparse
5
+ import numpy as np
6
+ import tqdm
7
+ from datasets import load_from_disk, load_dataset
8
+ from transformers import AutoModelForCausalLM, AutoTokenizer
9
+ from transformers.generation import GenerationConfig
10
+
11
+ '''
12
+ python eval/evaluate_chat_gsm8k.py [--use-fewshot]
13
+ '''
14
+
15
+ INVALID_ANS = "[invalid]"
16
+ DEVICE = "cuda:0"
17
+
18
+ def doc_to_text(doc, use_fewshot):
19
+ if use_fewshot:
20
+ context = (
21
+ "Question: Angelo and Melanie want to plan how many hours over the next week they should study together for their test next week. They have 2 chapters of their textbook to study and 4 worksheets to memorize. They figure out that they should dedicate 3 hours to each chapter of their textbook and 1.5 hours for each worksheet. If they plan to study no more than 4 hours each day, how many days should they plan to study total over the next week if they take a 10-minute break every hour, include 3 10-minute snack breaks each day, and 30 minutes for lunch each day?\nLet's think step by step\n"
22
+ "Angelo and Melanie think they should dedicate 3 hours to each of the 2 chapters, 3 hours x 2 chapters = 6 hours total.\nFor the worksheets they plan to dedicate 1.5 hours for each worksheet, 1.5 hours x 4 worksheets = 6 hours total.\nAngelo and Melanie need to start with planning 12 hours to study, at 4 hours a day, 12 / 4 = 3 days.\nHowever, they need to include time for breaks and lunch. Every hour they want to include a 10-minute break, so 12 total hours x 10 minutes = 120 extra minutes for breaks.\nThey also want to include 3 10-minute snack breaks, 3 x 10 minutes = 30 minutes.\nAnd they want to include 30 minutes for lunch each day, so 120 minutes for breaks + 30 minutes for snack breaks + 30 minutes for lunch = 180 minutes, or 180 / 60 minutes per hour = 3 extra hours.\nSo Angelo and Melanie want to plan 12 hours to study + 3 hours of breaks = 15 hours total.\nThey want to study no more than 4 hours each day, 15 hours / 4 hours each day = 3.75\nThey will need to plan to study 4 days to allow for all the time they need.\nThe answer is 4\n\n"
23
+ "Question: Mark's basketball team scores 25 2 pointers, 8 3 pointers and 10 free throws. Their opponents score double the 2 pointers but half the 3 pointers and free throws. What's the total number of points scored by both teams added together?\nLet's think step by step\n"
24
+ "Mark's team scores 25 2 pointers, meaning they scored 25*2= 50 points in 2 pointers.\nHis team also scores 6 3 pointers, meaning they scored 8*3= 24 points in 3 pointers\nThey scored 10 free throws, and free throws count as one point so they scored 10*1=10 points in free throws.\nAll together his team scored 50+24+10= 84 points\nMark's opponents scored double his team's number of 2 pointers, meaning they scored 50*2=100 points in 2 pointers.\nHis opponents scored half his team's number of 3 pointers, meaning they scored 24/2= 12 points in 3 pointers.\nThey also scored half Mark's team's points in free throws, meaning they scored 10/2=5 points in free throws.\nAll together Mark's opponents scored 100+12+5=117 points\nThe total score for the game is both team's scores added together, so it is 84+117=201 points\nThe answer is 201\n\n"
25
+ "Question: Bella has two times as many marbles as frisbees. She also has 20 more frisbees than deck cards. If she buys 2/5 times more of each item, what would be the total number of the items she will have if she currently has 60 marbles?\nLet's think step by step\n"
26
+ "When Bella buys 2/5 times more marbles, she'll have increased the number of marbles by 2/5*60 = 24\nThe total number of marbles she'll have is 60+24 = 84\nIf Bella currently has 60 marbles, and she has two times as many marbles as frisbees, she has 60/2 = 30 frisbees.\nIf Bella buys 2/5 times more frisbees, she'll have 2/5*30 = 12 more frisbees.\nThe total number of frisbees she'll have will increase to 30+12 = 42\nBella also has 20 more frisbees than deck cards, meaning she has 30-20 = 10 deck cards\nIf she buys 2/5 times more deck cards, she'll have 2/5*10 = 4 more deck cards.\nThe total number of deck cards she'll have is 10+4 = 14\nTogether, Bella will have a total of 14+42+84 = 140 items\nThe answer is 140\n\n"
27
+ "Question: A group of 4 fruit baskets contains 9 apples, 15 oranges, and 14 bananas in the first three baskets and 2 less of each fruit in the fourth basket. How many fruits are there?\nLet's think step by step\n"
28
+ "For the first three baskets, the number of apples and oranges in one basket is 9+15=24\nIn total, together with bananas, the number of fruits in one basket is 24+14=38 for the first three baskets.\nSince there are three baskets each having 38 fruits, there are 3*38=114 fruits in the first three baskets.\nThe number of apples in the fourth basket is 9-2=7\nThere are also 15-2=13 oranges in the fourth basket\nThe combined number of oranges and apples in the fourth basket is 13+7=20\nThe fourth basket also contains 14-2=12 bananas.\nIn total, the fourth basket has 20+12=32 fruits.\nThe four baskets together have 32+114=146 fruits.\nThe answer is 146\n\n"
29
+ f"Question: {doc['question']}\nLet's think step by step"
30
+ )
31
+ else:
32
+ context = doc["question"]
33
+ return context
34
+
35
+
36
+ def decode(tokens_list, tokenizer, raw_text_len):
37
+ sents = []
38
+ for tokens in tokens_list:
39
+ tokens = tokens.cpu().numpy().tolist()
40
+ sent = tokenizer.tokenizer.decode(tokens[raw_text_len:])
41
+ sent = sent.split("<|endoftext|>")[0]
42
+ sent = sent.split("\n\n\n")[0]
43
+ sent = sent.split("\n\n")[0]
44
+ sent = sent.split("Question:")[0]
45
+ sents.append(sent)
46
+ return sents
47
+
48
+
49
+ def generate_sample(model, tokenizer, question):
50
+ response, _ = model.chat(
51
+ tokenizer,
52
+ question,
53
+ history=None,
54
+ )
55
+ print(question)
56
+ print("-------------")
57
+ print(response)
58
+ print("=============")
59
+ return response
60
+
61
+
62
+ def extract_answer_hf(completion):
63
+ def _get_last_digit(s):
64
+ _PAT_LAST_DIGIT = re.compile(
65
+ r"(?<=(\s|[\$%#{]))([+-])?(?=(\S))(0|([1-9](\d*|\d{0,2}(,\d{3})*)))?(\.\d*[1-9])?(?=(\s|[.,}]|$))"
66
+ )
67
+ match = list(_PAT_LAST_DIGIT.finditer(s))
68
+ if match:
69
+ last_digit = match[-1].group().replace(",", "").replace("+", "")
70
+ # print(f"The last digit in {s} is {last_digit}")
71
+ else:
72
+ last_digit = None
73
+ print(f"No digits found in {s!r}")
74
+ return last_digit
75
+
76
+ job_gen = completion.strip(".").replace("\n", "\\n")
77
+ last_digit = _get_last_digit(job_gen)
78
+ if last_digit is not None:
79
+ return eval(last_digit)
80
+ return INVALID_ANS
81
+
82
+
83
+ def extract_answer(completion):
84
+ try:
85
+ last_number = re.findall(r"\d+", completion)[-1]
86
+ return eval(last_number)
87
+ except:
88
+ return INVALID_ANS
89
+
90
+
91
+ def is_correct(completion, answer):
92
+ gold = extract_answer(answer)
93
+ assert gold != INVALID_ANS, "No ground truth answer found in the document."
94
+ return extract_answer(completion) == gold
95
+
96
+
97
+ if __name__ == "__main__":
98
+ parser = argparse.ArgumentParser(description="Test HF checkpoint.")
99
+ parser.add_argument(
100
+ "-c",
101
+ "--checkpoint-path",
102
+ type=Path,
103
+ help="Checkpoint path",
104
+ default="Qwen/Qwen-7B-Chat",
105
+ )
106
+ parser.add_argument("-f", "--sample-input-file", type=str, default=None)
107
+ parser.add_argument(
108
+ "-o", "--sample-output-file", type=str, default="gsm8k_res.jsonl"
109
+ )
110
+ parser.add_argument("--use-fewshot", action="store_true")
111
+
112
+ args = parser.parse_args()
113
+
114
+ if args.sample_input_file is not None:
115
+ dataset = load_from_disk(args.sample_input_file)
116
+ else:
117
+ dataset = load_dataset("gsm8k", "main")
118
+
119
+ print("Loading tokenizer ...")
120
+ tokenizer = AutoTokenizer.from_pretrained(
121
+ args.checkpoint_path, trust_remote_code=True, bf16=True, use_flash_attn=True
122
+ )
123
+
124
+ print("Loading model ...")
125
+ model = AutoModelForCausalLM.from_pretrained(
126
+ args.checkpoint_path, device_map="auto", trust_remote_code=True
127
+ ).eval()
128
+ model.generation_config = GenerationConfig.from_pretrained(
129
+ args.checkpoint_path, trust_remote_code=True
130
+ )
131
+ model.generation_config.do_sample = False # use greedy decoding
132
+
133
+ test = dataset["test"]
134
+
135
+ f_output = open(args.sample_output_file, "w", encoding="utf-8")
136
+ tot_length = test.num_rows
137
+ acc_res = []
138
+ for doc in tqdm.tqdm(test):
139
+ context = doc_to_text(doc, args.use_fewshot)
140
+ print(context)
141
+ completion = generate_sample(model, tokenizer, context)
142
+ answer = doc["answer"]
143
+ acc = is_correct(completion, answer)
144
+ doc["completion"] = completion
145
+ doc["acc"] = acc
146
+ f_output.write(json.dumps(doc, ensure_ascii=False) + "\n")
147
+ f_output.flush()
148
+ acc_res.append(acc)
149
+
150
+ f_output.close()
151
+ print("4-shot Acc: " if args.use_fewshot else "Zero-shot Acc", np.mean(acc_res))
eval/evaluate_chat_humaneval.py ADDED
@@ -0,0 +1,109 @@
1
+
2
+ import re
3
+ import textwrap
4
+ import argparse
5
+ from pathlib import Path
6
+ import tqdm
7
+ import jsonlines
8
+ from transformers import AutoModelForCausalLM, AutoTokenizer
9
+ from transformers.generation import GenerationConfig
10
+
11
+ """
12
+ Get the HumanEval.jsonl file from [here](https://github.com/openai/human-eval/tree/master/data)
13
+
14
+ python eval/evaluate_chat_humaneval.py -f HumanEval.jsonl -o HumanEval_res.jsonl
15
+ git clone https://github.com/openai/human-eval
16
+ pip install -e human-eval
17
+ evaluate_functional_correctness HumanEval_res.jsonl
18
+ """
19
+
20
+ DEVICE = "cuda:0"
21
+
22
+ def extract_code(text, entry_point):
23
+ # 正则表达式匹配代码块
24
+ code_block_pattern = re.compile(
25
+ rf"```(?:[Pp]ython\n)?.*?def\s+{entry_point}.*?:\n(.*?)\n```", re.DOTALL
26
+ )
27
+ code_block = code_block_pattern.search(text)
28
+ if code_block is None:
29
+ code_block_pattern = re.compile(
30
+ rf"def\s+{entry_point}.*?:\n(.*?)(?:\n(?!\n*(?: |\t))|$)", re.DOTALL
31
+ )
32
+ code_block = code_block_pattern.search(text)
33
+ if code_block is None:
34
+ code_block_pattern = re.compile(
35
+ r"def.*?:\n(.*?)(?:\n(?!\n*(?: |\t))|$)", re.DOTALL
36
+ )
37
+ code_block = code_block_pattern.search(text)
38
+
39
+ if code_block is not None:
40
+ return code_block.group(1)
41
+
42
+ # if no code block is found, assume the LM is simply filling the code
43
+ return textwrap.indent(text, " " * 4)
44
+
45
+
46
+ def generate_sample(model, tokenizer, question, entry_point):
47
+ response, _ = model.chat(
48
+ tokenizer,
49
+ question,
50
+ history=None,
51
+ )
52
+ print(question)
53
+ print(response)
54
+ answer = extract_code(response, entry_point)
55
+ return answer, response
56
+
57
+
58
+ if __name__ == "__main__":
59
+ parser = argparse.ArgumentParser(description="Test HF checkpoint.")
60
+ parser.add_argument(
61
+ "-c",
62
+ "--checkpoint-path",
63
+ type=Path,
64
+ help="Checkpoint path",
65
+ default="Qwen/Qwen-7B-Chat",
66
+ )
67
+ parser.add_argument(
68
+ "-f",
69
+ "--sample-input-file",
70
+ type=str,
71
+ default=None,
72
+ help="data path to HumanEval.jsonl",
73
+ )
74
+ parser.add_argument(
75
+ "-o", "--sample-output-file", type=str, default="HumanEval_res.jsonl"
76
+ )
77
+
78
+ args = parser.parse_args()
79
+ print("Loading tokenizer ...")
80
+ tokenizer = AutoTokenizer.from_pretrained(
81
+ args.checkpoint_path, trust_remote_code=True
82
+ )
83
+
84
+ print("Loading model ...")
85
+ model = AutoModelForCausalLM.from_pretrained(
86
+ args.checkpoint_path,
87
+ device_map="auto",
88
+ trust_remote_code=True,
89
+ bf16=True,
90
+ use_flash_attn=True,
91
+ ).eval()
92
+ model.generation_config = GenerationConfig.from_pretrained(
93
+ args.checkpoint_path, trust_remote_code=True
94
+ )
95
+ model.generation_config.do_sample = False # use greedy decoding
96
+
97
+ f_output = jsonlines.Writer(open(args.sample_output_file, "w", encoding="utf-8"))
98
+
99
+ f = jsonlines.open(args.sample_input_file)
100
+ with f_output as output:
101
+ for jobj in tqdm.tqdm(f, desc="task_idx"):
102
+ prompt = "Help me fill the following code.\n" + jobj["prompt"]
103
+ task_id = jobj["task_id"]
104
+ answer, response = generate_sample(
105
+ model, tokenizer, prompt, jobj["entry_point"]
106
+ )
107
+ gen_jobjs = {"task_id": task_id, "completion": answer, "response": response}
108
+ output.write(gen_jobjs)
109
+ f_output.close()
eval/evaluate_chat_mmlu.py ADDED
@@ -0,0 +1,314 @@
1
+ import os
2
+ import argparse
3
+ import re
4
+ import torch
5
+ import pandas as pd
6
+ from tqdm import tqdm
7
+ from thefuzz import process
8
+ from transformers.trainer_utils import set_seed
9
+ from transformers import AutoModelForCausalLM, AutoTokenizer
10
+ from transformers.generation import GenerationConfig
11
+
12
+ '''
13
+ wget https://people.eecs.berkeley.edu/~hendrycks/data.tar
14
+ mkdir data/mmlu
15
+ mv data.tar data/mmlu
16
+ cd data/mmlu; tar xf data.tar
17
+ cd ../../
18
+
19
+ pip install thefuzz
20
+ python eval/evaluate_chat_mmlu.py -d data/mmlu/data/
21
+ '''
22
+
23
+ def load_models_tokenizer(args):
24
+ tokenizer = AutoTokenizer.from_pretrained(
25
+ args.checkpoint_path, trust_remote_code=True
26
+ )
27
+ model = AutoModelForCausalLM.from_pretrained(
28
+ args.checkpoint_path,
29
+ device_map="auto",
30
+ trust_remote_code=True,
31
+ bf16=True,
32
+ use_flash_attn=True,
33
+ ).eval()
34
+ model.generation_config = GenerationConfig.from_pretrained(
35
+ args.checkpoint_path, trust_remote_code=True
36
+ )
37
+ model.generation_config.do_sample = False # use greedy decoding
38
+ return model, tokenizer
39
+
40
+
41
+ def format_example(line):
42
+ example = (
43
+ "The following is a multiple-choice question. Please choose the most suitable one among A, B, C and D as the answer to this question.\n\n"
44
+ + line["question"]
45
+ + "\n"
46
+ )
47
+ for choice in choices:
48
+ example += f'{choice}. {line[f"{choice}"]}\n'
49
+ return example
50
+
51
+
52
+ def process_before_extraction(gen, choice_dict):
53
+ # replace the choice by letter in the generated sentence
54
+ # from longest one to shortest one
55
+ for key, val in sorted(choice_dict.items(), key=lambda x: len(x[1]), reverse=True):
56
+ pattern = re.compile(re.escape(val.rstrip(".")), re.IGNORECASE)
57
+ gen = pattern.sub(key, gen)
58
+ return gen
59
+
60
+
61
+ def extract_choice(gen, choice_list):
62
+ # answer is A | choice is A | choose A
63
+ res = re.search(
64
+ r"(?:(?:[Cc]hoose)|(?:(?:[Aa]nswer|[Cc]hoice)(?![^ABCD]{0,20}?(?:n't|not))[^ABCD]{0,10}?\b(?:|is|:|be))\b)[^ABCD]{0,20}?\b(A|B|C|D)\b",
65
+ gen,
66
+ )
67
+
68
+ # A is correct | A is right
69
+ if res is None:
70
+ res = re.search(
71
+ r"\b(A|B|C|D)\b(?![^ABCD]{0,8}?(?:n't|not)[^ABCD]{0,5}?(?:correct|right))[^ABCD]{0,10}?\b(?:correct|right)\b",
72
+ gen,
73
+ )
74
+
75
+ # straight answer: A
76
+ if res is None:
77
+ res = re.search(r"^(A|B|C|D)(?:\.|,|:|$)", gen)
78
+
79
+ # simply extract the first appearred letter
80
+ if res is None:
81
+ res = re.search(r"(?<![a-zA-Z])(A|B|C|D)(?![a-zA-Z=])", gen)
82
+
83
+ if res is None:
84
+ return choices[choice_list.index(process.extractOne(gen, choice_list)[0])]
85
+ return res.group(1)
86
+
87
+
88
+ def extract_answer(response, row):
89
+ gen = process_before_extraction(
90
+ response, {choice: row[choice] for choice in choices}
91
+ )
92
+ pred = extract_choice(gen, [row[choice] for choice in choices])
93
+ return pred
94
+
95
+
96
+ @torch.no_grad()
97
+ def eval_subject(
98
+ model,
99
+ tokenizer,
100
+ subject_name,
101
+ test_df,
102
+ save_result_dir=None,
103
+ overwrite=False,
104
+ **kwargs
105
+ ):
106
+ result_path = os.path.join(save_result_dir, f"{subject_name}_result.csv")
107
+ if not overwrite and os.path.exists(result_path):
108
+ print(f"{result_path} existed, skip!")
109
+ score = []
110
+ for (_, datarow), (_, resultrow) in zip(
111
+ test_df.iterrows(), pd.read_csv(result_path).iterrows()
112
+ ):
113
+ # pred = extract_answer(resultrow['model_response'], datarow)
114
+ pred = resultrow["model_output"]
115
+ correct = 1 if pred == datarow["answer"] else 0
116
+ score.append(correct)
117
+ return score
118
+
119
+ result = []
120
+ score = []
121
+
122
+ for _, row in tqdm(test_df.iterrows(), total=len(test_df)):
123
+ question = format_example(row)
124
+
125
+ response, _ = model.chat(
126
+ tokenizer,
127
+ question,
128
+ history=None,
129
+ )
130
+ print(question)
131
+ print(response)
132
+ pred = extract_answer(response, row)
133
+ print(pred)
134
+ print("======================")
135
+
136
+ if "answer" in row:
137
+ correct = 1 if pred == row["answer"] else 0
138
+ score.append(correct)
139
+ if args.debug:
140
+ print(f'{question} pred: {pred} ref: {row["answer"]}')
141
+ result.append(pred)
142
+
143
+ if save_result_dir:
144
+ test_df["model_output"] = result
145
+ test_df["model_response"] = response
146
+ if score:
147
+ test_df["correctness"] = score
148
+ os.makedirs(save_result_dir, exist_ok=True)
149
+ test_df.to_csv(
150
+ os.path.join(save_result_dir, f"{subject_name}_result.csv"),
151
+ encoding="utf-8",
152
+ index=False,
153
+ )
154
+
155
+ return score
156
+
157
+
158
+ def cal_mmlu(res):
159
+ acc_sum_dict = dict()
160
+ acc_norm_sum_dict = dict()
161
+ cnt_dict = dict()
162
+ acc_sum = 0.0
163
+ cnt = 0
164
+
165
+ for class_ in TASK_NAME_MAPPING.keys():
166
+ acc_sum_dict[class_] = 0.0
167
+ acc_norm_sum_dict[class_] = 0.0
168
+ cnt_dict[class_] = 0.0
169
+
170
+ for tt in TASK_NAME_MAPPING[class_]:
171
+ acc_sum += sum(res[tt])
172
+ cnt += len(res[tt])
173
+
174
+ acc_sum_dict[class_] += sum(res[tt])
175
+ cnt_dict[class_] += len(res[tt])
176
+
177
+ print("\n\n\n")
178
+ for k in TASK_NAME_MAPPING.keys():
179
+ if k in cnt_dict:
180
+ print("%s ACC: %.2f " % (k, acc_sum_dict[k] * 100 / cnt_dict[k]))
181
+ print("AVERAGE ACC:%.2f " % (acc_sum * 100 / cnt))
182
+
183
+
184
+ def main(args):
185
+ print("loading model weights")
186
+ if args.checkpoint_path is not None:
187
+ model, tokenizer = load_models_tokenizer(args)
188
+ else:
189
+ model, tokenizer = None, None
190
+ print("model loaded")
191
+
192
+ dev_result = {}
193
+ for subject_name in tqdm(SUBJECTS):
194
+ # val_file_path = os.path.join(args.eval_data_path, 'val', f'{subject_name}_val.csv')
195
+ # dev_file_path = os.path.join(args.eval_data_path, 'dev', f'{subject_name}_dev.csv')
196
+ test_file_path = os.path.join(
197
+ args.eval_data_path, "test", f"{subject_name}_test.csv"
198
+ )
199
+ # val_df = pd.read_csv(val_file_path, names=['question','A','B','C','D','answer'])
200
+ # dev_df = pd.read_csv(dev_file_path, names=['question','A','B','C','D','answer'])
201
+ test_df = pd.read_csv(
202
+ test_file_path, names=["question", "A", "B", "C", "D", "answer"]
203
+ )
204
+
205
+ score = eval_subject(
206
+ model,
207
+ tokenizer,
208
+ subject_name,
209
+ test_df,
210
+ save_result_dir=f"outs_chat/mmlu_eval_result",
211
+ overwrite=args.overwrite,
212
+ )
213
+ dev_result[subject_name] = score
214
+ cal_mmlu(dev_result)
215
+
216
+
217
+ TASK_NAME_MAPPING = {
218
+ "stem": [
219
+ "abstract_algebra",
220
+ "anatomy",
221
+ "astronomy",
222
+ "college_biology",
223
+ "college_chemistry",
224
+ "college_computer_science",
225
+ "college_mathematics",
226
+ "college_physics",
227
+ "computer_security",
228
+ "conceptual_physics",
229
+ "electrical_engineering",
230
+ "elementary_mathematics",
231
+ "high_school_biology",
232
+ "high_school_chemistry",
233
+ "high_school_computer_science",
234
+ "high_school_mathematics",
235
+ "high_school_physics",
236
+ "high_school_statistics",
237
+ "machine_learning",
238
+ ],
239
+ "Humanities": [
240
+ "formal_logic",
241
+ "high_school_european_history",
242
+ "high_school_us_history",
243
+ "high_school_world_history",
244
+ "international_law",
245
+ "jurisprudence",
246
+ "logical_fallacies",
247
+ "moral_disputes",
248
+ "moral_scenarios",
249
+ "philosophy",
250
+ "prehistory",
251
+ "professional_law",
252
+ "world_religions",
253
+ ],
254
+ "other": [
255
+ "business_ethics",
256
+ "college_medicine",
257
+ "human_aging",
258
+ "management",
259
+ "marketing",
260
+ "medical_genetics",
261
+ "miscellaneous",
262
+ "nutrition",
263
+ "professional_accounting",
264
+ "professional_medicine",
265
+ "virology",
266
+ "global_facts",
267
+ "clinical_knowledge",
268
+ ],
269
+ "social": [
270
+ "econometrics",
271
+ "high_school_geography",
272
+ "high_school_government_and_politics",
273
+ "high_school_macroeconomics",
274
+ "high_school_microeconomics",
275
+ "high_school_psychology",
276
+ "human_sexuality",
277
+ "professional_psychology",
278
+ "public_relations",
279
+ "security_studies",
280
+ "sociology",
281
+ "us_foreign_policy",
282
+ ],
283
+ }
284
+ SUBJECTS = [v for vl in TASK_NAME_MAPPING.values() for v in vl]
285
+ choices = ["A", "B", "C", "D"]
286
+
287
+ if __name__ == "__main__":
288
+ parser = argparse.ArgumentParser(description="Test HF checkpoint.")
289
+ parser.add_argument(
290
+ "-c",
291
+ "--checkpoint-path",
292
+ type=str,
293
+ help="Checkpoint path",
294
+ default="Qwen/Qwen-7B-Chat",
295
+ )
296
+ parser.add_argument("-s", "--seed", type=int, default=1234, help="Random seed")
297
+
298
+ # Provide extra arguments required for tasks
299
+ group = parser.add_argument_group(title="Evaluation options")
300
+ group.add_argument("-d", "--eval_data_path", type=str, help="Path to eval data")
301
+ group.add_argument(
302
+ "--debug", action="store_true", default=False, help="Print infos."
303
+ )
304
+ group.add_argument(
305
+ "--overwrite",
306
+ action="store_true",
307
+ default=False,
308
+ help="Overwrite existed results",
309
+ )
310
+
311
+ args = parser.parse_args()
312
+ set_seed(args.seed)
313
+
314
+ main(args)
eval/evaluate_cmmlu.py ADDED
@@ -0,0 +1,325 @@
1
+ import os
2
+ import pandas as pd
3
+ import numpy as np
4
+ import argparse
5
+ import datasets
6
+ import torch
7
+ from collections import defaultdict
8
+
9
+ from typing import List
10
+ from tqdm import tqdm
11
+ from transformers.trainer_utils import set_seed
12
+
13
+
14
+ """
15
+ wget https://huggingface.co/datasets/haonan-li/cmmlu/resolve/main/cmmlu_v1_0_1.zip
16
+ mkdir data/cmmlu
17
+ mv cmmlu_v1_0_1.zip data/cmmlu
18
+ cd data/cmmlu; unzip cmmlu_v1_0_1.zip
19
+ cd ../../
20
+ python eval/evaluate_cmmlu.py -d data/cmmlu/
21
+ """
22
+
23
+
24
+ def load_models_tokenizer(args):
25
+ from transformers import AutoModelForCausalLM, AutoTokenizer
26
+ from transformers.generation import GenerationConfig
27
+
28
+ tokenizer = AutoTokenizer.from_pretrained(
29
+ args.checkpoint_path, trust_remote_code=True
30
+ )
31
+ model = AutoModelForCausalLM.from_pretrained(
32
+ args.checkpoint_path, device_map="auto", trust_remote_code=True
33
+ ).eval()
34
+ model.generation_config = GenerationConfig.from_pretrained(
35
+ args.checkpoint_path, trust_remote_code=True
36
+ )
37
+ return model, tokenizer
38
+
39
+
40
+ def format_example(line, include_answer=True):
41
+ example = "问题:" + line["Question"]
42
+ for choice in choices:
43
+ example += f'\n{choice}. {line[f"{choice}"]}'
44
+
45
+ if include_answer:
46
+ example += "\n答案:" + line["Answer"] + "\n\n"
47
+ else:
48
+ example += "\n答案:"
49
+ return example
50
+
51
+
52
+ def generate_few_shot_prompt(k, subject, dev_df):
53
+ prompt = ""
54
+ if k == -1:
55
+ k = dev_df.shape[0]
56
+ for i in range(k):
57
+ prompt += format_example(
58
+ dev_df.iloc[i, :],
59
+ include_answer=True,
60
+ )
61
+ return prompt
62
+
63
+
64
+ def get_logits(tokenizer, model, inputs: List[str]):
65
+ input_ids = tokenizer(inputs, padding=False)["input_ids"]
66
+ input_ids = torch.tensor(input_ids, device=model.device)
67
+ tokens = {"input_ids": input_ids}
68
+
69
+ outputs = model(input_ids)["logits"]
70
+ logits = outputs[:, -1, :]
71
+ log_probs = torch.nn.functional.softmax(logits, dim=-1)
72
+ return log_probs, {"tokens": tokens}
73
+
74
+
75
+ @torch.no_grad()
76
+ def eval_subject(
77
+ model,
78
+ tokenizer,
79
+ subject_name,
80
+ test_df,
81
+ k=5,
82
+ dev_df=None,
83
+ few_shot=False,
84
+ save_result_dir=None,
85
+ **kwargs,
86
+ ):
87
+ result = []
88
+ score = []
89
+
90
+ few_shot_prompt = (
91
+ generate_few_shot_prompt(k, subject_name, dev_df) if few_shot else []
92
+ )
93
+ all_probs = {"prob_A": [], "prob_B": [], "prob_C": [], "prob_D": []}
94
+ if args.debug:
95
+ print(f"few_shot_prompt: {few_shot_prompt}")
96
+
97
+ for _, row in tqdm(test_df.iterrows(), total=len(test_df)):
98
+ question = format_example(row, include_answer=False)
99
+ full_prompt = few_shot_prompt + question
100
+
101
+ output, input_info = get_logits(tokenizer, model, [full_prompt])
102
+ assert output.shape[0] == 1
103
+ logits = output.flatten()
104
+
105
+ softval = torch.nn.functional.softmax(
106
+ torch.tensor(
107
+ [
108
+ logits[tokenizer("A")["input_ids"]],
109
+ logits[tokenizer("B")["input_ids"]],
110
+ logits[tokenizer("C")["input_ids"]],
111
+ logits[tokenizer("D")["input_ids"]],
112
+ ]
113
+ ),
114
+ dim=0,
115
+ )
116
+ if softval.dtype in {torch.bfloat16, torch.float16}:
117
+ softval = softval.to(dtype=torch.float32)
118
+ probs = softval.detach().cpu().numpy()
119
+
120
+ for i, choice in enumerate(choices):
121
+ all_probs[f"prob_{choice}"].append(probs[i])
122
+ pred = {0: "A", 1: "B", 2: "C", 3: "D"}[np.argmax(probs)]
123
+
124
+ if "Answer" in row:
125
+ correct = 1 if pred == row["Answer"] else 0
126
+ score.append(correct)
127
+ if args.debug:
128
+ print(f'{question} pred: {pred} ref: {row["Answer"]}')
129
+ result.append(pred)
130
+
131
+ if score:
132
+ correct_ratio = 100 * sum(score) / len(score)
133
+ if args.debug:
134
+ print(subject_name, correct_ratio)
135
+ else:
136
+ correct_ratio = 0
137
+ if save_result_dir:
138
+ test_df["model_output"] = result
139
+ for i, choice in enumerate(choices):
140
+ test_df[f"prob_{choice}"] = all_probs[f"prob_{choice}"]
141
+ if score:
142
+ test_df["correctness"] = score
143
+ os.makedirs(save_result_dir, exist_ok=True)
144
+ test_df.to_csv(
145
+ os.path.join(save_result_dir, f"{subject_name}_result.csv"),
146
+ encoding="utf-8",
147
+ index=False,
148
+ )
149
+
150
+ return correct_ratio
151
+
152
+
153
+ def cal_cmmlu(res):
154
+ print("\n\n\n")
155
+ res = {k.split("-")[-1]: float(v) for k, v in res.items()}
156
+ for k, v in TASK_NAME_MAPPING.items():
157
+ avg_acc = np.mean(list(map(lambda x: res[x], v)))
158
+ print(f"{k} acc: {avg_acc:.2f}")
159
+ avg_all_acc = np.mean(list(res.values()))
160
+ print(f"AVERAGE acc: {avg_all_acc:.2f}")
161
+
162
+
163
+ subcategories = {
164
+ "agronomy": ["other"],
165
+ "anatomy": ["biology"],
166
+ "ancient_chinese": ["linguistics", "china specific"],
167
+ "arts": ["arts"],
168
+ "astronomy": ["physics"],
169
+ "business_ethics": ["business"],
170
+ "chinese_civil_service_exam": ["politics", "china specific"],
171
+ "chinese_driving_rule": ["other", "china specific"],
172
+ "chinese_food_culture": ["culture", "china specific"],
173
+ "chinese_foreign_policy": ["politics", "china specific"],
174
+ "chinese_history": ["history", "china specific"],
175
+ "chinese_literature": ["literature", "china specific"],
176
+ "chinese_teacher_qualification": ["education", "china specific"],
177
+ "college_actuarial_science": ["math"],
178
+ "college_education": ["education"],
179
+ "college_engineering_hydrology": ["engineering"],
180
+ "college_law": ["law"],
181
+ "college_mathematics": ["math"],
182
+ "college_medical_statistics": ["statistics"],
183
+ "clinical_knowledge": ["other"],
184
+ "college_medicine": ["other"],
185
+ "computer_science": ["computer science"],
186
+ "computer_security": ["other"],
187
+ "conceptual_physics": ["physics"],
188
+ "construction_project_management": ["other", "china specific"],
189
+ "economics": ["economics"],
190
+ "education": ["education"],
191
+ "elementary_chinese": ["linguistics", "china specific"],
192
+ "elementary_commonsense": ["other", "china specific"],
193
+ "elementary_information_and_technology": ["other"],
194
+ "electrical_engineering": ["engineering"],
195
+ "elementary_mathematics": ["math"],
196
+ "ethnology": ["culture", "china specific"],
197
+ "food_science": ["other"],
198
+ "genetics": ["biology"],
199
+ "global_facts": ["global"],
200
+ "high_school_biology": ["biology"],
201
+ "high_school_chemistry": ["chemistry"],
202
+ "high_school_geography": ["geography"],
203
+ "high_school_mathematics": ["math"],
204
+ "high_school_physics": ["physics"],
205
+ "high_school_politics": ["politics", "china specific"],
206
+ "human_sexuality": ["other"],
207
+ "international_law": ["law"],
208
+ "journalism": ["sociology"],
209
+ "jurisprudence": ["law"],
210
+ "legal_and_moral_basis": ["other"],
211
+ "logical": ["philosophy"],
212
+ "machine_learning": ["computer science"],
213
+ "management": ["business"],
214
+ "marketing": ["business"],
215
+ "marxist_theory": ["philosophy"],
216
+ "modern_chinese": ["linguistics", "china specific"],
217
+ "nutrition": ["other"],
218
+ "philosophy": ["philosophy"],
219
+ "professional_accounting": ["business"],
220
+ "professional_law": ["law"],
221
+ "professional_medicine": ["other"],
222
+ "professional_psychology": ["psychology"],
223
+ "public_relations": ["politics"],
224
+ "security_study": ["politics"],
225
+ "sociology": ["culture"],
226
+ "sports_science": ["other"],
227
+ "traditional_chinese_medicine": ["other", "china specific"],
228
+ "virology": ["biology"],
229
+ "world_history": ["history"],
230
+ "world_religions": ["global"],
231
+ }
232
+
233
+ categories = {
234
+ "STEM": [
235
+ "physics",
236
+ "chemistry",
237
+ "biology",
238
+ "computer science",
239
+ "math",
240
+ "engineering",
241
+ "statistics",
242
+ ],
243
+ "Humanities": ["history", "philosophy", "law", "arts", "literature", "global"],
244
+ "Social Science": [
245
+ "linguistics",
246
+ "business",
247
+ "politics",
248
+ "culture",
249
+ "economics",
250
+ "geography",
251
+ "psychology",
252
+ "education",
253
+ "sociology",
254
+ ],
255
+ "Other": ["other"],
256
+ "China specific": ["china specific"],
257
+ }
258
+
259
+ TASK_NAME_MAPPING = defaultdict(list)
260
+ for k, v in categories.items():
261
+ for subject, subcat in subcategories.items():
262
+ for c in subcat:
263
+ if c in v:
264
+ TASK_NAME_MAPPING[k].append(subject)
265
+
266
+
267
+ choices = ["A", "B", "C", "D"]
268
+
269
+
270
+ def main(args):
271
+ model, tokenizer = load_models_tokenizer(args)
272
+
273
+ test_result = {}
274
+ for subject_name in tqdm(subcategories.keys()):
275
+ dev_file_path = os.path.join(args.eval_data_path, "dev", f"{subject_name}.csv")
276
+ test_file_path = os.path.join(
277
+ args.eval_data_path, "test", f"{subject_name}.csv"
278
+ )
279
+ dev_df = pd.read_csv(dev_file_path)
280
+ test_df = pd.read_csv(test_file_path)
281
+
282
+ score = eval_subject(
283
+ model,
284
+ tokenizer,
285
+ subject_name,
286
+ dev_df=dev_df,
287
+ test_df=test_df,
288
+ k=5,
289
+ few_shot=True,
290
+ save_result_dir=f"outs/cmmlu_eval_result",
291
+ )
292
+ test_result[subject_name] = score
293
+ cal_cmmlu(test_result)
294
+
295
+
296
+ if __name__ == "__main__":
297
+ parser = argparse.ArgumentParser(description="Test HF checkpoint.")
298
+ parser.add_argument(
299
+ "-c",
300
+ "--checkpoint-path",
301
+ type=str,
302
+ help="Checkpoint path",
303
+ default="Qwen/Qwen-7B",
304
+ )
305
+ parser.add_argument("-s", "--seed", type=int, default=1234, help="Random seed")
306
+
307
+ """Provide extra arguments required for tasks."""
308
+ group = parser.add_argument_group(title="Evaluation options")
309
+ group.add_argument(
310
+ "-d", "--eval_data_path", type=str, required=True, help="Path to eval data"
311
+ )
312
+ group.add_argument(
313
+ "--max-seq-len",
314
+ type=int,
315
+ default=2048,
316
+ help="Size of the output generated text.",
317
+ )
318
+ group.add_argument(
319
+ "--debug", action="store_true", default=False, help="Print infos."
320
+ )
321
+
322
+ args = parser.parse_args()
323
+ set_seed(args.seed)
324
+
325
+ main(args)
eval/evaluate_gsm8k.py ADDED
@@ -0,0 +1,127 @@
1
+ import re
2
+ import torch
3
+ import argparse
4
+ import jsonlines
5
+ import numpy as np
6
+ import datasets
7
+ from datasets import load_from_disk, load_dataset
8
+ from transformers import AutoModelForCausalLM, AutoTokenizer
9
+ from transformers.generation import GenerationConfig
10
+
11
+
12
+ ANS_RE = re.compile(r"#### (\-?[0-9\.\,]+)")
13
+ INVALID_ANS = "[invalid]"
14
+
15
+
16
+ def doc_to_text(doc):
17
+ return (
18
+ fewshot_prompt
19
+ + "\nQuestion: "
20
+ + doc["question"]
21
+ + "\nLet's think step by step\n"
22
+ )
23
+
24
+
25
+ def decode(tokens_list, tokenizer, raw_text_len):
26
+ sents = []
27
+ # print(len(tokens_list))
28
+ for tokens in tokens_list:
29
+ tokens = tokens.cpu().numpy().tolist()
30
+ sent = tokenizer.tokenizer.decode(tokens[raw_text_len:])
31
+ sent = sent.split("<|endoftext|>")[0]
32
+ sent = sent.split("\n\n\n")[0]
33
+ sent = sent.split("\n\n")[0]
34
+ sent = sent.split("Question:")[0]
35
+ sents.append(sent)
36
+ return sents
37
+
38
+
39
+ def generate_sample(model, tokenizer, input_txt):
40
+ input_ids = tokenizer.tokenizer.encode(input_txt)
41
+ raw_text_len = len(input_ids)
42
+ context_enc = torch.tensor([input_ids]).to(model.device)
43
+ print(f"Input text: {input_txt}\n")
44
+ outputs = model.generate(context_enc)
45
+ output_text = decode(outputs, tokenizer, raw_text_len)[0]
46
+ print(f"\nOutput text: {output_text}\n")
47
+ return output_text
48
+
49
+
50
+ def extract_answer_hf(completion):
51
+ match = ANS_RE.search(completion)
52
+ if match:
53
+ match_str = match.group(1).strip()
54
+ match_str = match_str.replace(",", "")
55
+ return eval(match_str)
56
+ else:
57
+ return INVALID_ANS
58
+
59
+
60
+ def extract_answer(completion):
61
+ try:
62
+ last_number = re.findall(r"\d+", completion)[-1]
63
+ return eval(last_number)
64
+ except:
65
+ return INVALID_ANS
66
+
67
+
68
+ def is_correct(completion, answer):
69
+ gold = extract_answer_hf(answer)
70
+ assert gold != INVALID_ANS, "No ground truth answer found in the document."
71
+ return extract_answer(completion) == gold
72
+
73
+
74
+ if __name__ == "__main__":
75
+ parser = argparse.ArgumentParser(description="Test HF checkpoint.")
76
+ parser.add_argument(
77
+ "-c",
78
+ "--checkpoint-path",
79
+ type=str,
80
+ help="Checkpoint path",
81
+ default="Qwen/Qwen-7B",
82
+ )
83
+ parser.add_argument("-f", "--sample-input-file", type=str, default=None)
84
+ parser.add_argument(
85
+ "-o", "--sample-output-file", type=str, default="gsm8k_res.jsonl"
86
+ )
87
+
88
+ args = parser.parse_args()
89
+
90
+ fewshot_prompt = open("gsm8k_prompt.txt").read()
91
+ if args.sample_input_file is not None:
92
+ dataset = load_from_disk(args.sample_input_file)
93
+ else:
94
+ config = datasets.DownloadConfig(resume_download=True, max_retries=100)
95
+ dataset = load_dataset("gsm8k", "main", download_config=config)
96
+
97
+ test = dataset["test"]
98
+
99
+ print("Loading tokenizer ...")
100
+ tokenizer = AutoTokenizer.from_pretrained(
101
+ args.checkpoint_path, trust_remote_code=True
102
+ )
103
+
104
+ print("Loading model ...")
105
+ model = AutoModelForCausalLM.from_pretrained(
106
+ args.checkpoint_path, device_map="auto", trust_remote_code=True
107
+ ).eval()
108
+ model.generation_config = GenerationConfig.from_pretrained(
109
+ args.checkpoint_path, trust_remote_code=True
110
+ )
111
+ model.generation_config.do_sample = False
112
+
113
+ f_output = jsonlines.Writer(open(args.sample_output_file, "w", encoding="utf-8"))
114
+ tot_length = test.num_rows
115
+ acc_res = []
116
+ for doc in test:
117
+ context = doc_to_text(doc)
118
+ completion = generate_sample(model, tokenizer, context)
119
+ answer = doc["answer"]
120
+ acc = is_correct(completion, answer)
121
+ doc["completion"] = completion
122
+ doc["acc"] = acc
123
+ f_output.write(doc)
124
+ acc_res.append(acc)
125
+
126
+ f_output.close()
127
+ print("Acc: ", np.mean(acc_res))
eval/evaluate_humaneval.py ADDED
@@ -0,0 +1,85 @@
1
+ import argparse
2
+ import tqdm
3
+ import torch
4
+ import jsonlines
5
+ from transformers import AutoModelForCausalLM, AutoTokenizer
6
+ from transformers.generation import GenerationConfig
7
+
8
+ """
9
+ git clone https://github.com/openai/human-eval
10
+ $ pip install -e human-eval
11
+ evaluate_functional_correctness sample-output-file
12
+ """
13
+
14
+
15
+ def decode(tokens_list, tokenizer, raw_text_len):
16
+ sents = []
17
+ # print(len(tokens_list))
18
+ for tokens in tokens_list:
19
+ tokens = tokens.cpu().numpy().tolist()
20
+ sent = tokenizer.tokenizer.decode(tokens[raw_text_len:])
21
+ sent = sent.split("<|endoftext|>")[0]
22
+ sent = sent.split("\n\n\n")[0]
23
+ sent = sent.split("\n\n")[0]
24
+ sent = sent.split("def ")[0]
25
+ sents.append(sent)
26
+ return sents
27
+
28
+
29
+ def generate_sample(model, tokenizer, input_txt):
30
+ input_ids = tokenizer.tokenizer.encode(input_txt)
31
+ raw_text_len = len(input_ids)
32
+ context_enc = torch.tensor([input_ids]).to(model.device)
33
+ print(f"Input text: {input_txt}\n")
34
+ outputs = model.generate(context_enc)
35
+ output_text = decode(outputs, tokenizer, raw_text_len)[0]
36
+ print(f"\nOutput text: \n{output_text}\n")
37
+ return output_text
38
+
39
+
40
+ if __name__ == "__main__":
41
+ parser = argparse.ArgumentParser(description="Test HF checkpoint.")
42
+ parser.add_argument(
43
+ "-c",
44
+ "--checkpoint-path",
45
+ type=str,
46
+ help="Checkpoint path",
47
+ default="Qwen/Qwen-7B",
48
+ )
49
+ parser.add_argument(
50
+ "-f",
51
+ "--sample-input-file",
52
+ type=str,
53
+ default=None,
54
+ help="data path to HumanEval.jsonl",
55
+ )
56
+ parser.add_argument(
57
+ "-o", "--sample-output-file", type=str, default="HumanEval_res.jsonl"
58
+ )
59
+
60
+ args = parser.parse_args()
61
+ print("Loading tokenizer ...")
62
+ tokenizer = AutoTokenizer.from_pretrained(
63
+ args.checkpoint_path, trust_remote_code=True
64
+ )
65
+
66
+ print("Loading model ...")
67
+ model = AutoModelForCausalLM.from_pretrained(
68
+ args.checkpoint_path, device_map="auto", trust_remote_code=True
69
+ ).eval()
70
+ model.generation_config = GenerationConfig.from_pretrained(
71
+ args.checkpoint_path, trust_remote_code=True
72
+ )
73
+ model.generation_config.do_sample = False
74
+
75
+ f_output = jsonlines.Writer(open(args.sample_output_file, "w", encoding="utf-8"))
76
+
77
+ f = jsonlines.open(args.sample_input_file)
78
+ with f_output as output:
79
+ for jobj in tqdm.tqdm(f, desc="task_idx"):
80
+ prompt = jobj["prompt"]
81
+ task_id = jobj["task_id"]
82
+ gen_sents = generate_sample(model, tokenizer, prompt)
83
+ gen_jobjs = {"task_id": task_id, "completion": gen_sents}
84
+ output.write(gen_jobjs)
85
+ f_output.close()
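A minimal invocation sketch following the setup notes in the docstring above; the HumanEval.jsonl path is illustrative and must point at the uncompressed problem file from the human-eval repository:
    git clone https://github.com/openai/human-eval
    pip install -e human-eval
    python eval/evaluate_humaneval.py -c Qwen/Qwen-7B -f HumanEval.jsonl -o HumanEval_res.jsonl
    evaluate_functional_correctness HumanEval_res.jsonl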
eval/evaluate_mmlu.py ADDED
@@ -0,0 +1,315 @@
1
+ import os
2
+ from typing import List
3
+ import pandas as pd
4
+ import numpy as np
5
+ import argparse
6
+ import torch
7
+ from tqdm import tqdm
8
+ from transformers.trainer_utils import set_seed
9
+ from transformers import AutoModelForCausalLM, AutoTokenizer
10
+ from transformers.generation import GenerationConfig
11
+
12
+ """
13
+ wget https://people.eecs.berkeley.edu/~hendrycks/data.tar
14
+ mkdir data/mmlu
15
+ mv data.tar data/mmlu
16
+ cd data/mmlu; tar xf data.tar
17
+ cd ../../
18
+ python eval/evaluate_mmlu.py -d data/mmlu/data/
19
+ """
20
+
21
+
22
+ def load_models_tokenizer(args):
23
+ tokenizer = AutoTokenizer.from_pretrained(
24
+ args.checkpoint_path, trust_remote_code=True
25
+ )
26
+ model = AutoModelForCausalLM.from_pretrained(
27
+ args.checkpoint_path, device_map="auto", trust_remote_code=True
28
+ ).eval()
29
+ model.generation_config = GenerationConfig.from_pretrained(
30
+ args.checkpoint_path, trust_remote_code=True
31
+ )
32
+ return model, tokenizer
33
+
34
+
35
+ def format_example(line, include_answer=True):
36
+ example = "Question: " + line["question"]
37
+ for choice in choices:
38
+ example += f'\n{choice}. {line[f"{choice}"]}'
39
+
40
+ if include_answer:
41
+ example += "\nAnswer: " + line["answer"] + "\n\n"
42
+ else:
43
+ example += "\nAnswer:"
44
+ return example
45
+
46
+
47
+ def generate_few_shot_prompt(k, subject, dev_df):
48
+ def format_subject(subject):
49
+ l = subject.split("_")
50
+ s = ""
51
+ for entry in l:
52
+ s += " " + entry
53
+ return s.strip()
54
+
55
+ prompt = "The following are multiple choice questions (with answers) about {}.\n\n".format(
56
+ format_subject(subject)
57
+ )
58
+
59
+ if k == -1:
60
+ k = dev_df.shape[0]
61
+ for i in range(k):
62
+ prompt += format_example(
63
+ dev_df.iloc[i, :],
64
+ include_answer=True,
65
+ )
66
+ return prompt
67
+
68
+
69
+ def get_logits(tokenizer, model, inputs: List[str]):
70
+ input_ids = tokenizer(inputs, padding=False)["input_ids"]
71
+ input_ids = torch.tensor(input_ids, device=model.device)
72
+
73
+ if input_ids.shape[1] > args.max_seq_len:
74
+ input_ids = input_ids[:, input_ids.shape[1] - args.max_seq_len + 1 :]
75
+ tokens = {"input_ids": input_ids}
76
+
77
+ outputs = model(input_ids)["logits"]
78
+ logits = outputs[:, -1, :]
79
+ log_probs = torch.nn.functional.softmax(logits, dim=-1)
80
+ return log_probs, {"tokens": tokens}
81
+
82
+
83
+ @torch.no_grad()
84
+ def eval_subject(
85
+ model,
86
+ tokenizer,
87
+ subject_name,
88
+ test_df,
89
+ k=5,
90
+ dev_df=None,
91
+ few_shot=False,
92
+ save_result_dir=None,
93
+ **kwargs,
94
+ ):
95
+ result = []
96
+ score = []
97
+
98
+ few_shot_prompt = (
99
+ generate_few_shot_prompt(k, subject_name, dev_df) if few_shot else []
100
+ )
101
+ all_probs = {"prob_A": [], "prob_B": [], "prob_C": [], "prob_D": []}
102
+ if args.debug:
103
+ print(f"few_shot_prompt: {few_shot_prompt}")
104
+
105
+ for _, row in tqdm(test_df.iterrows(), total=len(test_df)):
106
+ question = format_example(row, include_answer=False)
107
+ full_prompt = few_shot_prompt + question
108
+
109
+ output, input_info = get_logits(tokenizer, model, [full_prompt])
110
+ assert output.shape[0] == 1
111
+ logits = output.flatten()
112
+
113
+ softval = torch.nn.functional.softmax(
114
+ torch.tensor(
115
+ [
116
+ logits[tokenizer(" A")["input_ids"]],
117
+ logits[tokenizer(" B")["input_ids"]],
118
+ logits[tokenizer(" C")["input_ids"]],
119
+ logits[tokenizer(" D")["input_ids"]],
120
+ ]
121
+ ),
122
+ dim=0,
123
+ )
124
+ if softval.dtype in {torch.bfloat16, torch.float16}:
125
+ softval = softval.to(dtype=torch.float32)
126
+ probs = softval.detach().cpu().numpy()
127
+
128
+ for i, choice in enumerate(choices):
129
+ all_probs[f"prob_{choice}"].append(probs[i])
130
+ pred = {0: "A", 1: "B", 2: "C", 3: "D"}[np.argmax(probs)]
131
+
132
+ if "answer" in row:
133
+ correct = 1 if pred == row["answer"] else 0
134
+ score.append(correct)
135
+ if args.debug:
136
+ print(f'{question} pred: {pred} ref: {row["answer"]}')
137
+ result.append(pred)
138
+
139
+ if save_result_dir:
140
+ test_df["model_output"] = result
141
+ for i, choice in enumerate(choices):
142
+ test_df[f"prob_{choice}"] = all_probs[f"prob_{choice}"]
143
+ if score:
144
+ test_df["correctness"] = score
145
+ os.makedirs(save_result_dir, exist_ok=True)
146
+ test_df.to_csv(
147
+ os.path.join(save_result_dir, f"{subject_name}_result.csv"),
148
+ encoding="utf-8",
149
+ index=False,
150
+ )
151
+
152
+ return score
153
+
154
+
155
+ def cal_mmlu(res):
156
+ acc_sum_dict = dict()
157
+ acc_norm_sum_dict = dict()
158
+ cnt_dict = dict()
159
+ acc_sum = 0.0
160
+ cnt = 0
161
+ hard_cnt = 0
162
+ hard_acc_sum = 0.0
163
+
164
+ for class_ in TASK_NAME_MAPPING.keys():
165
+ acc_sum_dict[class_] = 0.0
166
+ acc_norm_sum_dict[class_] = 0.0
167
+ cnt_dict[class_] = 0.0
168
+
169
+ for tt in TASK_NAME_MAPPING[class_]:
170
+ acc_sum += sum(res[tt])
171
+ cnt += len(res[tt])
172
+
173
+ acc_sum_dict[class_] += sum(res[tt])
174
+ cnt_dict[class_] += len(res[tt])
175
+
176
+ print("\n\n\n", "total cnt:", cnt, "\n")
177
+ for k in TASK_NAME_MAPPING.keys():
178
+ if k in cnt_dict:
179
+ print("%s ACC: %.2f " % (k, acc_sum_dict[k] / cnt_dict[k] * 100))
180
+ print("AVERAGE ACC:%.2f " % (acc_sum / cnt * 100))
181
+
182
+
183
+ def main(args):
184
+ model, tokenizer = load_models_tokenizer(args)
185
+
186
+ dev_result = {}
187
+ for subject_name in tqdm(SUBJECTS):
188
+ # val_file_path = os.path.join(args.eval_data_path, 'val', f'{subject_name}_val.csv')
189
+ dev_file_path = os.path.join(
190
+ args.eval_data_path, "dev", f"{subject_name}_dev.csv"
191
+ )
192
+ test_file_path = os.path.join(
193
+ args.eval_data_path, "test", f"{subject_name}_test.csv"
194
+ )
195
+ # val_df = pd.read_csv(val_file_path, names=['question','A','B','C','D','answer'])
196
+ dev_df = pd.read_csv(
197
+ dev_file_path, names=["question", "A", "B", "C", "D", "answer"]
198
+ )
199
+ test_df = pd.read_csv(
200
+ test_file_path, names=["question", "A", "B", "C", "D", "answer"]
201
+ )
202
+
203
+ score = eval_subject(
204
+ model,
205
+ tokenizer,
206
+ subject_name,
207
+ test_df,
208
+ dev_df=dev_df,
209
+ k=5,
210
+ few_shot=True,
211
+ save_result_dir=f"outs/mmlu_eval_result",
212
+ )
213
+ dev_result[subject_name] = score
214
+ cal_mmlu(dev_result)
215
+
216
+
217
+ TASK_NAME_MAPPING = {
218
+ "stem": [
219
+ "abstract_algebra",
220
+ "anatomy",
221
+ "astronomy",
222
+ "college_biology",
223
+ "college_chemistry",
224
+ "college_computer_science",
225
+ "college_mathematics",
226
+ "college_physics",
227
+ "computer_security",
228
+ "conceptual_physics",
229
+ "electrical_engineering",
230
+ "elementary_mathematics",
231
+ "high_school_biology",
232
+ "high_school_chemistry",
233
+ "high_school_computer_science",
234
+ "high_school_mathematics",
235
+ "high_school_physics",
236
+ "high_school_statistics",
237
+ "machine_learning",
238
+ ],
239
+ "Humanities": [
240
+ "formal_logic",
241
+ "high_school_european_history",
242
+ "high_school_us_history",
243
+ "high_school_world_history",
244
+ "international_law",
245
+ "jurisprudence",
246
+ "logical_fallacies",
247
+ "moral_disputes",
248
+ "moral_scenarios",
249
+ "philosophy",
250
+ "prehistory",
251
+ "professional_law",
252
+ "world_religions",
253
+ ],
254
+ "other": [
255
+ "business_ethics",
256
+ "college_medicine",
257
+ "human_aging",
258
+ "management",
259
+ "marketing",
260
+ "medical_genetics",
261
+ "miscellaneous",
262
+ "nutrition",
263
+ "professional_accounting",
264
+ "professional_medicine",
265
+ "virology",
266
+ "global_facts",
267
+ "clinical_knowledge",
268
+ ],
269
+ "social": [
270
+ "econometrics",
271
+ "high_school_geography",
272
+ "high_school_government_and_politics",
273
+ "high_school_macroeconomics",
274
+ "high_school_microeconomics",
275
+ "high_school_psychology",
276
+ "human_sexuality",
277
+ "professional_psychology",
278
+ "public_relations",
279
+ "security_studies",
280
+ "sociology",
281
+ "us_foreign_policy",
282
+ ],
283
+ }
284
+ SUBJECTS = [v for vl in TASK_NAME_MAPPING.values() for v in vl]
285
+ choices = ["A", "B", "C", "D"]
286
+
287
+ if __name__ == "__main__":
288
+ parser = argparse.ArgumentParser(description="Test HF checkpoint.")
289
+ parser.add_argument(
290
+ "-c",
291
+ "--checkpoint-path",
292
+ type=str,
293
+ help="Checkpoint path",
294
+ default="Qwen/Qwen-7B",
295
+ )
296
+ parser.add_argument("-s", "--seed", type=int, default=1234, help="Random seed")
297
+ parser.add_argument("--gpu", type=int, default=0, help="gpu id")
298
+
299
+ """Provide extra arguments required for tasks."""
300
+ group = parser.add_argument_group(title="Evaluation options")
301
+ group.add_argument("-d", "--eval_data_path", type=str, help="Path to eval data")
302
+ group.add_argument(
303
+ "--max-seq-len",
304
+ type=int,
305
+ default=2048,
306
+ help="Size of the output generated text.",
307
+ )
308
+ group.add_argument(
309
+ "--debug", action="store_true", default=False, help="Print infos."
310
+ )
311
+
312
+ args = parser.parse_args()
313
+ set_seed(args.seed)
314
+
315
+ main(args)
eval/evaluate_plugin.py ADDED
@@ -0,0 +1,325 @@
1
+ import argparse
2
+ import json
3
+ import os
4
+ import pprint
5
+
6
+ import json5
7
+ import jsonlines
8
+ from rouge_score import rouge_scorer
9
+ from tqdm import tqdm
10
+ from transformers import Agent, AutoModelForCausalLM, AutoTokenizer
11
+ from transformers.generation import GenerationConfig
12
+ from transformers.tools.evaluate_agent import evaluate_agent
13
+ from transformers.trainer_utils import set_seed
14
+
15
+ data_root_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "data")
16
+
17
+
18
+ def is_callable(response, golden):
19
+ return response["action"].strip().lower() == golden["action"].strip().lower()
20
+
21
+
22
+ def process_res(response):
23
+ # parse response
24
+ response += "\n" # fix not-find bug
25
+ thought = response[: response.find("Action:")].strip()
26
+ action = response[
27
+ response.find("Action:") + len("Action:") : response.find("Action Input:")
28
+ ].strip()
29
+ action_input = response[
30
+ response.find("Action Input:")
31
+ + len("Action Input:") : response.find("Observation:")
32
+ ].strip()
33
+ # TODO: This parsing result is incorrect if the response contains multiple Actions. To be fixed in the future.
34
+ observation = response[
35
+ response.find("Observation:") + len("Observation:") : response.rfind("Thought:")
36
+ ].strip()
37
+ thought_last = response[
38
+ response.rfind("Thought:") + len("Thought:") : response.find("Final Answer:")
39
+ ].strip()
40
+ final_answer = response[
41
+ response.find("Final Answer:") + len("Final Answer:") :
42
+ ].strip()
43
+ try:
44
+ action_input = json.dumps(
45
+ json5.loads(action_input), ensure_ascii=False, sort_keys=True
46
+ )
47
+ except:
48
+ # print("JSON Load Error:", action_input)
49
+ pass
50
+ res_dict = {
51
+ "thought": thought,
52
+ "action": action,
53
+ "action_input": action_input,
54
+ "observation": observation,
55
+ "thought_last": thought_last,
56
+ "final_answer": final_answer,
57
+ }
58
+ return res_dict
59
+
60
+
61
+ class _DummyTokenizer:
62
+ def tokenize(self, text: str):
63
+ return text.split()
64
+
65
+
66
+ def _get_tokenized_string(tokenizer, text_list):
67
+ token_ids_list, tokenized_string_list = [], []
68
+ for text in text_list:
69
+ assert tokenizer is not None
70
+ token_ids = tokenizer.encode(text)
71
+ tokens_bytes = tokenizer.convert_ids_to_tokens(token_ids)
72
+ tokens = [token.decode("utf-8", errors="replace") for token in tokens_bytes]
73
+ tokenized_string = " ".join(tokens)
74
+ token_ids_list.append(token_ids)
75
+ tokenized_string_list.append(tokenized_string)
76
+ return token_ids_list, tokenized_string_list
77
+
78
+
79
+ def eval_action(job):
80
+ response = job["gen"][0]
81
+ golden = job["response"]
82
+
83
+ if "Action:" in response:
84
+ response, golden = process_res(response), process_res(golden)
85
+ if is_callable(response, golden):
86
+ return True
87
+ return False
88
+
89
+
90
+ def eval_action_input(job, tokenizer):
91
+ response = job["gen"][0]
92
+ golden = job["response"]
93
+ response, golden = process_res(response), process_res(golden)
94
+ query = job["prompt"]
95
+
96
+ job = {}
97
+ job["prompt"] = query
98
+ job["gen"] = response["action_input"]
99
+ job["response"] = golden["action_input"]
100
+
101
+ job["_gen_tok"], job["_gen_tok_str"] = _get_tokenized_string(
102
+ tokenizer, [response["action_input"]]
103
+ )
104
+ job["_reference_tok"], job["_reference_tok_str"] = _get_tokenized_string(
105
+ tokenizer, [golden["action_input"]]
106
+ )
107
+
108
+ scorer = rouge_scorer.RougeScorer(
109
+ ["rouge1", "rouge2", "rougeL"], tokenizer=_DummyTokenizer()
110
+ )
111
+ score = scorer.score(job["_reference_tok_str"][0], job["_gen_tok_str"][0])
112
+
113
+ rouge = score["rougeL"].fmeasure
114
+
115
+ return rouge
116
+
117
+
118
+ class QWenAgent(Agent):
119
+ """
120
+ Agent that uses QWen model and tokenizer to generate code.
121
+
122
+ Example:
123
+
124
+ ```py
125
+ agent = QWenAgent()
126
+ agent.run("Draw me a picture of rivers and lakes.")
127
+ ```
128
+ """
129
+
130
+ def __init__(
131
+ self,
132
+ chat_prompt_template=None,
133
+ run_prompt_template=None,
134
+ additional_tools=None,
135
+ tokenizer=None,
136
+ model=None,
137
+ ):
138
+ if tokenizer and model:
139
+ self.tokenizer = tokenizer
140
+ self.model = model
141
+ else:
142
+ checkpoint = "Qwen/Qwen-7B-Chat"
143
+ self.tokenizer = AutoTokenizer.from_pretrained(
144
+ checkpoint, trust_remote_code=True
145
+ )
146
+ self.model = (
147
+ AutoModelForCausalLM.from_pretrained(
148
+ checkpoint, device_map="auto", trust_remote_code=True
149
+ )
150
+ .cuda()
151
+ .eval()
152
+ )
153
+ self.model.generation_config = GenerationConfig.from_pretrained(
154
+ checkpoint, trust_remote_code=True
155
+ ) # generation hyperparameters such as maximum generation length and top_p can be specified here
156
+ self.model.generation_config.do_sample = False # greedy
157
+
158
+ super().__init__(
159
+ chat_prompt_template=chat_prompt_template,
160
+ run_prompt_template=run_prompt_template,
161
+ additional_tools=additional_tools,
162
+ )
163
+
164
+ def generate_one(self, prompt, stop):
165
+ # "Human:" 和 "Assistant:" 曾为通义千问的特殊保留字,需要替换为 "_HUMAN_:" 和 "_ASSISTANT_:"。这一问题将在未来版本修复。
166
+ prompt = prompt.replace("Human:", "_HUMAN_:").replace(
167
+ "Assistant:", "_ASSISTANT_:"
168
+ )
169
+ stop = [
170
+ item.replace("Human:", "_HUMAN_:").replace("Assistant:", "_ASSISTANT_:")
171
+ for item in stop
172
+ ]
173
+
174
+ result, _ = self.model.chat(self.tokenizer, prompt, history=None)
175
+ for stop_seq in stop:
176
+ if result.endswith(stop_seq):
177
+ result = result[: -len(stop_seq)]
178
+
179
+ result = result.replace("_HUMAN_:", "Human:").replace(
180
+ "_ASSISTANT_:", "Assistant:"
181
+ )
182
+ return result
183
+
184
+
185
+ def load_models_tokenizer(args):
186
+ tokenizer = AutoTokenizer.from_pretrained(
187
+ args.checkpoint_path, trust_remote_code=True
188
+ )
189
+ model = AutoModelForCausalLM.from_pretrained(
190
+ args.checkpoint_path,
191
+ device_map="auto",
192
+ trust_remote_code=True,
193
+ bf16=True,
194
+ use_flash_attn=True,
195
+ ).eval()
196
+ model.generation_config = GenerationConfig.from_pretrained(
197
+ args.checkpoint_path, trust_remote_code=True
198
+ )
199
+ model.generation_config.do_sample = False # use greedy decoding
200
+ return model, tokenizer
201
+
202
+
203
+ def load_jobs(filename):
204
+ jobs = []
205
+ with jsonlines.open(os.path.join(data_root_path, filename), mode="r") as reader:
206
+ for job in reader:
207
+ jobs.append(job)
208
+ return jobs
209
+
210
+
211
+ def react_inference(filename, model, tokenizer):
212
+ filename_cache = filename + ".cache"
213
+ if os.path.exists(os.path.join(data_root_path, filename_cache)):
214
+ jobs = load_jobs(filename=filename_cache)
215
+ print("Loaded from", filename_cache)
216
+ else:
217
+ with open(os.path.join(data_root_path, filename_cache), "w") as f:
218
+ jobs = load_jobs(filename=filename)
219
+ print("Inference:", filename)
220
+ for job in tqdm(jobs):
221
+ response, history = model.chat(tokenizer, job["prompt"], history=None)
222
+ job["gen"] = [response]
223
+ f.writelines(json.dumps(job, ensure_ascii=False) + "\n")
224
+ print(filename_cache, "is saved.")
225
+ return jobs
226
+
227
+
228
+ def main(args):
229
+ print("loading model weights")
230
+ if args.checkpoint_path is not None:
231
+ model, tokenizer = load_models_tokenizer(args)
232
+ else:
233
+ model, tokenizer = None, None
234
+ print("model loaded")
235
+
236
+ result = {}
237
+ # eval react positive
238
+ if args.eval_react_positive:
239
+ print("eval react positive ...")
240
+ acc_count = 0
241
+ rouge_mean = 0
242
+ jobs = react_inference(
243
+ filename=args.eval_react_positive_filename, model=model, tokenizer=tokenizer
244
+ )
245
+ for job in jobs:
246
+ if eval_action(job):
247
+ acc_count += 1
248
+ rouge = eval_action_input(job, tokenizer)
249
+ rouge_mean += rouge / len(jobs)
250
+
251
+ scores = {
252
+ "action_right_rate": acc_count / len(jobs),
253
+ "action_input_rouge": rouge_mean,
254
+ }
255
+
256
+ result.update({"react_positive": scores})
257
+
258
+ # eval react negative
259
+ if args.eval_react_negative:
260
+ print("eval react negative ...")
261
+ bad_count = 0
262
+ jobs = react_inference(
263
+ filename=args.eval_react_negative_filename, model=model, tokenizer=tokenizer
264
+ )
265
+ for job in jobs:
266
+ if "\nAction:" in job["gen"][0]:
267
+ bad_count += 1
268
+ scores = {"bad_rate": bad_count / len(jobs)}
269
+ result.update({"react_negative": scores})
270
+
271
+ # eval hfagent
272
+ if args.eval_hfagent:
273
+ print("eval hfagent ...")
274
+ agent = QWenAgent(model=model, tokenizer=tokenizer)
275
+ scores = evaluate_agent(agent, verbose=False, return_errors=False)
276
+ result.update({"hfagent": scores})
277
+
278
+ pp = pprint.PrettyPrinter(indent=4)
279
+ pp.pprint(result)
280
+
281
+
282
+ if __name__ == "__main__":
283
+ parser = argparse.ArgumentParser(description="Test HF checkpoint.")
284
+ parser.add_argument(
285
+ "-c",
286
+ "--checkpoint-path",
287
+ type=str,
288
+ help="Checkpoint path",
289
+ default="Qwen/Qwen-7B-Chat",
290
+ )
291
+ parser.add_argument("-s", "--seed", type=int, default=1234, help="Random seed")
292
+ """Provide extra arguments required for tasks."""
293
+ group = parser.add_argument_group(title="Evaluation options")
294
+ group.add_argument(
295
+ "--eval-react-positive",
296
+ action="store_true",
297
+ default=False,
298
+ help="Eval react positive.",
299
+ )
300
+ group.add_argument(
301
+ "--eval-react-positive-filename",
302
+ type=str,
303
+ default="exam_plugin_v1_react_positive.jsonl",
304
+ help="Eval react positive filename.",
305
+ )
306
+ group.add_argument(
307
+ "--eval-react-negative",
308
+ action="store_true",
309
+ default=False,
310
+ help="Eval react negative.",
311
+ )
312
+ group.add_argument(
313
+ "--eval-react-negative-filename",
314
+ type=str,
315
+ default="exam_plugin_v1_react_negative.jsonl",
316
+ help="Eval react negative filename.",
317
+ )
318
+ group.add_argument(
319
+ "--eval-hfagent", action="store_true", default=False, help="Eval hfagent."
320
+ )
321
+
322
+ args = parser.parse_args()
323
+ set_seed(args.seed)
324
+
325
+ main(args)
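A minimal invocation sketch based on the argparse flags defined above; the evaluation *.jsonl files are expected under eval/data/ (see data_root_path), and the checkpoint shown is only the illustrative default:
    python eval/evaluate_plugin.py -c Qwen/Qwen-7B-Chat --eval-react-positive --eval-react-negative --eval-hfagent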
eval/gsm8k_prompt.txt ADDED
@@ -0,0 +1,59 @@
1
+ Question: In 2004, there were 60 kids at a cookout. In 2005, half the number of kids came to the cookout as compared to 2004. In 2006, 2/3 as many kids came to the cookout as in 2005. How many kids came to the cookout in 2006?
2
+ Let's think step by step
3
+ In 2005, 60/2=30 kids came to the cookout.
4
+ In 2006, 30/3*2=20 kids came to the cookout.
5
+ The answer is 20
6
+
7
+ Question: Zilla spent 7% of her monthly earnings on rent, half of it on her other monthly expenses, and put the rest in her savings. If she spent $133 on her rent, how much does she deposit into her savings account in a month?
8
+ Let's think step by step
9
+ Since $133 is equal to 7% of her earnings, then 1% is equal to $133/7 = $19.
10
+ The total monthly earning of Zilla is represented by 100%, so $19 x 100 = $1900 is her monthly earnings.
11
+ So, $1900/2 = $950 is spent on her other monthly expenses.
12
+ The total amount spent on the rent and other monthly expenses is $133 + $950 = $1083.
13
+ Hence, she saves $1900 - $1083 = $817 per month.
14
+ The answer is 817
15
+
16
+ Question: If Buzz bought a pizza with 78 slices at a restaurant and then decided to share it with the waiter in the ratio of 5:8, with Buzz's ratio being 5, what's twenty less the number of slices of pizza that the waiter ate?
17
+ Let's think step by step
18
+ The total ratio representing the slices of pizza that Buzz bought is 5+8=13
19
+ If he shared the slices of pizza with the waiter, the waiter received a fraction of 8/13 of the total number of slices, which totals 8/13 * 78 = 48 slices
20
+ Twenty less the number of slices of pizza that the waiter ate is 48-20 = 28
21
+ The answer is 28
22
+
23
+ Question: Jame gets a raise to $20 per hour and works 40 hours a week. His old job was $16 an hour for 25 hours per week. How much more money does he make per year in his new job than the old job if he works 52 weeks a year?
24
+ Let's think step by step
25
+ He makes 20*40=$800 per week
26
+ He used to make 16*25=$400 per week
27
+ So his raise was 800-400=$400 per week
28
+ So he makes 400*52=$20,800 per year more
29
+ The answer is 20800
30
+
31
+ Question: Mr. Gardner bakes 20 cookies, 25 cupcakes, and 35 brownies for his second-grade class of 20 students. If he wants to give each student an equal amount of sweet treats, how many sweet treats will each student receive?
32
+ Let's think step by step
33
+ Mr. Gardner bakes a total of 20 + 25 + 35 = 80 sweet treats
34
+ Each student will receive 80 / 20 = 4 sweet treats
35
+ The answer is 4
36
+
37
+ Question: A used car lot has 24 cars and motorcycles (in total) for sale. A third of the vehicles are motorcycles, and a quarter of the cars have a spare tire included. How many tires are on the used car lot’s vehicles in all?
38
+ Let's think step by step
39
+ The used car lot has 24 / 3 = 8 motorcycles with 2 tires each.
40
+ The lot has 24 - 8 = 16 cars for sale
41
+ There are 16 / 4 = 4 cars with a spare tire with 5 tires each.
42
+ The lot has 16 - 4 = 12 cars with 4 tires each.
43
+ Thus, the used car lot’s vehicles have 8 * 2 + 4 * 5 + 12 * 4 = 16 + 20 + 48 = 84 tires in all.
44
+ The answer is 84
45
+
46
+ Question: Norma takes her clothes to the laundry. She leaves 9 T-shirts and twice as many sweaters as T-shirts in the washer. When she returns she finds 3 sweaters and triple the number of T-shirts. How many items are missing?
47
+ Let's think step by step
48
+ Norma left 9 T-shirts And twice as many sweaters, she took 9 * 2= 18 sweaters
49
+ Adding the T-shirts and sweaters, Norma left 9 + 18 = 27 clothes
50
+ When she came back, she found 3 sweaters And triple the number of T-shirts, she found 3 * 3 = 9 T-shirts
51
+ Adding the T-shirts and sweaters, Norma found 3 + 9 = 12 clothes
52
+ Subtracting the clothes she left from the clothes she found, 27 - 12 = 15 clothes are missing
53
+ The answer is 15
54
+
55
+ Question: Adam has an orchard. Every day for 30 days he picks 4 apples from his orchard. After a month, Adam has collected all the remaining apples, which were 230. How many apples in total has Adam collected from his orchard?
56
+ Let's think step by step
57
+ During 30 days Adam picked 4 * 30 = 120 apples.
58
+ So in total with all the remaining apples, he picked 120 + 230 = 350 apples from his orchard.
59
+ The answer is 350