znskiss committed
Commit ade0520
1 Parent(s): 81bb514

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes.

Files changed (50)
  1. .gitattributes +7 -0
  2. .github/ISSUE_TEMPLATE/bug_report.yaml +88 -0
  3. .github/ISSUE_TEMPLATE/config.yaml +1 -0
  4. .github/ISSUE_TEMPLATE/feature_request.yaml +78 -0
  5. .gitignore +11 -0
  6. FAQ.md +85 -0
  7. FAQ_zh.md +80 -0
  8. LICENSE +53 -0
  9. NOTICE +52 -0
  10. README.md +432 -7
  11. README_CN.md +436 -0
  12. README_JA.md +435 -0
  13. assets/cli_demo.gif +3 -0
  14. assets/hfagent_chat_1.png +3 -0
  15. assets/hfagent_chat_2.png +3 -0
  16. assets/hfagent_run.png +3 -0
  17. assets/logo.jpg +0 -0
  18. assets/openai_api.gif +3 -0
  19. assets/performance.png +0 -0
  20. assets/qwen_tokenizer.png +0 -0
  21. assets/react_showcase_001.png +0 -0
  22. assets/react_showcase_002.png +0 -0
  23. assets/react_tutorial_001.png +0 -0
  24. assets/react_tutorial_002.png +0 -0
  25. assets/tokenizer.pdf +0 -0
  26. assets/tokenizer.png +0 -0
  27. assets/wanx_colorful_black.png +3 -0
  28. assets/web_demo.gif +3 -0
  29. cli_demo.py +194 -0
  30. eval/EVALUATION.md +83 -0
  31. eval/evaluate_ceval.py +263 -0
  32. eval/evaluate_chat_ceval.py +290 -0
  33. eval/evaluate_chat_gsm8k.py +137 -0
  34. eval/evaluate_chat_humaneval.py +82 -0
  35. eval/evaluate_chat_mmlu.py +207 -0
  36. eval/evaluate_cmmlu.py +271 -0
  37. eval/evaluate_gsm8k.py +110 -0
  38. eval/evaluate_humaneval.py +70 -0
  39. eval/evaluate_mmlu.py +218 -0
  40. eval/evaluate_plugin.py +308 -0
  41. eval/gsm8k_prompt.txt +59 -0
  42. examples/langchain_tooluse.ipynb +708 -0
  43. examples/react_demo.py +288 -0
  44. examples/react_prompt.md +249 -0
  45. examples/tokenizer_showcase.ipynb +441 -0
  46. examples/transformers_agent.md +108 -0
  47. openai_api.py +211 -0
  48. requirements.txt +6 -0
  49. requirements_web_demo.txt +2 -0
  50. tech_memo.md +341 -0
.gitattributes CHANGED
@@ -33,3 +33,10 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/cli_demo.gif filter=lfs diff=lfs merge=lfs -text
+ assets/hfagent_chat_1.png filter=lfs diff=lfs merge=lfs -text
+ assets/hfagent_chat_2.png filter=lfs diff=lfs merge=lfs -text
+ assets/hfagent_run.png filter=lfs diff=lfs merge=lfs -text
+ assets/openai_api.gif filter=lfs diff=lfs merge=lfs -text
+ assets/wanx_colorful_black.png filter=lfs diff=lfs merge=lfs -text
+ assets/web_demo.gif filter=lfs diff=lfs merge=lfs -text
.github/ISSUE_TEMPLATE/bug_report.yaml ADDED
@@ -0,0 +1,88 @@
1
+ name: 🐞 Bug
2
+ description: 提交错误报告 | File a bug/issue
3
+ title: "[BUG] <title>"
4
+ labels: []
5
+ body:
6
+ - type: checkboxes
7
+ attributes:
8
+ label: 是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
9
+ description: |
10
+ 请先搜索您遇到的错误是否在已有的issues或讨论中提到过。
11
+ Please search to see if an issue / discussion already exists for the bug you encountered.
12
+ [Issues](https://github.com/QwenLM/Qwen-7B/issues)
13
+ [Discussions](https://github.com/QwenLM/Qwen-7B/discussions)
14
+ options:
15
+ - label: 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions
16
+ required: true
17
+ - type: checkboxes
18
+ attributes:
19
+ label: 该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
20
+ description: |
21
+ 请先搜索您遇到的错误是否已在FAQ中有相关解答。
22
+ Please search to see if an answer already exists in FAQ for the bug you encountered.
23
+ [FAQ-en](https://github.com/QwenLM/Qwen-7B/blob/main/FAQ.md)
24
+ [FAQ-zh](https://github.com/QwenLM/Qwen-7B/blob/main/FAQ_zh.md)
25
+ options:
26
+ - label: 我已经搜索过FAQ | I have searched FAQ
27
+ required: true
28
+ - type: textarea
29
+ attributes:
30
+ label: 当前行为 | Current Behavior
31
+ description: |
32
+ 准确描述遇到的行为。
33
+ A concise description of what you're experiencing.
34
+ validations:
35
+ required: false
36
+ - type: textarea
37
+ attributes:
38
+ label: 期望行为 | Expected Behavior
39
+ description: |
40
+ 准确描述预期的行为。
41
+ A concise description of what you expected to happen.
42
+ validations:
43
+ required: false
44
+ - type: textarea
45
+ attributes:
46
+ label: 复现方法 | Steps To Reproduce
47
+ description: |
48
+ 复现当前行为的详细步骤。
49
+ Steps to reproduce the behavior.
50
+ placeholder: |
51
+ 1. In this environment...
52
+ 2. With this config...
53
+ 3. Run '...'
54
+ 4. See error...
55
+ validations:
56
+ required: false
57
+ - type: textarea
58
+ attributes:
59
+ label: 运行环境 | Environment
60
+ description: |
61
+ examples:
62
+ - **OS**: Ubuntu 20.04
63
+ - **Python**: 3.8
64
+ - **Transformers**: 4.31.0
65
+ - **PyTorch**: 2.0.1
66
+ - **CUDA**: 11.4
67
+ value: |
68
+ - OS:
69
+ - Python:
70
+ - Transformers:
71
+ - PyTorch:
72
+ - CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
73
+ render: Markdown
74
+ validations:
75
+ required: false
76
+ - type: textarea
77
+ attributes:
78
+ label: 备注 | Anything else?
79
+ description: |
80
+ 您可以在这里补充其他关于该问题背景信息的描述、链接或引用等。
81
+
82
+ 您可以通过点击高亮此区域然后拖动文件的方式上传图片或日志文件。
83
+
84
+ Links? References? Anything that will give us more context about the issue you are encountering!
85
+
86
+ Tip: You can attach images or log files by clicking this area to highlight it and then dragging files in.
87
+ validations:
88
+ required: false
.github/ISSUE_TEMPLATE/config.yaml ADDED
@@ -0,0 +1 @@
1
+ blank_issues_enabled: true
.github/ISSUE_TEMPLATE/feature_request.yaml ADDED
@@ -0,0 +1,78 @@
1
+ name: "💡 Feature Request"
2
+ description: 创建新功能请求 | Create a new ticket for a new feature request
3
+ title: "💡 [REQUEST] - <title>"
4
+ labels: [
5
+ "question"
6
+ ]
7
+ body:
8
+ - type: input
9
+ id: start_date
10
+ attributes:
11
+ label: "起始日期 | Start Date"
12
+ description: |
13
+ 起始开发日期
14
+ Start of development
15
+ placeholder: "month/day/year"
16
+ validations:
17
+ required: false
18
+ - type: textarea
19
+ id: implementation_pr
20
+ attributes:
21
+ label: "实现PR | Implementation PR"
22
+ description: |
23
+ 实现该功能的Pull request
24
+ Pull request used
25
+ placeholder: "#Pull Request ID"
26
+ validations:
27
+ required: false
28
+ - type: textarea
29
+ id: reference_issues
30
+ attributes:
31
+ label: "相关Issues | Reference Issues"
32
+ description: |
33
+ 与该功能相关的issues
34
+ Common issues
35
+ placeholder: "#Issues IDs"
36
+ validations:
37
+ required: false
38
+ - type: textarea
39
+ id: summary
40
+ attributes:
41
+ label: "摘要 | Summary"
42
+ description: |
43
+ 简要描述新功能的特点
44
+ Provide a brief explanation of the feature
45
+ placeholder: |
46
+ Describe in a few lines your feature request
47
+ validations:
48
+ required: true
49
+ - type: textarea
50
+ id: basic_example
51
+ attributes:
52
+ label: "基本示例 | Basic Example"
53
+ description: Indicate here some basic examples of your feature.
54
+ placeholder: A few specific words about your feature request.
55
+ validations:
56
+ required: true
57
+ - type: textarea
58
+ id: drawbacks
59
+ attributes:
60
+ label: "缺陷 | Drawbacks"
61
+ description: |
62
+ 该新功能有哪些缺陷/可能造成哪些影响?
63
+ What are the drawbacks/impacts of your feature request ?
64
+ placeholder: |
65
+ Identify the drawbacks and impacts while being neutral on your feature request
66
+ validations:
67
+ required: true
68
+ - type: textarea
69
+ id: unresolved_question
70
+ attributes:
71
+ label: "未解决问题 | Unresolved questions"
72
+ description: |
73
+ 有哪些尚未解决的问题?
74
+ What questions still remain unresolved ?
75
+ placeholder: |
76
+ Identify any unresolved issues.
77
+ validations:
78
+ required: false
.gitignore ADDED
@@ -0,0 +1,11 @@
1
+ __pycache__
2
+ *.so
3
+ build
4
+ .coverage_*
5
+ *.egg-info
6
+ *~
7
+ .vscode/
8
+ .idea/
9
+ .DS_Store
10
+
11
+ /private/
FAQ.md ADDED
@@ -0,0 +1,85 @@
1
+ # FAQ
2
+
3
+ ## Installation & Environment
4
+
5
+ #### Failure in installing flash attention
6
+
7
+ Flash attention is an option for accelerating training and inference. Only NVIDIA GPUs of Turing, Ampere, Ada, and Hopper architecture, e.g., H100, A100, RTX 3090, T4, RTX 2080, can support flash attention. You can use our models without installing it.
8
+
9
+ #### Which version of transformers should I use?
10
+
11
+ 4.31.0 is preferred.
12
+
13
+ #### I downloaded the codes and checkpoints but I can't load the model locally. What should I do?
14
+
15
+ Please check if you have updated the code to the latest, and correctly downloaded all the sharded checkpoint files.
16
+
17
+ #### `qwen.tiktoken` is not found. What is it?
18
+
19
+ This is the merge file of the tokenizer. You have to download it. Note that if you just git clone the repo without [git-lfs](https://git-lfs.com), you cannot download this file.
20
+
21
+ #### transformers_stream_generator/tiktoken/accelerate not found
22
+
23
+ Run the command `pip install -r requirements.txt`. You can find the file at [https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt).
24
+ <br><br>
25
+
26
+
27
+
28
+ ## Demo & Inference
29
+
30
+ #### Is there any demo? CLI demo and Web UI demo?
31
+
32
+ Yes, see `web_demo.py` for web demo and `cli_demo.py` for CLI demo. See README for more information.
33
+
34
+
35
+
36
+ #### Can I use CPU only?
37
+
38
+ Yes, running `python cli_demo.py --cpu_only` loads the model and runs inference on CPU only.
39
+
40
+ #### Can Qwen support streaming?
41
+
42
+ Yes. See the function `chat_stream` in `modeling_qwen.py`.
43
+
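For reference, a minimal usage sketch (not from the official docs) is given below. It assumes `chat_stream` yields progressively longer partial responses, as the CLI demo suggests; check `modeling_qwen.py` for the exact signature.

```python
# Hedged sketch: stream a reply from Qwen-7B-Chat, assuming chat_stream yields
# progressively longer partial responses.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True
).eval()

printed = ""
for partial in model.chat_stream(tokenizer, "你好", history=None):
    print(partial[len(printed):], end="", flush=True)  # print only the new suffix
    printed = partial
print()
```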
44
+ #### Gibberish in result when using chat_stream().
45
+
46
+ This is because tokens represent bytes and a single token may be a meaningless string. We have updated the default setting of our tokenizer to avoid such decoding results. Please update the code to the latest version.
47
+
48
+ #### It seems that the generation is not related to the instruction...
49
+
50
+ Please check if you are loading Qwen-7B-Chat instead of Qwen-7B. Qwen-7B is the base model without alignment, which behaves differently from the SFT/Chat model.
51
+
52
+ #### Is quantization supported?
53
+
54
+ Yes, the quantization is supported by `bitsandbytes`. We are working on an improved version and will release the quantized model checkpoints.
55
+
56
+ #### Errors in running quantized models: `importlib.metadata.PackageNotFoundError: No package metadata was found for bitsandbytes`
57
+
58
+ For Linux users, running `pip install bitsandbytes` directly can solve the problem. For Windows users, you can run `python -m pip install bitsandbytes --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui`.
59
+
60
+ #### Slow when processing long sequences
61
+
62
+ We solved this problem. Updating the code to the latest version can help.
63
+
64
+ #### Unsatisfactory performance in processing long sequences
65
+
66
+ Please ensure that NTK-aware interpolation is applied: `use_dynamic_ntk` and `use_logn_attn` in `config.json` should be set to `true` (they are `true` by default).
67
+ <br><br>
68
+
69
+
70
+
71
+ ## Finetuning
72
+
73
+ #### Can Qwen support SFT or even RLHF?
74
+
75
+ We do not provide finetuning or RLHF code for now. However, some projects already support finetuning, e.g., [FastChat](https://github.com/lm-sys/FastChat), [Firefly](https://github.com/yangjianxin1/Firefly), and [LLaMA Efficient Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning). We will update the relevant code soon.
76
+ <br><br>
77
+
78
+
79
+
80
+ ## Tokenizer
81
+
82
+ #### bos_id/eos_id/pad_id not found
83
+
84
+ In our training, we only use `<|endoftext|>` as the separator and padding token. You can set `bos_id`, `eos_id`, and `pad_id` to `tokenizer.eod_id`, as shown in the sketch below. See our tokenizer documentation to learn more.
85
+
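As a minimal sketch (assuming the standard `generate` keyword arguments), passing `tokenizer.eod_id` wherever an eos/pad id is expected looks like this:

```python
# Hedged sketch: use the id of <|endoftext|> (tokenizer.eod_id) as both the
# eos and pad id when generating, since it is the only separator/padding token
# used during training.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B", device_map="auto", trust_remote_code=True
).eval()

inputs = tokenizer("蒙古国的首都是", return_tensors="pt").to(model.device)
pred = model.generate(
    **inputs,
    eos_token_id=tokenizer.eod_id,  # <|endoftext|> doubles as eos
    pad_token_id=tokenizer.eod_id,  # and as the padding token
)
print(tokenizer.decode(pred[0], skip_special_tokens=True))
```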
FAQ_zh.md ADDED
@@ -0,0 +1,80 @@
1
+ # FAQ
2
+
3
+ ## 安装&环境
4
+
5
+ #### flash attention 安装失败
6
+
7
+ flash attention是一个用于加速模型训练推理的可选项,且仅适用于Turing、Ampere、Ada、Hopper架构的Nvidia GPU显卡(如H100、A100、RTX 3090、T4、RTX 2080),您可以在不安装flash attention的情况下正常使用模型进行推理。
8
+
9
+ #### 我应该用哪个transformers版本?
10
+
11
+ 建议使用4.31.0。
12
+
13
+ #### 我把模型和代码下到本地,按照教程无法使用,该怎么办?
14
+
15
+ 答:别着急,先检查你的代码是不是更新到最新版本,然后确认你是否完整地将模型checkpoint下到本地。
16
+
17
+ #### `qwen.tiktoken`这个文件找不到,怎么办?
18
+
19
+ 这个是我们的tokenizer的merge文件,你必须下载它才能使用我们的tokenizer。注意,如果你使用git clone却没有使用git-lfs,这个文件不会被下载。如果你不了解git-lfs,可点击[官网](https://git-lfs.com/)了解。
20
+
21
+ #### transformers_stream_generator/tiktoken/accelerate,这几个库提示找不到,怎么办?
22
+
23
+ 运行如下命令:`pip install -r requirements.txt`。相关依赖库在[https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt](https://github.com/QwenLM/Qwen-7B/blob/main/requirements.txt) 可以找到。
24
+ <br><br>
25
+
26
+
27
+ ## Demo & 推理
28
+
29
+ #### 是否提供Demo?CLI Demo及Web UI Demo?
30
+
31
+ `web_demo.py`和`cli_demo.py`分别提供了Web UI以及CLI的Demo。请查看README相关内容了解更多。
32
+
33
+ #### 我没有GPU,只用CPU运行CLI demo可以吗?
34
+
35
+ 可以的,运行`python cli_demo.py --cpu_only`命令即可将模型读取到CPU并使用CPU进行推理。
36
+
37
+ #### Qwen支持流式推理吗?
38
+
39
+ Qwen当前支持流式推理。见位于`modeling_qwen.py`的`chat_stream`函数。
40
+
41
+ #### 使用`chat_stream()`生成混乱的内容及乱码,为什么?
42
+
43
+ 这是由于模型生成过程中输出的部分token需要与后续token一起解码才能输出正常文本,单个token解码结果是无意义字符串,我们已经更新了tokenizer解码时的默认设置,避免这些字符串在生成结果中出现,如果仍有类似问题请更新模型至最新版本。
44
+
45
+ #### 模型的输出看起来与输入无关/没有遵循指令/看起来呆呆的
46
+
47
+ 请检查是否加载的是Qwen-7B-Chat模型进行推理,Qwen-7B模型是未经align的预训练基模型,不期望具备响应用户指令的能力。我们在模型最新版本已经对`chat`及`chat_stream`接口内进行了检查,避免您误将预训练模型作为SFT/Chat模型使用。
48
+
49
+ #### 是否有量化版本模型
50
+
51
+ 目前Qwen支持基于`bitsandbytes`的8-bit和4-bit的量化推理。后续我们将进一步更新提供更加高效的量化推理实现,并提供对应的量化模型。
52
+
53
+ #### 运行量化推理报错:`importlib.metadata.PackageNotFoundError: No package metadata was found for bitsandbytes`
54
+
55
+ 对于linux 用户,直接`pip install bitsandbytes`即可。对于windows用户,可以 运行`python -m pip install bitsandbytes --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui`。
56
+
57
+ #### 生成序列较长后速度显著变慢
58
+
59
+ 这一问题已经在最新版本中修复。请更新到最新代码。
60
+
61
+ #### 处理长序列时效果有问题
62
+
63
+ 请确认是否开启ntk。若要启用这些技巧,请将`config.json`里的`use_dynamic_ntk`和`use_logn_attn`设置为`true`。最新代码默认为`true`。
64
+ <br><br>
65
+
66
+
67
+ ## 微调
68
+
69
+ #### 当前是否支持SFT和RLHF?
70
+
71
+ 我们目前未提供SFT和RLHF代码。当前有多个外部项目已实现支持,如[FastChat](https://github.com/lm-sys/FastChat)、[Firefly](https://github.com/yangjianxin1/Firefly)、[LLaMA Efficient Tuning](https://github.com/hiyouga/LLaMA-Efficient-Tuning)等。我们会尽快更新这部分代码和说明。
72
+ <br><br>
73
+
74
+
75
+ ## Tokenizer
76
+
77
+ #### bos_id/eos_id/pad_id,这些token id不存在,为什么?
78
+
79
+ 在训练过程中,我们仅使用<|endoftext|>这一token作为sample/document之间的分隔符及padding位置占位符,你可以将bos_id, eos_id, pad_id均指向tokenizer.eod_id。请阅读我们关于tokenizer的文档,了解如何设置这些id。
80
+
LICENSE ADDED
@@ -0,0 +1,53 @@
1
+ Tongyi Qianwen LICENSE AGREEMENT
2
+
3
+ Tongyi Qianwen Release Date: August 3, 2023
4
+
5
+ By clicking to agree or by using or distributing any portion or element of the Tongyi Qianwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
6
+
7
+ 1. Definitions
8
+ a. This Tongyi Qianwen LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
9
+ b. "We"(or "Us") shall mean Alibaba Cloud.
10
+ c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
11
+ d. "Third Parties" shall mean individuals or legal entities that are not under common control with Us or You.
12
+ e. "Tongyi Qianwen" shall mean the large language models (including Qwen-7B model and Qwen-7B-Chat model), and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by Us.
13
+ f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Tongyi Qianwen and Documentation (and any portion thereof) made available under this Agreement.
14
+ g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
15
+ h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation,
16
+ and conversions to other media types.
17
+
18
+ 2. Grant of Rights
19
+ You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials.
20
+
21
+ 3. Redistribution
22
+ You may reproduce and distribute copies of the Materials or derivative works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
23
+ a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement;
24
+ b. You shall cause any modified files to carry prominent notices stating that You changed the files;
25
+ c. You shall retain in all copies of the Materials that You distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Tongyi Qianwen is licensed under the Tongyi Qianwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and
26
+ d. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such derivative works as a whole, provided Your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement.
27
+
28
+ 4. Restrictions
29
+ If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us. You cannot exercise your rights under this Agreement without our express authorization.
30
+
31
+ 5. Rules of use
32
+ a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials.
33
+ b. You can not use the Materials or any output therefrom to improve any other large language model (excluding Tongyi Qianwen or derivative works thereof).
34
+
35
+ 6. Intellectual Property
36
+ a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for Us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications.
37
+ b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of Us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials.
38
+ c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against Us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licences granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought.
39
+
40
+ 7. Disclaimer of Warranty and Limitation of Liability
41
+
42
+ a. We are not obligated to support, update, provide training for, or develop any further version of the Tongyi Qianwen Materials or to grant any license thereto.
43
+ b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM.
44
+ c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED.
45
+ d. You will defend, indemnify and hold harmless Us from and against any claim by any third party arising out of or related to your use or distribution of the Materials.
46
+
47
+ 8. Survival and Termination.
48
+ a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
49
+ b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 7 and 9 shall survive the termination of this Agreement.
50
+
51
+ 9. Governing Law and Jurisdiction.
52
+ a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
53
+ b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.
NOTICE ADDED
@@ -0,0 +1,52 @@
1
+ ------------- LICENSE FOR NVIDIA Megatron-LM code --------------
2
+
3
+ Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
4
+
5
+ Redistribution and use in source and binary forms, with or without
6
+ modification, are permitted provided that the following conditions
7
+ are met:
8
+ * Redistributions of source code must retain the above copyright
9
+ notice, this list of conditions and the following disclaimer.
10
+ * Redistributions in binary form must reproduce the above copyright
11
+ notice, this list of conditions and the following disclaimer in the
12
+ documentation and/or other materials provided with the distribution.
13
+ * Neither the name of NVIDIA CORPORATION nor the names of its
14
+ contributors may be used to endorse or promote products derived
15
+ from this software without specific prior written permission.
16
+
17
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
18
+ EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
19
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
20
+ PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
21
+ CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
22
+ EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
23
+ PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
24
+ PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
25
+ OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
26
+ (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
27
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
28
+
29
+
30
+ ------------- LICENSE FOR OpenAI tiktoken code --------------
31
+
32
+ MIT License
33
+
34
+ Copyright (c) 2022 OpenAI, Shantanu Jain
35
+
36
+ Permission is hereby granted, free of charge, to any person obtaining a copy
37
+ of this software and associated documentation files (the "Software"), to deal
38
+ in the Software without restriction, including without limitation the rights
39
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
40
+ copies of the Software, and to permit persons to whom the Software is
41
+ furnished to do so, subject to the following conditions:
42
+
43
+ The above copyright notice and this permission notice shall be included in all
44
+ copies or substantial portions of the Software.
45
+
46
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
47
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
48
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
49
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
50
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
51
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
52
+ SOFTWARE.
README.md CHANGED
@@ -1,12 +1,437 @@
  ---
- title: Qwen 7B Main
- emoji: 👀
- colorFrom: yellow
- colorTo: green
+ title: Qwen-7B-main
+ app_file: web_demo.py
  sdk: gradio
  sdk_version: 3.40.1
- app_file: app.py
- pinned: false
  ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
7
+ <br>
8
+
9
+ <p align="center">
10
+ <img src="assets/logo.jpg" width="400"/>
11
+ <p>
12
+ <br>
13
+
14
+ <p align="center">
15
+ Qwen-7B <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖 <a> | <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a>&nbsp | Qwen-7B-Chat <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖 <a>| <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a>&nbsp | &nbsp<a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>&nbsp | &nbsp<a href="https://github.com/QwenLM/Qwen-7B/blob/main/tech_memo.md">Report</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/9bjvspyu">Discord</a>
16
+ </p>
17
+ <br>
18
+
19
+ <p align="center">
20
+ <a href="README_CN.md">中文</a>&nbsp | &nbspEnglish&nbsp | &nbsp<a href="README_JA.md">日本語</a>
21
+ </p>
22
+ <br><br>
23
+
24
+ We opensource **Qwen-7B** and **Qwen-7B-Chat** on both **🤖 ModelScope** and **🤗 Hugging Face** (click the logos above to visit the repos with code and checkpoints). This repo includes a brief introduction to Qwen-7B, usage guidance, and a technical memo ([link](tech_memo.md)) that provides more information.
25
+
26
+ Qwen-7B is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. The features of the Qwen-7B series include:
27
+
28
+ 1. **Trained with high-quality pretraining data**. We have pretrained Qwen-7B on a self-constructed large-scale high-quality dataset of over 2.2 trillion tokens. The dataset includes plain texts and codes, and it covers a wide range of domains, including general domain data and professional domain data.
29
+ 2. **Strong performance**. In comparison with models of similar size, Qwen-7B outperforms the competitors on a series of benchmark datasets that evaluate natural language understanding, mathematics, coding, etc.
30
+ 3. **Better support of languages**. Our tokenizer, based on a large vocabulary of over 150K tokens, is more efficient than other tokenizers. It is friendly to many languages and makes it easier for users to further finetune Qwen-7B to extend its understanding of a particular language.
31
+ 4. **Support of 8K Context Length**. Both Qwen-7B and Qwen-7B-Chat support the context length of 8K, which allows inputs with long contexts.
32
+ 5. **Support of Plugins**. Qwen-7B-Chat is trained with plugin-related alignment data, and thus it is capable of using tools, including APIs, models, databases, etc., and of acting as an agent.
33
+
34
+ The following sections include information that you might find helpful. Specifically, we advise you to read the FAQ section before you open issues.
35
+
36
+ ## News
37
+
38
+ * 2023.8.3 We release both Qwen-7B and Qwen-7B-Chat on ModelScope and Hugging Face. We also provide a technical memo for more details about the model, including training details and model performance.
39
+
40
+ ## Performance
41
+
42
+ In general, Qwen-7B outperforms baseline models of similar size, and even larger models of around 13B parameters, on a series of benchmark datasets (e.g., MMLU, C-Eval, GSM8K, HumanEval, WMT22, CMMLU) that evaluate the models' capabilities in natural language understanding, mathematical problem solving, coding, etc. See the results below.
43
+
44
+ | Model | MMLU | C-Eval | GSM8K | HumanEval | WMT22 (en-zh) | CMMLU |
45
+ | :---------------- | :------------: | :------------: | :------------: | :------------: | :------------: |:------------: |
46
+ | LLaMA-7B | 35.1 | - | 11.0 | 10.5 | 8.7 | - |
47
+ | LLaMA 2-7B | 45.3 | - | 14.6 | 12.8 | 17.9 | - |
48
+ | Baichuan-7B | 42.3 | 42.8 | 9.7 | 9.2 | 26.6 | 44.4 |
49
+ | ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 9.2 | - | 48.8 |
50
+ | InternLM-7B | 51.0 | 52.8 | 31.2 | 10.4 | 14.8 | - |
51
+ | Baichuan-13B | 51.6 | 53.6 | 26.6 | 12.8 | 30.0 | 55.8 |
52
+ | LLaMA-13B | 46.9 | 35.5 | 17.8 | 15.8 | 12.0 | - |
53
+ | LLaMA 2-13B | 54.8 | - | 28.7 | 18.3 | 24.2 | - |
54
+ | ChatGLM2-12B | 56.2 | **61.6** | 40.9 | - | - | - |
55
+ | **Qwen-7B** | **56.7** | 59.6 | **51.6** | **24.4** | **30.6** | **58.8** |
56
+
57
+ <p align="center">
58
+ <img src="assets/performance.png" width="1000"/>
59
+ <p>
60
+ <br>
61
+
62
+ Additionally, according to the third-party evaluation of large language models conducted by [OpenCompass](https://opencompass.org.cn/leaderboard-llm), Qwen-7B and Qwen-7B-Chat are the top 7B-parameter models. This evaluation consists of a large number of public benchmarks for evaluating language understanding and generation, coding, mathematics, reasoning, etc.
63
+
64
+ For more experimental results (detailed model performance on more benchmark datasets) and details, please refer to our technical memo by clicking [here](tech_memo.md).
65
+
66
+ ## Requirements
67
+
68
+ * python 3.8 and above
69
+ * pytorch 1.12 and above, 2.0 and above are recommended
70
+ * CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)
71
+
72
+ ## Quickstart
73
+
74
+ Below, we provide simple examples to show how to use Qwen-7B with 🤖 ModelScope and 🤗 Transformers.
75
+
76
+ Before running the code, make sure you have set up the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries.
77
+
78
+ ```bash
79
+ pip install -r requirements.txt
80
+ ```
81
+
82
+ If your device supports fp16 or bf16, we recommend installing [flash-attention](https://github.com/Dao-AILab/flash-attention) for higher efficiency and lower memory usage. (**flash-attention is optional and the project can run normally without installing it**)
83
+
84
+ ```bash
85
+ git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
86
+ cd flash-attention && pip install .
87
+ # Below are optional. Installing them might be slow.
88
+ # pip install csrc/layer_norm
89
+ # pip install csrc/rotary
90
+ ```
91
+
92
+ Now you can start with ModelScope or Transformers.
93
+
94
+ #### 🤗 Transformers
95
+
96
+ To use Qwen-7B-Chat for inference, all you need to do is to input a few lines of code as demonstrated below. However, **please make sure that you are using the latest code.**
97
+
98
+ ```python
99
+ from transformers import AutoModelForCausalLM, AutoTokenizer
100
+ from transformers.generation import GenerationConfig
101
+
102
+ # Note: The default behavior now has injection attack prevention off.
103
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
104
+
105
+ # use bf16
106
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
107
+ # use fp16
108
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
109
+ # use cpu only
110
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
111
+ # use auto mode, automatically select precision based on the device.
112
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
113
+
114
+ # Specify hyperparameters for generation
115
+ model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
116
+
117
+ # 第一轮对话 1st dialogue turn
118
+ response, history = model.chat(tokenizer, "你好", history=None)
119
+ print(response)
120
+ # 你好!很高兴为你提供帮助。
121
+
122
+ # 第二轮对话 2nd dialogue turn
123
+ response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
124
+ print(response)
125
+ # 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
126
+ # 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
127
+ # 为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。
128
+ # 毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。
129
+ # 最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。
130
+ # 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。
131
+
132
+ # 第三轮对话 3rd dialogue turn
133
+ response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
134
+ print(response)
135
+ # 《奋斗创业:一个年轻人的成功之路》
136
+ ```
137
+
138
+ Running the Qwen-7B pretrained base model is also simple.
139
+
140
+ <details>
141
+ <summary>Running Qwen-7B</summary>
142
+
143
+ ```python
144
+ from transformers import AutoModelForCausalLM, AutoTokenizer
145
+ from transformers.generation import GenerationConfig
146
+
147
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
148
+ # use bf16
149
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
150
+ # use fp16
151
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
152
+ # use cpu only
153
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
154
+ # use auto mode, automatically select precision based on the device.
155
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval()
156
+
157
+ # Specify hyperparameters for generation
158
+ model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
159
+
160
+ inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt')
161
+ inputs = inputs.to(model.device)
162
+ pred = model.generate(**inputs)
163
+ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
164
+ # 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...
165
+ ```
166
+
167
+ </details>
168
+
169
+ #### 🤖 ModelScope
170
+
171
+ ModelScope is an opensource platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model service to AI developers. Similarly, you can run the models with ModelScope as shown below:
172
+
173
+ ```python
174
+ import os
175
+ from modelscope.pipelines import pipeline
176
+ from modelscope.utils.constant import Tasks
177
+ from modelscope import snapshot_download
178
+
179
+ model_id = 'QWen/qwen-7b-chat'
180
+ revision = 'v1.0.0'
181
+
182
+ model_dir = snapshot_download(model_id, revision)
183
+
184
+ pipe = pipeline(
185
+ task=Tasks.chat, model=model_dir, device_map='auto')
186
+ history = None
187
+
188
+ text = '浙江的省会在哪里?'
189
+ results = pipe(text, history=history)
190
+ response, history = results['response'], results['history']
191
+ print(f'Response: {response}')
192
+ text = '它有什么好玩的地方呢?'
193
+ results = pipe(text, history=history)
194
+ response, history = results['response'], results['history']
195
+ print(f'Response: {response}')
196
+ ```
197
+
198
+ ## Tokenizer
199
+
200
+ Our tokenizer, based on tiktoken, is different from other tokenizers, e.g., the sentencepiece tokenizer. You need to pay attention to special tokens, especially in finetuning. For more detailed information on the tokenizer and related use in fine-tuning, please refer to the [documentation](tokenization_note.md).
201
+
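As a quick, hedged illustration (not a substitute for the documentation), the snippet below inspects the vocabulary size and the `<|endoftext|>` id, and shows that decoding a single token may yield an incomplete byte sequence:

```python
# Hedged illustration of the tiktoken-based tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

print(len(tokenizer))    # vocabulary of over 150K entries
print(tokenizer.eod_id)  # id of <|endoftext|>, the separator/padding token

ids = tokenizer("通义千问使用基于tiktoken的分词器。")["input_ids"]
print(tokenizer.decode(ids))      # full round trip
print(tokenizer.decode(ids[:1]))  # a single token may decode to an incomplete string
```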
202
+ ## Quantization
203
+
204
+ We provide examples to show how to load models in `NF4` and `Int8`. For starters, make sure you have installed `bitsandbytes`. Note that the requirements for `bitsandbytes` are:
205
+
206
+ ```
207
+ **Requirements** Python >=3.8. Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0.
208
+ ```
209
+
210
+ Then run the following command to install `bitsandbytes`:
211
+
212
+ ```
213
+ pip install bitsandbytes
214
+ ```
215
+
216
+ Windows users need a different build of `bitsandbytes`; one option is [bitsandbytes-windows-webui](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels).
217
+
218
+ Then you only need to add your quantization configuration to `AutoModelForCausalLM.from_pretrained`. See the example below:
219
+
220
+ ```python
221
+ import torch
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
222
+
223
+ # quantization configuration for NF4 (4 bits)
224
+ quantization_config = BitsAndBytesConfig(
225
+ load_in_4bit=True,
226
+ bnb_4bit_quant_type='nf4',
227
+ bnb_4bit_compute_dtype=torch.bfloat16
228
+ )
229
+
230
+ # quantization configuration for Int8 (8 bits)
231
+ quantization_config = BitsAndBytesConfig(load_in_8bit=True)
232
+
233
+ model = AutoModelForCausalLM.from_pretrained(
234
+ args.checkpoint_path,
235
+ device_map="cuda:0",
236
+ quantization_config=quantization_config,
237
+ max_memory=max_memory,
238
+ trust_remote_code=True,
239
+ ).eval()
240
+ ```
241
+
242
+ With this method, you can load Qwen-7B in `NF4` or `Int8`, which reduces memory usage. We provide related statistics of model performance below. We find that quantization degrades the effectiveness slightly but significantly reduces memory costs.
243
+
244
+ | Precision | MMLU | GPU Memory for Loading Model |
245
+ | ----------- | :------: | :---------------------------: |
246
+ | BF16 | 56.7 | 16.38G |
247
+ | Int8 | 52.8 | 10.44G |
248
+ | NF4 | 48.9 | 7.79G |
249
+
250
+ Note: The GPU memory usage profiling in the above table is performed on single A100-SXM4-80G GPU, PyTorch 2.0.1 and CUDA 11.8, with flash attention used.
251
+
252
+ ## Inference Efficiency
253
+
254
+ ### Inference Speed
255
+
256
+ We measured the average inference speed of generating 2K tokens under BF16 precision and Int8 or NF4 quantization levels, respectively.
257
+
258
+ | Quantization Level | Inference Speed with flash_attn (tokens/s) | Inference Speed w/o flash_attn (tokens/s) |
259
+ | ---------------------- | :----------------------------------------: | :---------------------------------------: |
260
+ | BF16 (no quantization) | 30.06 | 27.55 |
261
+ | Int8 (bnb) | 7.94 | 7.86 |
262
+ | NF4 (bnb) | 21.43 | 20.37 |
263
+
264
+ In detail, the profiling setting is generating 2048 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the 2048 generated tokens.
265
+
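The reported numbers come from the profiling script linked at the end of this section; as a rough, unofficial illustration, the average decoding speed can be estimated along these lines:

```python
# Rough, unofficial tokens/s estimate (not the linked profiling script).
# Assumes a CUDA GPU; generates up to 2048 new tokens from a short prompt.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True
).eval()

inputs = tokenizer("你好", return_tensors="pt").to(model.device)

torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, do_sample=False, max_new_tokens=2048)
torch.cuda.synchronize()
elapsed = time.time() - start

generated = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tokens/s")
```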
266
+ ### GPU Memory Usage
267
+
268
+ We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) under BF16 or Int8/NF4 quantization levels, respectively. The results are shown below.
269
+
270
+ When using flash attention, the memory usage is:
271
+
272
+ | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
273
+ | ------------------ | :---------------------------------: | :-----------------------------------: |
274
+ | BF16 | 18.11GB | 23.52GB |
275
+ | Int8 | 12.17GB | 17.60GB |
276
+ | NF4 | 9.52GB | 14.93GB |
277
+
278
+ When not using flash attention, the memory usage is:
279
+
280
+ | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
281
+ | ------------------ | :---------------------------------: | :-----------------------------------: |
282
+ | BF16 | 18.11GB | 24.40GB |
283
+ | Int8 | 12.18GB | 18.47GB |
284
+ | NF4 | 9.52GB | 15.81GB |
285
+
286
+ The above speed and memory profiling are conducted using [this script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
287
+
288
+ ## Demo
289
+
290
+
291
+ ### Web UI
292
+
293
+ We provide code for users to build a web UI demo (thanks to @wysaid). Before you start, make sure you install the following packages:
294
+
295
+ ```
296
+ pip install -r requirements_web_demo.txt
297
+ ```
298
+
299
+ Then run the command below and click on the generated link:
300
+
301
+ ```
302
+ python web_demo.py
303
+ ```
304
+
305
+ <p align="center">
306
+ <br>
307
+ <img src="assets/web_demo.gif" width="600" />
308
+ <br>
309
+ <p>
310
+
311
+ ### CLI Demo
312
+
313
+ We provide a CLI demo example in `cli_demo.py`, which supports streaming output for generation. Users can interact with Qwen-7B-Chat by inputting prompts, and the model returns outputs in streaming mode. Run the command below:
314
+
315
+ ```
316
+ python cli_demo.py
317
+ ```
318
+
319
+ <p align="center">
320
+ <br>
321
+ <img src="assets/cli_demo.gif" width="600" />
322
+ <br>
323
+ <p>
324
+
325
+ ## API
326
+
327
+ We provide methods to deploy local API based on OpenAI API (thanks to @hanpenggit). Before you start, install the required packages:
328
+
329
+ ```bash
330
+ pip install fastapi uvicorn openai pydantic sse_starlette
331
+ ```
332
+
333
+ Then run the command to deploy your API:
334
+
335
+ ```bash
336
+ python openai_api.py
337
+ ```
338
+
339
+ You can change your arguments, e.g., `-c` for checkpoint name or path, `--cpu-only` for CPU deployment, etc. If you meet problems launching your API deployment, updating the packages to the latest version can probably solve them.
340
+
341
+ Using the API is also simple. See the example below:
342
+
343
+ ```python
344
+ import openai
345
+ openai.api_base = "http://localhost:8000/v1"
346
+ openai.api_key = "none"
347
+
348
+ # create a request activating streaming response
349
+ for chunk in openai.ChatCompletion.create(
350
+ model="Qwen-7B",
351
+ messages=[
352
+ {"role": "user", "content": "你好"}
353
+ ],
354
+ stream=True
355
+ ):
356
+ if hasattr(chunk.choices[0].delta, "content"):
357
+ print(chunk.choices[0].delta.content, end="", flush=True)
358
+
359
+ # create a request not activating streaming response
360
+ response = openai.ChatCompletion.create(
361
+ model="Qwen-7B",
362
+ messages=[
363
+ {"role": "user", "content": "你好"}
364
+ ],
365
+ stream=False
366
+ )
367
+ print(response.choices[0].message.content)
368
+ ```
369
+
370
+ <p align="center">
371
+ <br>
372
+ <img src="assets/openai_api.gif" width="600" />
373
+ <br>
374
+ <p>
375
+
376
+ ## Tool Usage
377
+
378
+ Qwen-7B-Chat is specifically optimized for tool usage, including APIs, databases, models, etc., so that users can build their own Qwen-7B-based LangChain applications, agents, and code interpreters. In our evaluation [benchmark](eval/EVALUATION.md) for assessing tool usage capabilities, we find that Qwen-7B reaches stable performance.
379
+
380
+ | Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
381
+ |:------------|:----------------------:|:----------------------:|:----------------------:|
382
+ | GPT-4 | 95% | **0.90** | 15% |
383
+ | GPT-3.5 | 85% | 0.88 | 75% |
384
+ | **Qwen-7B** | **99%** | 0.89 | **9.7%** |
385
+
386
+ For how to write and use prompts for ReAct Prompting, please refer to [the ReAct examples](examples/react_prompt.md). The use of tools can enable the model to better perform tasks.
387
+
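For orientation only, the sketch below shows the general shape of a ReAct loop: list the available tools in the prompt, ask the model to emit Thought/Action/Action Input lines, and parse the action. The exact prompt template Qwen-7B-Chat was aligned with is the one in `examples/react_prompt.md`; the `search` tool and the canned model output here are hypothetical.

```python
# Hedged ReAct sketch; see examples/react_prompt.md for the real template.

def build_react_prompt(question, tools):
    tool_lines = "\n".join(
        f"{t['name']}: {t['description']} Parameters: {t['parameters']}" for t in tools
    )
    return (
        "Answer the following questions as best you can. You have access to these tools:\n"
        f"{tool_lines}\n\n"
        "Use the format: Question / Thought / Action / Action Input / Observation / ... / Final Answer\n\n"
        f"Question: {question}\n"
    )

def parse_action(model_output):
    """Extract the tool name and tool input from an 'Action: ... Action Input: ...' block."""
    action = model_output.split("Action:")[1].split("\n")[0].strip()
    action_input = model_output.split("Action Input:")[1].split("\n")[0].strip()
    return action, action_input

tools = [{"name": "search", "description": "Search the web.", "parameters": '{"query": "..."}'}]
prompt = build_react_prompt("今天杭州的天气怎么样?", tools)

# In practice this text would come from model.chat(tokenizer, prompt, history=None);
# a canned response is used so the sketch runs standalone.
fake_output = 'Thought: 需要查询天气。\nAction: search\nAction Input: {"query": "杭州 天气"}\n'
print(parse_action(fake_output))  # -> ('search', '{"query": "杭州 天气"}')
```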
388
+ Additionally, we provide experimental results to show its capability of acting as an agent. See [Hugging Face Agent](https://huggingface.co/docs/transformers/transformers_agents) for more information. Its performance on the run-mode benchmark provided by Hugging Face is as follows:
389
+
390
+ | Model | Tool Selection↑ | Tool Used↑ | Code↑ |
391
+ |:---------------|:---------------:|:-----------:|:---------:|
392
+ |GPT-4 | **100** | **100** | **97.41** |
393
+ |GPT-3.5 | 95.37 | 96.30 | 87.04 |
394
+ |StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
395
+ | **Qwen-7B** | 90.74 | 92.59 | 74.07 |
396
+
397
+ ## Long-Context Understanding
398
+
399
+ To extend the context length and break the bottleneck of training sequence length, we introduce several techniques, including NTK-aware interpolation, window attention, and LogN attention scaling, to extend the context length to over 8K tokens. We conduct language modeling experiments on the arXiv dataset with the PPL evaluation and find that Qwen-7B can reach outstanding performance in the scenario of long context. Results are demonstrated below:
400
+
401
+ <table>
402
+ <tr>
403
+ <th rowspan="2">Model</th><th colspan="5" align="center">Sequence Length</th>
404
+ </tr>
405
+ <tr>
406
+ <th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th><th align="center">16384</th>
407
+ </tr>
408
+ <tr>
409
+ <td>Qwen-7B</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center">39.35</td><td align="center">469.81</td><td align="center">2645.09</td>
410
+ </tr>
411
+ <tr>
412
+ <td>+ dynamic_ntk</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center">3.59</td><td align="center">3.66</td><td align="center">5.71</td>
413
+ </tr>
414
+ <tr>
415
+ <td>+ dynamic_ntk + logn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center"><b>3.58</b></td><td align="center">3.56</td><td align="center">4.62</td>
416
+ </tr>
417
+ <tr>
418
+ <td>+ dynamic_ntk + logn + window_attn</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center"><b>3.58</b></td><td align="center"><b>3.49</b></td><td align="center"><b>4.32</b></td>
419
+ </tr>
420
+ </table>
421
+
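These behaviors are switched on through flags in the model's `config.json`. Below is a minimal sketch, assuming the `use_dynamic_ntk` and `use_logn_attn` flags referenced in the FAQ (both `true` by default); check `config.json` for the authoritative list, including the window-attention settings.

```python
# Hedged sketch: inspect/override the long-context flags before loading weights.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
print(config.use_dynamic_ntk, config.use_logn_attn)  # expected: True True

config.use_dynamic_ntk = True   # NTK-aware interpolation at long context
config.use_logn_attn = True     # LogN attention scaling

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B", config=config, device_map="auto", trust_remote_code=True
).eval()
```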
422
+ ## Reproduction
423
+
424
+ To reproduce the model performance on benchmark datasets, we provide scripts. Check [eval/EVALUATION.md](eval/EVALUATION.md) for more information. Note that the reproduction may lead to slight differences from our reported results.
425
+
426
+ ## FAQ
427
+
428
+ If you meet problems, please refer to the [FAQ](FAQ.md) and the existing issues first to search for a solution before you open a new issue.
429
+
430
+ ## License Agreement
431
+
432
+ Researchers and developers are free to use the codes and model weights of both Qwen-7B and Qwen-7B-Chat. We also allow their commercial use. Check our license at [LICENSE](LICENSE) for more details. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.
433
+
434
+ ## Contact Us
435
+
436
+ If you are interested in leaving a message for either our research team or product team, feel free to send an email to qianwen_opensource@alibabacloud.com.
437
 
 
README_CN.md ADDED
@@ -0,0 +1,436 @@
1
+ <br>
2
+
3
+ <p align="center">
4
+ <img src="assets/logo.jpg" width="400"/>
5
+ <p>
6
+ <br>
7
+
8
+ <p align="center">
9
+ Qwen-7B <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖 <a> | <a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a>&nbsp | Qwen-7B-Chat <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖 <a>| <a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a>&nbsp | &nbsp<a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>&nbsp | &nbsp<a href="https://github.com/QwenLM/Qwen-7B/blob/main/tech_memo.md">Report</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/9bjvspyu">Discord</a>
10
+ </p>
11
+ <br>
12
+
13
+ <p align="center">
14
+ 中文</a>&nbsp | &nbsp<a href="README.md">English</a>&nbsp | &nbsp<a href="README_JA.md">日本語</a>
15
+ </p>
16
+ <br><br>
17
+
18
+ 我们在🤖 **ModelScope**以及🤗 **Hugging Face**均开源了**Qwen-7B**系列模型。请在本文档顶部点击相关链接查看仓库信息。本仓库主要包括Qwen-7B的简介、使用指南、技术备忘等内容。想了解更多关于模型的信息,请点击[链接](tech_memo.md)查看我们的技术备忘录。
19
+
20
+ 通义千问-7B(Qwen-7B) 是阿里云研发的通义千问大模型系列的70亿参数规模的模型。Qwen-7B是基于Transformer的大语言模型, 在超大规模的预训练数据上进行训练得到。预训练数据类型多样,覆盖广泛,包括大量网络文本、专业书籍、代码等。同时,在Qwen-7B的基础上,我们使用对齐机制打造了基于大语言模型的AI助手Qwen-7B-Chat。Qwen-7B系列模型的特点包括:
21
+
22
+ 1. **大规模高质量预训练数据**:我们使用了超过2.2万亿token的自建大规模预训练数据集进行语言模型的预训练。数据集包括文本和代码等多种数据类型,覆盖通用领域和专业领域。
23
+ 2. **优秀的模型性能**:相比同规模的开源模型,Qwen-7B在多个评测数据集上具有显著优势,甚至超出12-13B等更大规模的模型。评测评估的能力范围包括自然语言理解与生成、数学运算解题、代码生成等。
24
+ 3. **更好地支持多语言**:基于更大词表的分词器在分词上更高效,同时它对其他语言表现更加友好。用户可以在Qwen-7B的基础上更方便地训练特定语言的7B语言模型。
25
+ 4. **8K的上下文长度**:Qwen-7B及Qwen-7B-Chat均能支持8K的上下文长度, 允许用户输入更长的prompt。
26
+ 5. **支持插件调用**:Qwen-7B-Chat针对插件调用相关的对齐数据做了特定优化,当前模型能有效调用插件以及升级为Agent。
27
+
28
+ 以下章节的信息可能对你有帮助,建议阅读。如果你在使用过程遇到问题,建议先查询FAQ,如仍无法解决再提交issue。
29
+
30
+ ## 新闻
31
+
32
+ * 2023年8月3日 在魔搭社区(ModelScope)和Hugging Face同步推出Qwen-7B和Qwen-7B-Chat模型。同时,我们发布了技术备忘录,介绍了相关的训练细节和模型表现。
33
+
34
+ ## 评测表现
35
+
36
+ Qwen-7B在多个全面评估自然语言理解与生成、数学运算解题、代码生成等能力的评测数据集上,包括MMLU、C-Eval、GSM8K、HumanEval、WMT22、CMMLU等,均超出了同规模大语言模型的表现,甚至超出了如12-13B参数等更大规模的语言模型。
37
+
38
+ | Model | MMLU | C-Eval | GSM8K | HumanEval | WMT22 (en-zh) | CMMLU |
39
+ | :---------------- | :------------: | :------------: | :------------: | :------------: | :------------: |:------------: |
40
+ | LLaMA-7B | 35.1 | - | 11.0 | 10.5 | 8.7 | - |
41
+ | LLaMA 2-7B | 45.3 | - | 14.6 | 12.8 | 17.9 | - |
42
+ | Baichuan-7B | 42.3 | 42.8 | 9.7 | 9.2 | 26.6 | 44.4 |
43
+ | ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 9.2 | - | 48.8 |
44
+ | InternLM-7B | 51.0 | 52.8 | 31.2 | 10.4 | 14.8 | - |
45
+ | Baichuan-13B | 51.6 | 53.6 | 26.6 | 12.8 | 30.0 | 55.8 |
46
+ | LLaMA-13B | 46.9 | 35.5 | 17.8 | 15.8 | 12.0 | - |
47
+ | LLaMA 2-13B | 54.8 | - | 28.7 | 18.3 | 24.2 | - |
48
+ | ChatGLM2-12B | 56.2 | **61.6** | 40.9 | - | - | - |
49
+ | **Qwen-7B** | **56.7** | 59.6 | **51.6** | **24.4** | **30.6** | **58.8** |
50
+
51
+ <p align="center">
52
+ <img src="assets/performance.png" width="1000"/>
53
+ <p>
54
+ <br>
55
+
56
+ 此外,根据[OpenCompass](https://opencompass.org.cn/leaderboard-llm)进行的大型语言模型第三方评估,Qwen-7B 和 Qwen-7B-Chat 是其中表现最优的7B参数模型。该评估由大量公开基准组成,用于评估语言理解和生成、代码生成、数学、推理等。
57
+
58
+ 更多的实验结果和细节请查看我们的技术备忘录。点击[这里](tech_memo.md)。
59
+
60
+ ## 要求
61
+
62
+ * python 3.8及以上版本
63
+ * pytorch 1.12及以上版本,推荐2.0及以上版本
64
+ * 建议使用CUDA 11.4及以上(GPU用户、flash-attention用户等需考虑此选项)
65
+
66
+ ## 快速使用
67
+
68
+ 我们提供简单的示例来说明如何利用🤖 ModelScope和🤗 Transformers快速使用Qwen-7B和Qwen-7B-Chat。
69
+
70
+ 在开始前,请确保你已经配置好环境并安装好相关的代码包。最重要的是,确保你满足上述要求,然后安装相关的依赖库。
71
+
72
+ ```bash
73
+ pip install -r requirements.txt
74
+ ```
75
+
76
+ 如果你的显卡支持fp16或bf16精度,我们还推荐安装[flash-attention](https://github.com/Dao-AILab/flash-attention)来提高你的运行效率以及降低显存占用。(**flash-attention只是可选项,不安装也可正常运行该项目**)
77
+
78
+ ```bash
79
+ git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
80
+ cd flash-attention && pip install .
81
+ # 下方安装可选,安装可能比较缓慢。
82
+ # pip install csrc/layer_norm
83
+ # pip install csrc/rotary
84
+ ```
85
+
86
+ 接下来你可以开始使用Transformers或者ModelScope来使用我们的模型。
87
+
88
+ #### 🤗 Transformers
89
+
90
+ 如希望使用Qwen-7B-chat进行推理,所需要写的只是如下所示的数行代码。**请确保你使用的是最新代码。**
91
+
92
+ ```python
93
+ from transformers import AutoModelForCausalLM, AutoTokenizer
94
+ from transformers.generation import GenerationConfig
95
+
96
+ # 请注意:分词器默认行为已更改为默认关闭特殊token攻击防护。
97
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
98
+
99
+ # 打开bf16精度,A100、H100、RTX3060、RTX3070等显卡建议启用以节省显存
100
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
101
+ # 打开fp16精度,V100、P100、T4等显卡建议启用以节省显存
102
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
103
+ # 使用CPU进行推理,需要约32GB内存
104
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
105
+ # 默认使用自动模式,根据设备自动选择精度
106
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
107
+
108
+ # 可指定不同的生成长度、top_p等相关超参
109
+ model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
110
+
111
+ # 第一轮对话 1st dialogue turn
112
+ response, history = model.chat(tokenizer, "你好", history=None)
113
+ print(response)
114
+ # 你好!很高兴为你提供帮助。
115
+
116
+ # 第二轮对话 2nd dialogue turn
117
+ response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
118
+ print(response)
119
+ # 这是一个关于一个年轻人奋斗创业最终取得成功的故事。
120
+ # 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。
121
+ # 为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。
122
+ # 毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。
123
+ # 最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。
124
+ # 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。
125
+
126
+ # 第三轮对话 3rd dialogue turn
127
+ response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
128
+ print(response)
129
+ # 《奋斗创业:一个年轻人的成功之路》
130
+ ```
131
+
+ Running Qwen-7B is equally simple.
+
+ <details>
+ <summary>Running Qwen-7B</summary>
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from transformers.generation import GenerationConfig
+
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
+
+ # Use bf16 precision (recommended on GPUs such as A100, H100, RTX3060, RTX3070 to save memory)
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
+ # Use fp16 precision (recommended on GPUs such as V100, P100, T4 to save memory)
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
+ # Run inference on CPU (requires about 32GB of RAM)
+ # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
+ # Default: automatic mode, which selects the precision based on the device
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval()
+
+ # You can specify hyperparameters such as the generation length and top_p here
+ model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
+
+ inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt')
+ inputs = inputs.to(model.device)
+ pred = model.generate(**inputs)
+ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
+ # 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...
+ ```
+
+ </details>
+
+ #### 🤖 ModelScope
+
+ ModelScope is an open-source Model-as-a-Service (MaaS) platform that provides AI developers with flexible, easy-to-use, and low-cost one-stop model services. Using Qwen through ModelScope is just as simple:
+
+ ```python
+ from modelscope.pipelines import pipeline
+ from modelscope.utils.constant import Tasks
+ from modelscope import snapshot_download
+
+ model_id = 'QWen/qwen-7b-chat'
+ revision = 'v1.0.0'
+
+ model_dir = snapshot_download(model_id, revision)
+
+ pipe = pipeline(
+     task=Tasks.chat, model=model_dir, device_map='auto')
+ history = None
+
+ text = '浙江的省会在哪里?'
+ results = pipe(text, history=history)
+ response, history = results['response'], results['history']
+ print(f'Response: {response}')
+ text = '它有什么好玩的地方呢?'
+ results = pipe(text, history=history)
+ response, history = results['response'], results['history']
+ print(f'Response: {response}')
+ ```
+
+ ## Tokenization
+
+ > Note: as a term, "tokenization" has no generally agreed-upon Chinese equivalent; this document uses the English word for clarity.
+
+ Our tokenizer, which is based on tiktoken, differs from other tokenizers such as the sentencepiece tokenizer. Special tokens in particular need careful handling, especially during fine-tuning. For more information on the tokenizer and on how to work with it during fine-tuning, please refer to the [documentation](tokenization_note_zh.md).
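+
+ As a quick illustration, the following is a minimal sketch of ordinary round-trip usage through the standard `transformers` tokenizer interface (the example string is our own; consult the documentation above for the authoritative treatment of special tokens before fine-tuning):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
+
+ # Ordinary text: encode to token ids and decode back.
+ ids = tokenizer.encode("你好,Qwen!")
+ print(ids)
+ print(tokenizer.decode(ids))
+ # The vocabulary is a byte-level BPE built with tiktoken, so arbitrary text can be
+ # encoded; how strings that look like special tokens are handled is described in
+ # tokenization_note_zh.md.
+ ```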
+
+ ## Quantization
+
+ If you want to use quantized models at lower precision, such as 4-bit and 8-bit models, here is a simple example of how to get started quickly. Before you begin, make sure you have installed `bitsandbytes`. Note its installation requirements:
+
+ ```
+ **Requirements** Python >=3.8. Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0.
+ ```
+
+ Then install `bitsandbytes` with:
+
+ ```
+ pip install bitsandbytes
+ ```
+
+ Windows users need a dedicated build of `bitsandbytes`; one option is [bitsandbytes-windows-webui](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels).
+
+ To load a quantized model, simply pass a quantization configuration to `AutoModelForCausalLM.from_pretrained`, as shown below:
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+ # quantization configuration for NF4 (4 bits)
+ quantization_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type='nf4',
+     bnb_4bit_compute_dtype=torch.bfloat16
+ )
+
+ # quantization configuration for Int8 (8 bits); uncomment to use it instead
+ # quantization_config = BitsAndBytesConfig(load_in_8bit=True)
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "Qwen/Qwen-7B-Chat",  # or the path to your local checkpoint
+     device_map="cuda:0",
+     quantization_config=quantization_config,
+     trust_remote_code=True,
+ ).eval()
+ ```
+
+ With the configuration above, the model can be loaded in `NF4` or `Int8` precision, which saves GPU memory. We also provide the corresponding benchmark results: although quantization costs some accuracy, it greatly reduces memory usage.
+
+ | Precision | MMLU | GPU Memory for Loading Model |
+ | --------- | :--: | :--------------------------: |
+ | BF16      | 56.7 |           16.38GB            |
+ | Int8      | 52.8 |           10.44GB            |
+ | NF4       | 48.9 |            7.79GB            |
+
+ Note: the memory figures above were profiled on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8, with flash attention enabled.
+
+ ## Inference Performance
+
+ ### Inference Speed
+
+ We measured the average inference speed of generating 2K tokens under BF16 precision and under quantization. The results are as follows:
+
+ | Quantization | Speed with flash_attn (tokens/s) | Speed without flash_attn (tokens/s) |
+ | ------------ | :------------------------------: | :---------------------------------: |
+ | BF16 (no quantization) | 30.06 | 27.55 |
+ | Int8 (bnb)   | 7.94  | 7.86  |
+ | NF4 (bnb)    | 21.43 | 20.37 |
+
+ In detail, the profiling uses an input context of length 1 and a generation length of 2048; the hardware is a single A100-SXM4-80G GPU, with PyTorch 2.0.1 and CUDA 11.8; we report the average speed over generating the 2048-token sequence.
+
+ ### GPU Memory Usage
+
+ Under BF16 precision and the different quantization levels, we profiled the peak GPU memory usage for encoding a 2048-token sequence (and generating one token) and for generating 8192 tokens (with one token as context). The results are as follows.
+
+ With flash attention enabled:
+
+ | Quantization | Peak memory for encoding 2048 tokens | Peak memory for generating 8192 tokens |
+ | --- | :---: | :---: |
+ | BF16 | 18.11GB | 23.52GB |
+ | Int8 | 12.17GB | 17.60GB |
+ | NF4  |  9.52GB | 14.93GB |
+
+ With flash attention disabled:
+
+ | Quantization | Peak memory for encoding 2048 tokens | Peak memory for generating 8192 tokens |
+ | --- | :---: | :---: |
+ | BF16 | 18.11GB | 24.40GB |
+ | Int8 | 12.18GB | 18.47GB |
+ | NF4  |  9.52GB | 15.81GB |
+
+ Both the speed and the memory figures above can be reproduced with this [profiling script](https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py).
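+
+ If you only need a rough number for your own setup, the following is a minimal sketch of how such a measurement can be taken. It is our own simplification, not the official script, and may differ from the official methodology (e.g. in how the context is constructed):
+
+ ```python
+ import time
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     "Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True
+ ).eval()
+
+ # A very short context, in the spirit of the context-length-1 setting above.
+ inputs = tokenizer("你好", return_tensors="pt").to(model.device)
+
+ torch.cuda.reset_peak_memory_stats()
+ start = time.time()
+ out = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
+ elapsed = time.time() - start
+
+ # Count the tokens actually generated (generation may stop early at EOS).
+ new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
+ print(f"speed: {new_tokens / elapsed:.2f} tokens/s")
+ print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024 ** 3:.2f} GB")
+ ```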
+
+ ## Demo
+
+ ### Web UI
+
+ We provide a web UI demo (thanks to @wysaid for the support). Before you start, make sure the following dependencies are installed:
+
+ ```
+ pip install -r requirements_web_demo.txt
+ ```
+
+ Then run the command below and click on the generated link:
+
+ ```
+ python web_demo.py
+ ```
+
+ <p align="center">
+ <br>
+ <img src="assets/web_demo.gif" width="600" />
+ <br>
+ </p>
+
+
+ ### CLI Demo
+
+ We provide a simple interactive demo; see `cli_demo.py`. The model supports streaming output: you can chat with Qwen-7B-Chat by typing prompts, and the model streams its responses back. Run the command below:
+
+ ```
+ python cli_demo.py
+ ```
+
+ <p align="center">
+ <br>
+ <img src="assets/cli_demo.gif" width="600" />
+ <br>
+ </p>
+
+ ## API
+
+ We provide a way to deploy a local API in the OpenAI API format (thanks to @hanpenggit). Before you start, install the required packages:
+
+ ```bash
+ pip install fastapi uvicorn openai pydantic sse_starlette
+ ```
+
+ Then run the following command to deploy your local API:
+
+ ```bash
+ python openai_api.py
+ ```
+
+ You can also change the arguments, e.g. `-c` to set the model name or path, or `--cpu-only` to run on CPU, and so on. If you run into problems when deploying, upgrading the packages above usually resolves most of them.
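+
+ For example (the checkpoint path below is only a placeholder):
+
+ ```bash
+ # Serve a local checkpoint and run on CPU only.
+ python openai_api.py -c /path/to/Qwen-7B-Chat --cpu-only
+ ```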
+
+ Using the API is just as simple. See the example below:
+
+ ```python
+ import openai
+ openai.api_base = "http://localhost:8000/v1"
+ openai.api_key = "none"
+
+ # Request with streaming responses
+ for chunk in openai.ChatCompletion.create(
+     model="Qwen-7B",
+     messages=[
+         {"role": "user", "content": "你好"}
+     ],
+     stream=True
+ ):
+     if hasattr(chunk.choices[0].delta, "content"):
+         print(chunk.choices[0].delta.content, end="", flush=True)
+
+ # Request without streaming
+ response = openai.ChatCompletion.create(
+     model="Qwen-7B",
+     messages=[
+         {"role": "user", "content": "你好"}
+     ],
+     stream=False
+ )
+ print(response.choices[0].message.content)
+ ```
+
+ <p align="center">
+ <br>
+ <img src="assets/openai_api.gif" width="600" />
+ <br>
+ </p>
+
+ ## Tool Usage
+
+ Qwen-7B-Chat has been optimized for calling tools such as APIs, databases, and other models, so that users can build LangChain applications, agents, and even code interpreters on top of Qwen-7B. On our open-source [evaluation dataset](eval/EVALUATION.md) for tool usage, Qwen-7B-Chat achieves consistently strong performance.
+
+ | Model       | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
+ |:------------|:----------------------:|:---------------------:|:---------------------:|
+ | GPT-4       | 95%                    | **0.90**              | 15%                   |
+ | GPT-3.5     | 85%                    | 0.88                  | 75%                   |
+ | **Qwen-7B** | **99%**                | 0.89                  | **9.7%**              |
+
+ For how to write and use prompts based on ReAct Prompting, please refer to [the ReAct examples](examples/react_prompt.md).
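+
+ To give a flavor of the approach, below is a heavily abridged ReAct-style prompt sketch. The tool name and parameter schema are invented for illustration only; the exact template, tool descriptions, and stop words used for Qwen are specified in the linked examples:
+
+ ```python
+ # A made-up tool description in the usual "name: description + parameters" style.
+ TOOL_DESC = 'search: useful for looking things up on the web. Parameters: {"query": "the search query"}'
+
+ prompt = f"""Answer the following questions as best you can. You have access to the following tools:
+
+ {TOOL_DESC}
+
+ Use the following format:
+
+ Question: the input question you must answer
+ Thought: you should always think about what to do
+ Action: the action to take, should be one of [search]
+ Action Input: the input to the action
+ Observation: the result of the action
+ ... (this Thought/Action/Action Input/Observation can repeat)
+ Thought: I now know the final answer
+ Final Answer: the final answer to the original input question
+
+ Question: 今天上海的天气怎么样?"""
+
+ # The model's completion is then parsed: whenever an "Action:" / "Action Input:" pair
+ # appears, the named tool is called and its result is appended back to the prompt as
+ # an "Observation:" before generation continues.
+ ```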
+
+ In addition, we provide experimental results demonstrating the model's capability as a HuggingFace Agent. For more information, please read the [documentation](https://huggingface.co/docs/transformers/transformers_agents). On the evaluation dataset provided by Hugging Face, the model performs as follows:
+
+ | Model           | Tool Selection↑ | Tool Used↑  | Code↑     |
+ |:----------------|:---------------:|:-----------:|:---------:|
+ | GPT-4           | **100**         | **100**     | **97.41** |
+ | GPT-3.5         | 95.37           | 96.30       | 87.04     |
+ | StarCoder-15.5B | 87.04           | 87.96       | 68.89     |
+ | **Qwen-7B**     | 90.74           | 92.59       | 74.07     |
+
+ ## Long-Context Understanding
+
+ We use techniques such as NTK-aware interpolation, window attention, and LogN attention scaling to extend the context length beyond the training sequence length; the model can already handle sequences longer than 8K tokens. Language-modeling experiments on the arXiv dataset show that Qwen-7B performs well in long-context settings, as the perplexity results in the table below indicate.
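+
+ In code, enabling these features amounts to flipping the corresponding switches in the model config before loading. The flag names below (`use_dynamic_ntk`, `use_logn_attn`) are our assumption of how the remote code exposes them; please verify them against the model's `config.json`:
+
+ ```python
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
+
+ config = AutoConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
+ config.use_dynamic_ntk = True   # assumed flag: NTK-aware interpolation of RoPE for long inputs
+ config.use_logn_attn = True     # assumed flag: LogN attention scaling
+
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     "Qwen/Qwen-7B", config=config, device_map="auto", trust_remote_code=True
+ ).eval()
+ ```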
+
+ <table>
+ <tr>
+ <th rowspan="2">Model</th><th colspan="5" align="center">Sequence Length</th>
+ </tr>
+ <tr>
+ <th align="center">1024</th><th align="center">2048</th><th align="center">4096</th><th align="center">8192</th><th align="center">16384</th>
+ </tr>
+ <tr>
+ <td>Qwen-7B</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center">39.35</td><td align="center">469.81</td><td align="center">2645.09</td>
+ </tr>
+ <tr>
+ <td>+ dynamic_ntk</td><td align="center"><b>4.23</b></td><td align="center"><b>3.78</b></td><td align="center">3.59</td><td align="center">3.66</td><td align="center">5.71</td>
+ </tr>
+ <tr>
+