yangapku committed
Commit
af9914c
1 Parent(s): 580a183

update tokenization_qwen.py

Files changed (2)
  1. README.md +5 -5
  2. tokenization_qwen.py +4 -10
README.md CHANGED
@@ -10,7 +10,7 @@ pipeline_tag: text-generation
  # Qwen-7B
 
  <p align="center">
- <img src="assets/logo.jpg" width="400"/>
+ <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo.jpg" width="400"/>
  <p>
  <br>
 
@@ -29,7 +29,7 @@ pipeline_tag: text-generation
  2. **强大的性能**:Qwen-7B在多个中英文下游评测任务上(涵盖常识推理、代码、数学、翻译等),效果显著超越现有的相近规模开源模型,甚至在部分指标上相比更大尺寸模型也有较强竞争力。具体评测结果请详见下文。
  3. **覆盖更全面的词表**:相比目前以中英词表为主的开源模型,Qwen-7B使用了约15万大小的词表。该词表对多语言更加友好,方便用户在不扩展词表的情况下对部分语种进行能力增强和扩展。
 
- 如果您想了解更多关于通义千问7B开源模型的细节,我们建议您参阅Github代码库。
+ 如果您想了解更多关于通义千问7B开源模型的细节,我们建议您参阅[Github代码库](https://github.com/QwenLM/Qwen-7B)。
 
  **Qwen-7B** is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Aibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. This repository is the one for Qwen-7B.
 
@@ -39,7 +39,7 @@ The features of Qwen-7B include:
  2. **Competitive performance**: It significantly surpasses existing open-source models of similar scale on multiple Chinese and English downstream evaluation tasks (including commonsense, reasoning, code, mathematics, etc.), and even surpasses some larger-scale models in several benchmarks. See below for specific evaluation results.
  3. **More comprehensive vocabulary coverage**: Compared with other open-source models based on Chinese and English vocabularies, Qwen-7B uses a vocabulary of over 150K tokens. This vocabulary is more friendly to multiple languages, enabling users to directly further enhance the capability for certain languages without expanding the vocabulary.
 
- For more details about the open-source model of Qwen-7B, please refer to the Github code repository.
+ For more details about the open-source model of Qwen-7B, please refer to the [Github](https://github.com/QwenLM/Qwen-7B) code repository.
 
  ## 依赖项 (Dependency)
 
@@ -83,9 +83,9 @@ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
  # 蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是亚的斯亚贝巴(Addis Ababa)...
  ```
 
- 关于更多的使用说明,请参考我们的Github repo获取更多信息。
+ 关于更多的使用说明,请参考我们的[Github repo](https://github.com/QwenLM/Qwen-7B)获取更多信息。
 
- For more information, please refer to our Github repo for more information.
+ For more information, please refer to our [Github repo](https://github.com/QwenLM/Qwen-7B) for more information.
 
  ## 模型细节 (Model)
 
tokenization_qwen.py CHANGED
@@ -20,7 +20,7 @@ from transformers import PreTrainedTokenizer, AddedToken
 
 logger = logging.getLogger(__name__)
 
- TIKTOKEN_NAME = "qwen.tiktoken"
+ VOCAB_FILES_NAMES = {"vocab_file": "qwen.tiktoken"}
 
 
 class QWenTokenizer(PreTrainedTokenizer):
@@ -28,17 +28,11 @@ class QWenTokenizer(PreTrainedTokenizer):
 
     """NOTE: This tokenizer will not handle special tokens to avoid injection attacks"""
 
-     @classmethod
-     def from_pretrained(
-         cls, pretrained_model_name_or_path, cache_dir=None, *inputs, **kwargs
-     ):
-         merges_file = os.path.join(pretrained_model_name_or_path, TIKTOKEN_NAME)
-         tokenizer = cls(merges_file, *inputs, **kwargs)
-         return tokenizer
+     vocab_files_names = VOCAB_FILES_NAMES
 
     def __init__(
         self,
-         merges_file,
+         vocab_file,
         errors="replace",
         max_len=None,
         unk_token="<|endoftext|>",
@@ -113,7 +107,7 @@ class QWenTokenizer(PreTrainedTokenizer):
             )
         }
 
-         mergeable_ranks = load_tiktoken_bpe(merges_file)
+         mergeable_ranks = load_tiktoken_bpe(vocab_file)
         special_tokens = {
             token: index
             for index, token in enumerate(special_tokens, start=len(mergeable_ranks))
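
In effect, the commit drops the custom `from_pretrained` override: instead of joining `pretrained_model_name_or_path` with a hard-coded `TIKTOKEN_NAME`, the tokenizer now declares its vocabulary file through the standard `vocab_files_names` class attribute, so the base `PreTrainedTokenizer.from_pretrained` machinery resolves `qwen.tiktoken` (downloading it from the Hub when needed) and passes the local path to `__init__` as `vocab_file`. Below is a minimal usage sketch after this change; the repository id and the `trust_remote_code=True` flag are assumptions for illustration, not part of the diff:

```python
# Minimal sketch (not part of this commit): loading the tokenizer after the change.
# PreTrainedTokenizer.from_pretrained consults QWenTokenizer.vocab_files_names,
# fetches "qwen.tiktoken", and passes the local path to __init__(vocab_file=...).
from transformers import AutoTokenizer

# The repo id below is an assumption for illustration; trust_remote_code is needed
# because the QWenTokenizer class is defined inside the model repository itself.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

ids = tokenizer.encode("蒙古国的首都是乌兰巴托")
print(ids)
print(tokenizer.decode(ids, skip_special_tokens=True))
```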