init commit

- README.md +111 -0
- config.json +31 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- spiece.model +3 -0
- tokenizer_config.json +1 -0

README.md
CHANGED
@@ -1,3 +1,114 @@
---
license: apache-2.0
language: zh
tags:
- Text2Text Generation
- T5
- chinese
- sentencepiece
inference: true
widget:
- text: "新闻分类任务:【微软披露拓扑量子计算机计划!】这篇文章的类别是什么?故事/文化/娱乐/体育/财经/房产/汽车/教育/科技"
- type: "text-generation"
---

# Randeng-T5-784M-MultiTask-Chinese

- Github: [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM)
- Docs: [Fengshenbang-Docs](https://fengshenbang-doc.readthedocs.io/)

## 简介 Brief Introduction

在Randeng-T5-784M的基础上,收集了100个左右的中文数据集,进行Text2Text统一范式的有监督任务预训练。

Building on Randeng-T5-784M, we collected about 100 Chinese datasets and performed supervised pre-training on them under a unified Text2Text paradigm.

## 模型分类 Model Taxonomy

| 需求 Demand | 任务 Task | 系列 Series | 模型 Model | 参数 Parameter | 额外 Extra |
| :----: | :----: | :----: | :----: | :----: | :----: |
| 通用 General | 自然语言转换 NLT | 燃灯 Randeng | MultiTask | 784M | 多任务-中文 MultiTask-Chinese |

## 模型信息 Model Information

参考论文 Reference paper: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](http://jmlr.org/papers/v21/20-074.html)

基于[Randeng-T5-784M](https://huggingface.co/IDEA-CCNL/Randeng-T5-784M),我们在收集的100+个中文领域的多任务数据集(从中采样了30w+个样本)上微调了它,得到了此多任务版本。这些多任务包括:情感分析,新闻分类,文本分类,意图识别,自然语言推理,多项选择,指代消解,抽取式阅读理解,实体识别,关键词抽取,生成式摘要。

Based on [Randeng-T5-784M](https://huggingface.co/IDEA-CCNL/Randeng-T5-784M), we fine-tuned it on a collection of 100+ Chinese multi-task datasets (from which 300k+ samples were drawn) to obtain this multi-task version. The tasks include: sentiment analysis, news classification, text classification, intent recognition, natural language inference, multiple choice, coreference resolution, extractive reading comprehension, entity recognition, keyword extraction, and abstractive summarization.
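
Each task is cast into the same text-to-text format by prefixing the input with a task description, as in the widget example above. A minimal sketch of what such prompts look like; the news-classification prompt is taken verbatim from this card, while the sentiment-analysis prompt is only an illustrative assumption about the pattern, not necessarily the exact format used in training:

```python
# Task prompts follow a "task name: input + question + options" pattern.
# The first prompt is this card's own example; the second is hypothetical.
news_prompt = (
    "新闻分类任务:【微软披露拓扑量子计算机计划!】这篇文章的类别是什么?"
    "故事/文化/娱乐/体育/财经/房产/汽车/教育/科技"
)
sentiment_prompt = (  # hypothetical format, shown only to illustrate the pattern
    "情感分析任务:【这家餐厅的菜很好吃,服务也不错!】这段话的情感是正面还是负面?"
)
```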
40 |
+
|
41 |
+
|
42 |
+
## 使用 Usage
|
43 |
+
|
44 |
+
```python
|
45 |
+
import torch
|
46 |
+
from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration
|
47 |
+
|
48 |
+
# load tokenizer and model
|
49 |
+
pretrained_model = "IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese"
|
50 |
+
|
51 |
+
special_tokens = ["<extra_id_{}>".format(i) for i in range(100)]
|
52 |
+
tokenizer = T5Tokenizer.from_pretrained(
|
53 |
+
args.pretrained_model,
|
54 |
+
do_lower_case=True,
|
55 |
+
max_length=512,
|
56 |
+
truncation=True,
|
57 |
+
additional_special_tokens=special_tokens,
|
58 |
+
)
|
59 |
+
config = T5Config.from_pretrained(args.pretrained_model)
|
60 |
+
model = T5ForConditionalGeneration.from_pretrained(args.pretrained_model, config=config)
|
61 |
+
model.resize_token_embeddings(len(tokenizer))
|
62 |
+
model.eval()
|
63 |
+
|
64 |
+
# tokenize
|
65 |
+
text = "新闻分类任务:【微软披露拓扑量子计算机计划!】这篇文章的类别是什么?故事/文化/娱乐/体育/财经/房产/汽车/教育/科技"
|
66 |
+
encode_dict = tokenizer(text, max_length=512, padding='max_length',truncation=True)
|
67 |
+
|
68 |
+
inputs = {
|
69 |
+
"input_ids": torch.tensor(encode_dict['input_ids']).long(),
|
70 |
+
"attention_mask": torch.tensor(encode_dict['attention_mask']).long(),
|
71 |
+
}
|
72 |
+
|
73 |
+
# generate answer
|
74 |
+
logits = model.generate(
|
75 |
+
input_ids = inputs['input_ids'],
|
76 |
+
max_length=100,
|
77 |
+
do_sample= True
|
78 |
+
# early_stopping=True,
|
79 |
+
)
|
80 |
+
|
81 |
+
logits=logits[:,1:]
|
82 |
+
predict_label = [tokenizer.decode(i,skip_special_tokens=True) for i in logits]
|
83 |
+
|
84 |
+
# model Output: 科技
|
85 |
+
```
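
The same model can also be driven through the standard `transformers` text2text-generation pipeline, which handles tokenization, generation, and decoding internally. A minimal sketch, assuming the default generation settings are acceptable (note it does not re-add the `<extra_id_*>` sentinels, which matter only for span-infilling inputs):

```python
from transformers import pipeline

# text2text-generation wraps T5ForConditionalGeneration and its tokenizer
generator = pipeline("text2text-generation",
                     model="IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese")

result = generator(
    "新闻分类任务:【微软披露拓扑量子计算机计划!】这篇文章的类别是什么?"
    "故事/文化/娱乐/体育/财经/房产/汽车/教育/科技",
    max_length=100,
)
print(result[0]["generated_text"])  # expected: 科技
```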

## 引用 Citation

如果您在您的工作中使用了我们的模型,可以引用我们的[论文](https://arxiv.org/abs/2209.02970):

If you use this resource in your work, please cite our [paper](https://arxiv.org/abs/2209.02970):

```text
@article{fengshenbang,
  author  = {Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen and Ruyi Gan and Jiaxing Zhang},
  title   = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
  journal = {CoRR},
  volume  = {abs/2209.02970},
  year    = {2022}
}
```

也可以引用我们的[网站](https://github.com/IDEA-CCNL/Fengshenbang-LM/):

You can also cite our [website](https://github.com/IDEA-CCNL/Fengshenbang-LM/):

```text
@misc{Fengshenbang-LM,
  title = {Fengshenbang-LM},
  author = {IDEA-CCNL},
  year = {2021},
  howpublished = {\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}}
}
```
config.json
ADDED
@@ -0,0 +1,31 @@

{
  "_name_or_path": "/cognitive_comp/wuxiaojun/pretrained/pytorch/Randeng-T5-784M",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 2816,
  "d_kv": 64,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "max_length": 200,
  "model_type": "t5",
  "num_decoder_layers": 24,
  "num_heads": 16,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "torch_dtype": "float32",
  "transformers_version": "4.18.0",
  "use_cache": true,
  "vocab_size": 32596
}
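
As a sanity check, the "784M" in the model name can be recovered from these hyperparameters. A rough sketch of the arithmetic (it ignores the comparatively tiny LayerNorm and relative-position-bias parameters):

```python
# rough parameter count from config.json (T5 linear layers carry no biases)
d_model, d_ff, d_kv, heads = 1024, 2816, 64, 16
enc_layers = dec_layers = 24
vocab = 32596

attn = 4 * d_model * (heads * d_kv)   # q, k, v, o projections
ffn  = 3 * d_model * d_ff             # gated-gelu FFN: wi_0, wi_1, wo
enc  = enc_layers * (attn + ffn)      # self-attention + FFN per encoder block
dec  = dec_layers * (2 * attn + ffn)  # self- and cross-attention + FFN per decoder block
emb  = 2 * vocab * d_model            # untied input embedding and lm_head

print(f"{(enc + dec + emb) / 1e6:.0f}M")  # -> 784M
```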

pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:2f7231e9482940b12b4e067030c9f25b0aee562a88ab9d1683fc1612febcc30e
size 3136623589
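
The weights themselves live in Git LFS; only this pointer file is committed. The stated size is consistent with the float32 parameter count estimated above (the small remainder plausibly covers the omitted LayerNorm/position-bias weights plus serialization metadata, an inference on our part):

```python
print(783_982_592 * 4)  # 3_135_930_368 bytes, vs. the 3_136_623_589-byte checkpoint
```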

special_tokens_map.json
ADDED
@@ -0,0 +1 @@

{"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}

spiece.model
ADDED
@@ -0,0 +1,3 @@

version https://git-lfs.github.com/spec/v1
oid sha256:c65feffa65ff0378759778193852083d23349cb1b40c906e9463a12f8076ff32
size 680811

tokenizer_config.json
ADDED
@@ -0,0 +1 @@

{"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "extra_ids": 0, "additional_special_tokens": [], "sp_model_kwargs": {}, "name_or_path": "/cognitive_comp/wuxiaojun/pretrained/pytorch/Randeng-T5-784M", "do_lower_case": true, "max_length": 1024, "truncation": true, "special_tokens_map_file": "/cognitive_comp/wuxiaojun/pretrained/pytorch/Randeng-T5-784M/special_tokens_map.json", "tokenizer_class": "T5Tokenizer"}
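
Note that `extra_ids` is 0 and `additional_special_tokens` is empty, which is why the README's usage snippet re-adds the 100 `<extra_id_*>` sentinels and then calls `model.resize_token_embeddings(len(tokenizer))`. A quick way to see the effect (a sketch; the printed delta is the expected value, not captured output):

```python
from transformers import T5Tokenizer

repo = "IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese"

# as shipped: extra_ids=0, so no <extra_id_*> sentinels in the vocabulary
plain = T5Tokenizer.from_pretrained(repo)

# as used in the README: 100 sentinels appended as additional special tokens
sentinels = ["<extra_id_{}>".format(i) for i in range(100)]
patched = T5Tokenizer.from_pretrained(repo, additional_special_tokens=sentinels)

print(len(patched) - len(plain))  # 100
```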