wenge-research
committed on
Commit
·
448cc18
1
Parent(s):
6a0d7fc
Update README.md
README.md
CHANGED
@@ -4,7 +4,7 @@ license: apache-2.0
|
|
4 |
---
|
5 |
license: apache-2.0
|
6 |
---
|
7 |
-
# 雅意IE
|
8 |
|
9 |
<div align="center">
|
10 |
<img src="./assets/yayi_dark_small.png" alt="YaYi" style="width: 30%; display: block; margin: auto;">
|
@@ -16,22 +16,24 @@ license: apache-2.0
|
|
16 |
</div>
|
17 |
|
18 |
|
19 |
-
##
|
20 |
-
|
21 |
|
22 |
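By open-sourcing the YAYI IE large model, we hope to contribute to the growth of the Chinese pre-trained large model open-source community and to build the YAYI large model ecosystem together with every partner.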
|
|
|
23 |
|
24 |
-
|
|
|
|
|
25 |
|
26 |
-
|
27 |
-
|
28 |
-
| --------- | --------- | --------- |
|
29 |
-
| YAYI-UIE | wenge-research/yayi-uie | [模型下载](https://huggingface.co/wenge-research/yayi-uie) |
|
30 |
|
|
|
31 |
|
|
|
32 |
|
33 |
-
####
|
34 |
-
Below is a simple example of calling `YAYI-UIE` for downstream-task inference. It runs on a single GPU such as an A100/A800 and uses about 32 GB of GPU memory with BF16 inference:
|
35 |
```python
|
36 |
import torch
|
37 |
from transformers import AutoModelForCausalLM, AutoTokenizer
|
@@ -47,11 +49,7 @@ response = model.generate(**inputs, max_new_tokens=512, temperature=0)
|
|
47 |
print(tokenizer.decode(response[0],skip_special_tokens=True))
|
48 |
```
|
49 |
|
50 |
-
####
|
51 |
-
Notes:
|
52 |
-
- Optionally, prefix the instruction with the specific task type in full-width brackets 【】.
|
53 |
-
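- To help the model extract more complete information, use fine-grained labels in the instruction where possible, such as “会见地点” (place of meeting) or “会议地点” (conference venue), rather than the generic “地点” (location).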
|
54 |
-
- Where possible, place the input text first and the instruction after it.
|
55 |
|
56 |
1. 实体抽取任务
|
57 |
```
|
@@ -72,17 +70,32 @@ print(tokenizer.decode(response[0],skip_special_tokens=True))
|
|
72 |
文本:xx
|
73 |
已知论元角色列表是[质押方,披露时间,质权方,质押物,质押股票/股份数量,事件时间,质押物所属公司,质押物占总股比,质押物占持股比],请根据论元角色列表从给定的输入中抽取可能的论元,以json{角色:论元,}格式输出。
|
74 |
```
|
|
|
|
|
75 |
```
|
76 |
-
|
77 |
-
|
78 |
```
|
79 |
|
80 |
-
|
81 |
-
|
|
|
82 |
|
83 |
AI, Literature, Music, Politics and Science are English datasets; boson, clue and weibo are Chinese datasets.
|
|
|
84 |
|
85 |
-
|
|
86 |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
|
87 |
| davinci | 2.97 | 9.87 | 13.83 | 18.42 | 10.04 | 11.03 | - | - | - | - |
|
88 |
| ChatGPT 3.5 | **54.4** | **54.07** | **61.24** | **59.12** | **63** | **58.37** | 38.53 | 25.44 | 29.3 | 31.09 |
|
@@ -92,11 +105,13 @@ AI,Literature,Music,Politics,Science为英文数据集,boson,clue,
|
|
92 |
| DeepKE-LLM | 13.76 | 20.18 | 14.78 | 33.86 | 9.19 | 18.35 | 25.96 | 4.44 | 25.2 | 18.53 |
|
93 |
| YAYI-UIE | 52.4 | 45.99 | 51.2 | 51.82 | 50.53 | 50.39 | **49.25** | **36.46** | 36.78 | **40.83** |
|
94 |
|
95 |
-
2. RE
|
96 |
|
97 |
FewRe and Wiki-ZSL are English datasets; SKE 2020, COAE2016 and IPRE are Chinese datasets.
|
98 |
|
99 |
-
|
|
|
|
|
100 |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
|
101 |
| ChatGPT 3.5 | 9.96 | 13.14 | 11.55 | 24.47 | 19.31 | 6.73 | 16.84 |
|
102 |
| ZETT(T5-small) | 30.53 | 31.74 | 31.14 | - | - | - | - |
|
@@ -105,11 +120,13 @@ FewRe,Wiki-ZSL为英文数据集, SKE 2020,COAE2016,IPRE为中文数据
|
|
105 |
| DeepKE-LLM | 17.46 | 15.33 | 16.40 | 0.4 | 6.56 | 9.75 |5.57|
|
106 |
| YAYI-UIE | 36.09 | **41.07** | **38.58** | **70.8** | **19.97** | **22.97**| **37.91**|
|
107 |
|
108 |
-
3. EE
|
109 |
|
110 |
commodity news is an English dataset; FewFC and ccf_law are Chinese datasets.
|
111 |
|
112 |
-
|
|
|
|
|
113 |
|
114 |
| Model | commodity news | FewFC | ccf_law | ZH Average |
|
115 |
| ------ | ------ | ------ | ------ | ------ |
|
@@ -118,7 +135,7 @@ EET(事件类型判别)
|
|
118 |
|InstructUIE| **23.26** | - | - | - |
|
119 |
| YAYI-UIE | 12.45 | **81.28** | **12.87** | **47.08**|
|
120 |
|
121 |
-
EEA
|
122 |
|
123 |
| Model | commodity news | FewFC | ccf_law | ZH Average |
|
124 |
| ------ | ------ | ------ | ------ | ------ |
|
@@ -127,31 +144,35 @@ EEA(事件论元抽取)
|
|
127 |
|InstructUIE| **21.78** | - | - | - |
|
128 |
| YAYI-UIE | 19.74 | **63.06** | 59.42 | **61.24** |
|
129 |
|
|
|
130 |
|
131 |
<div align="center">
|
132 |
<br>
|
133 |
|
134 |
![零样本推理性能分布](./assets/zh-0shot.png)
|
135 |
</div>
|
136 |
-
##
|
137 |
-
####
|
138 |
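The SFT model, trained on the current data and base model, still has the following limitations:
|
139 |
|
140 |
1. The extracted information may be factually incorrect.
|
141 |
2. The model cannot reliably identify harmful instructions and may produce harmful statements.
|
142 |
3. The model's extraction capability still needs improvement on long, paragraph-level texts.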
|
143 |
|
144 |
-
|
145 |
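Given the model limitations above, we require developers to use our open-source code, data, models, and any derivatives generated from this project for research purposes only, and not for commercial purposes or other uses that could harm society. Please be careful in discerning and using content generated by the YAYI large model, and do not spread harmful generated content on the internet. Whoever spreads such content bears responsibility for any adverse consequences.
|
146 |
This project may be used for research purposes only. The project developers assume no liability for any harm or loss resulting from the use of this project (including but not limited to data, models, and code). For details, please refer to the disclaimer.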
|
147 |
|
148 |
-
|
149 |
-
|
150 |
-
|
151 |
-
## 更新日志
|
152 |
-
- [2023/12/07] The YAYI IE large model is officially released, with the 30B model weights open-sourced.
|
153 |
|
154 |
-
|
155 |
-
|
156 |
-
- Distributed training for this project uses Microsoft's [DeepSpeed](https://github.com/microsoft/deepspeed) distributed training tool and the [ZeRO stage 2](https://huggingface.co/docs/transformers/main_classes/deepspeed#zero2-config) configuration file from the Hugging Face transformers documentation;
|
157 |
-
- We are very grateful to the following open-source projects for their help: [InstructUIE](https://github.com/BeyonderXX/InstructUIE/tree/master); [InstructIE](https://github.com/zjunlp/DeepKE/tree/main/example/llm/InstructKGC); [DeepKE-LLM](https://github.com/zjunlp/KnowLM/tree/main)
|
|
|
4 |
---
|
5 |
license: apache-2.0
|
6 |
---
|
7 |
+
# 雅意IE大模型/YAYI UIE
|
8 |
|
9 |
<div align="center">
|
10 |
<img src="./assets/yayi_dark_small.png" alt="YaYi" style="width: 30%; display: block; margin: auto;">
|
|
|
16 |
</div>
|
17 |
|
18 |
|
19 |
+
## 介绍/Introduction
|
20 |
+
雅意信息抽取统一大模型 (YAYI-UIE)在百万级人工构造的高质量信息抽取数据上进行指令微调得到,统一训练信息抽取任务包括命名实体识别(NER),关系抽取(RE)和事件抽取(EE),实现通用、安全、金融、生物、医疗、商业、个人、车辆、电影、工业、餐厅、科学等场景下结构化抽取。
|
21 |
|
22 |
通过雅意IE大模型的开源为促进中文预训练大模型开源社区的发展,贡献自己的一份力量,通过开源,与每一位合作伙伴共建雅意大模型生态。
|
23 |
+
模型下载地址是 https://huggingface.co/wenge-research/yayi-uie
|
24 |
|
25 |
+
The YAYI Unified Information Extraction Large Language Model (YAYI-UIE), instruction-tuned on millions of manually constructed, high-quality information extraction examples, is trained jointly on Named Entity
|
26 |
+
Recognition (NER), Relation Extraction (RE), and Event Extraction (EE). The model is able to extract structured outputs across diverse fields including general, security,
|
27 |
+
finance, biology, medicine, business, personal, automotive, film, industry, restaurant, and science.
|
28 |
|
29 |
+
The open-source release of YAYI-UIE aims to foster the growth of the Chinese PLM open-source community. We look forward to collaborating with our partners to develop the YAYI Large
|
30 |
+
Models ecosystem!
|
|
|
|
|
31 |
|
32 |
+
![instruction](./assets/YAYI-UIE-1.png)
|
33 |
|
34 |
+
The download link is https://huggingface.co/wenge-research/yayi-uie
|
35 |
|
36 |
+
#### 模型推理/Model Inference
|
|
|
37 |
```python
|
38 |
import torch
|
39 |
from transformers import AutoModelForCausalLM, AutoTokenizer
|
|
|
49 |
print(tokenizer.decode(response[0],skip_special_tokens=True))
|
50 |
```
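
The decoded string returned by `tokenizer.decode` normally contains the prompt followed by the model's answer, and the prompt itself contains braces. The following is a minimal sketch of how the structured answer could be recovered, assuming the model replies with a JSON-like object or list as requested in the prompt. The helper name `parse_extraction` is hypothetical and not part of this repository.

```python
import ast
import json


def parse_extraction(decoded, prompt):
    """Return the parsed JSON-like answer from a decoded generation, or None.

    Assumption: `decoded` starts with `prompt`; if it does not, the whole
    string is scanned, which may pick up braces from the prompt itself.
    """
    completion = decoded[len(prompt):] if decoded.startswith(prompt) else decoded

    # Locate the first opening brace or bracket in the completion.
    starts = [i for i in (completion.find("{"), completion.find("[")) if i != -1]
    if not starts:
        return None
    start = min(starts)
    closer = "}" if completion[start] == "{" else "]"
    end = completion.rfind(closer)
    if end < start:
        return None
    candidate = completion[start:end + 1]

    # Try strict JSON first, then fall back to Python literal syntax
    # (single quotes, trailing commas inside lists).
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass
    try:
        return ast.literal_eval(candidate)
    except (ValueError, SyntaxError):
        return None
```

Returning `None` instead of raising keeps a batch evaluation loop running when a single generation is malformed.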
|
51 |
|
52 |
+
#### 指令样例/Sample Prompts
|
53 |
|
54 |
1. 实体抽取任务
|
55 |
```
|
|
|
70 |
文本:xx
|
71 |
已知论元角色列表是[质押方,披露时间,质权方,质押物,质押股票/股份数量,事件时间,质押物所属公司,质押物占总股比,质押物占持股比],请根据论元角色列表从给定的输入中抽取可能的论元,以json{角色:论元,}格式输出。
|
72 |
```
|
73 |
+
|
74 |
+
1. NER
|
75 |
```
|
76 |
+
Text:
|
77 |
+
From the given text, extract all the entities and types. Please format the answer in json {person/organization/location:[entities]}.
|
78 |
+
```
|
79 |
+
2. RE
|
80 |
+
```
|
81 |
+
Text:
|
82 |
+
From the given text, extract the possible head entities (subjects) and tail entities (objects) and give the corresponding relation triples.
|
83 |
+
The relations are [country of administrative divisions,place of birth,location contains]. Output the result in json[{'relation':'', 'head':'', 'tail':''}, ].
|
84 |
+
```
|
85 |
+
3. EE
|
86 |
+
```
|
87 |
+
Text:
|
88 |
+
Given the text and the role list [seller, place, beneficiary, buyer], identify event arguments and roles, provide your answer in the format of json{role:name}.
|
89 |
```
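
The English sample prompts above share one pattern: the input text first, then an instruction that lists the target types, relations, or roles. A minimal sketch of building them programmatically follows. The function names are hypothetical, and placing the input directly after `Text: ` on the same line is an assumption, since the samples leave the text blank.

```python
def build_ner_prompt(text, entity_types):
    """NER prompt in the style of the sample above."""
    type_list = "/".join(entity_types)
    return (
        f"Text: {text}\n"
        "From the given text, extract all the entities and types. "
        f"Please format the answer in json {{{type_list}:[entities]}}."
    )


def build_re_prompt(text, relations):
    """RE prompt in the style of the sample above."""
    relation_list = ",".join(relations)
    return (
        f"Text: {text}\n"
        "From the given text, extract the possible head entities (subjects) "
        "and tail entities (objects) and give the corresponding relation triples.\n"
        f"The relations are [{relation_list}]. "
        "Output the result in json[{'relation':'', 'head':'', 'tail':''}, ]."
    )


def build_ee_prompt(text, roles):
    """EE prompt in the style of the sample above."""
    role_list = ", ".join(roles)
    return (
        f"Text: {text}\n"
        f"Given the text and the role list [{role_list}], identify event arguments and roles, "
        "provide your answer in the format of json{role:name}."
    )


if __name__ == "__main__":
    print(build_ner_prompt("Alice works at Acme in Paris.",
                           ["person", "organization", "location"]))
```

The resulting string can be passed to the tokenizer in the inference example in place of a hand-written prompt.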
|
90 |
|
91 |
+
|
92 |
+
## 模型zero-shot评测/Zero-shot Evaluation
|
93 |
+
1. NER任务/NER tasks
|
94 |
|
95 |
AI,Literature,Music,Politics,Science为英文数据集,boson,clue,weibo为中文数据集
|
96 |
+
AI, Literature, Music, Politics and Science are English datasets; boson, clue and weibo are Chinese datasets.
|
97 |
|
98 |
+
| Model | AI | Literature | Music | Politics | Science | EN Average | boson | clue | weibo | ZH Average |
|
99 |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
|
100 |
| davinci | 2.97 | 9.87 | 13.83 | 18.42 | 10.04 | 11.03 | - | - | - | - |
|
101 |
| ChatGPT 3.5 | **54.4** | **54.07** | **61.24** | **59.12** | **63** | **58.37** | 38.53 | 25.44 | 29.3 | 31.09 |
|
|
|
105 |
| DeepKE-LLM | 13.76 | 20.18 | 14.78 | 33.86 | 9.19 | 18.35 | 25.96 | 4.44 | 25.2 | 18.53 |
|
106 |
| YAYI-UIE | 52.4 | 45.99 | 51.2 | 51.82 | 50.53 | 50.39 | **49.25** | **36.46** | 36.78 | **40.83** |
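
The EN Average and ZH Average columns are consistent with unweighted means of the per-dataset scores. A quick check on the YAYI-UIE row of the NER table (the helper name `mean_score` is only illustrative):

```python
def mean_score(scores):
    """Unweighted mean of per-dataset scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


# YAYI-UIE scores copied from the NER table.
yayi_en = [52.4, 45.99, 51.2, 51.82, 50.53]  # AI, Literature, Music, Politics, Science
yayi_zh = [49.25, 36.46, 36.78]              # boson, clue, weibo

print(mean_score(yayi_en))  # 50.39, matching the EN Average column
print(mean_score(yayi_zh))  # 40.83, matching the ZH Average column
```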
|
107 |
|
108 |
+
2. RE任务/RE Tasks
|
109 |
|
110 |
FewRe,Wiki-ZSL为英文数据集, SKE 2020,COAE2016,IPRE为中文数据集
|
111 |
|
112 |
+
FewRe and Wiki-ZSL are English datasets; SKE 2020, COAE2016 and IPRE are Chinese datasets
|
113 |
+
|
114 |
+
| Model | FewRe | Wiki-ZSL | EN Average | SKE 2020 | COAE2016 | IPRE | ZH Average |
|
115 |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
|
116 |
| ChatGPT 3.5 | 9.96 | 13.14 | 11.55 | 24.47 | 19.31 | 6.73 | 16.84 |
|
117 |
| ZETT(T5-small) | 30.53 | 31.74 | 31.14 | - | - | - | - |
|
|
|
120 |
| DeepKE-LLM | 17.46 | 15.33 | 16.40 | 0.4 | 6.56 | 9.75 |5.57|
|
121 |
| YAYI-UIE | 36.09 | **41.07** | **38.58** | **70.8** | **19.97** | **22.97**| **37.91**|
|
122 |
|
123 |
+
3. EE任务/EE Tasks
|
124 |
|
125 |
commodity news为英文数据集,FewFC,ccf_law为中文数据集
|
126 |
|
127 |
+
commodity news is an English dataset; FewFC and ccf_law are Chinese datasets.
|
128 |
+
|
129 |
+
EET(事件类型判别 Event Type Extraction)
|
130 |
|
131 |
| Model | commodity news | FewFC | ccf_law | ZH Average |
|
132 |
| ------ | ------ | ------ | ------ | ------ |
|
|
|
135 |
|InstructUIE| **23.26** | - | - | - |
|
136 |
| YAYI-UIE | 12.45 | **81.28** | **12.87** | **47.08**|
|
137 |
|
138 |
+
EEA(事件论元抽取 Event Arguments Extraction)
|
139 |
|
140 |
| Model | commodity news | FewFC | ccf_law | ZH Average |
|
141 |
| ------ | ------ | ------ | ------ | ------ |
|
|
|
144 |
|InstructUIE| **21.78** | - | - | - |
|
145 |
| YAYI-UIE | 19.74 | **63.06** | 59.42 | **61.24** |
|
146 |
|
147 |
+
The chart below illustrates the performance of our model on Chinese IE tasks in the zero-shot setting.
|
148 |
|
149 |
<div align="center">
|
150 |
<br>
|
151 |
|
152 |
![零样本推理性能分布](./assets/zh-0shot.png)
|
153 |
</div>
|
154 |
+
## 相关协议/Terms and Conditions
|
155 |
+
#### 局限性/Limitations
|
156 |
基于当前数据和基础模型训练得到的SFT模型,在效果上仍存在以下问题:
|
157 |
|
158 |
1. 抽取的信息可能会产生违背事实的错误回答。
|
159 |
2. 对于具备危害性的指令无法很好的鉴别,可能会产生危害性言论。
|
160 |
3. 在一些涉及段落级长文本的场景下模型的抽取能力仍有待提高。
|
161 |
|
162 |
+
|
163 |
+
The SFT model, trained on the current data and base model, still has the following limitations:
|
164 |
+
|
165 |
+
1. The extracted information may be factually incorrect.
|
166 |
+
2. It struggles to effectively discern harmful instructions, potentially resulting in hazardous statements.
|
167 |
+
3. The model's extraction capability still needs improvement on long, paragraph-level texts.
|
168 |
+
|
169 |
+
#### 免责声明/Disclaimer
|
170 |
基于以上模型局限性,我们要求开发者仅将我们开源的代码、数据、模型及后续用此项目生成的衍生物用于研究目的,不得用于商业用途,以及其他会对社会带来危害的用途。请谨慎鉴别和使用雅意大模型生成的内容,请勿将生成的有害内容传播至互联网。若产生不良后果,由传播者自负。
|
171 |
本项目仅可应用于研究目的,项目开发者不承担任何因使用本项目(包含但不限于数据、模型、代码等)导致的危害或损失。详细请参考免责声明。
|
172 |
|
173 |
+
Given the limitations of the model outlined above, we require developers to use the code, data, models, and any derivatives generated from this project solely for research
|
174 |
+
purposes. They must not be used for commercial purposes or other applications that could harm society. Users should be careful in discerning and utilizing content generated
|
175 |
+
by YAYI UIE, and avoid distributing harmful content on the internet. Whoever spreads such content bears sole responsibility for any adverse consequences.
|
|
|
|
|
176 |
|
177 |
+
This project is intended only for research purposes. The project developers are not liable for any harm or loss resulting from the use of this project, including but not
|
178 |
+
limited to data, models, and code. For more details, please refer to the disclaimer.