---
license: apache-2.0
language: 
- zh
---



# Model Card for Chinese MRC roberta_wwm_ext_large
 
# Model Details
 
## Model Description
 
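A roberta_wwm_ext_large model trained on a large amount of Chinese machine reading comprehension (MRC) data. [See the GitHub repository for details](https://github.com/basketballandlearn/MRC_Competition_Dureader).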
 
- **Developed by:** luhua-rain
- **Shared by [Optional]:**  luhua-rain
- **Model type:** Question Answering 
- **Language(s) (NLP):** Chinese
- **License:** Apache 2.0 
- **Parent Model:** BERT
- **Resources for more information:**
  - [GitHub Repo](https://github.com/basketballandlearn/MRC_Competition_Dureader)
 	


# Uses
 

## Direct Use
The model authors note in the [GitHub Repo](https://github.com/basketballandlearn/MRC_Competition_Dureader):
> This MRC model can be used directly for open-domain tasks (click to try it).
 
## Downstream Use [Optional]
 
The model authors also note in the [GitHub Repo](https://github.com/basketballandlearn/MRC_Competition_Dureader)
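> Fine-tuning this model on downstream MRC / classification tasks improves results by more than 2 points / 1 point, respectively, compared with using the pretrained language model directly.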
 
## Out-of-Scope Use
 
The model should not be used to intentionally create hostile or alienating environments for people. 
 
# Bias, Risks, and Limitations
 
 
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.



## Recommendations
 
 
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.


# Training Details
 
## Training Data
 
The model authors also note in the [GitHub Repo](https://github.com/basketballandlearn/MRC_Competition_Dureader)
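> A large amount of Chinese MRC data collected from the web (including public MRC datasets and web pages crawled by the authors), covering domains such as medicine, education, entertainment, encyclopedias, military, and law.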
 
## Training Procedure

 
### Preprocessing
The model authors also note in the [GitHub Repo](https://github.com/basketballandlearn/MRC_Competition_Dureader): 
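> **Cleaning**
>
> - Discard: samples with context > 1024, samples with question > 64, and samples in which web page tags make up more than 30% of the text.
> - Re-label: if answer > 64 and the answer does not appear verbatim in the document, use fuzzy matching: compute the similarity (F1 score) between every candidate span and the answer, and take the span with the highest similarity, provided it is above the threshold (0.8).
>
> **Data labeling**
>
> Part of the collected data has no position labels and consists only of (question, document, answer) triples. For data that has an answer but no position label, positions are labeled by regular-expression matching:
>
> - If the answer span appears several times in the document, the occurrence whose surrounding context is most similar to the question is chosen as the gold answer (similarity is the F1 score; the context is the 48 characters before and the 48 characters after the answer span).
> - If the answer span appears only once, it is taken as the gold answer.
>
> Long documents are split into multiple overlapping sub-documents with a sliding window, so one document may produce several sub-documents that contain the answer.
>
> **Constructing no-answer data**
>
> Training on cross-domain data increases the domain diversity of the data and thereby improves generalization. Introducing negative samples lets the model encode as much data as possible and strengthens its ability to recognize hard samples:
>
> 1. For each question, randomly draw a context from the data and keep the corresponding title as a negative sample (50%).
> 2. For each question, delete the sentence containing the answer from its positive sample and use the result as a negative sample (20%).
> 3. For each question, retrieve the ten highest-scoring documents with BM25, then sample one context according to the scores as a negative sample; for non-entity answers, the highest-scoring context is removed (30%).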
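
The two F1-based steps quoted above (fuzzy re-labeling, and choosing among repeated answer occurrences) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `char_f1`, `fuzzy_relabel`, and `label_position` are hypothetical, F1 is assumed to be computed over characters, candidate spans are limited to lengths within an assumed 2 characters of the answer length, and plain substring search stands in for the regular-expression matching the authors mention.

```python
from collections import Counter


def char_f1(pred: str, gold: str) -> float:
    """Character-level F1 (bag-of-characters overlap) between two strings."""
    if not pred or not gold:
        return 0.0
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def fuzzy_relabel(context: str, answer: str, threshold: float = 0.8, slack: int = 2):
    """Return (start, end) of the context span most similar to `answer`,
    or None if no span scores above `threshold`.

    Brute force over span lengths len(answer) +/- `slack` (an assumption);
    acceptable for a sketch, slow for long documents.
    """
    best_score, best_span = 0.0, None
    n = len(answer)
    for length in range(max(1, n - slack), n + slack + 1):
        for start in range(len(context) - length + 1):
            score = char_f1(context[start:start + length], answer)
            if score > best_score:
                best_score, best_span = score, (start, start + length)
    return best_span if best_score > threshold else None


def label_position(question: str, context: str, answer: str, window: int = 48):
    """Return the start index of the answer occurrence to use as the label.

    With several occurrences, pick the one whose surrounding text
    (`window` characters on each side) has the highest F1 with the question.
    """
    if not answer:
        return None
    starts = []
    pos = context.find(answer)
    while pos != -1:
        starts.append(pos)
        pos = context.find(answer, pos + 1)
    if not starts:
        return None
    if len(starts) == 1:
        return starts[0]

    def context_score(start: int) -> float:
        left = context[max(0, start - window):start]
        right = context[start + len(answer):start + len(answer) + window]
        return char_f1(left + right, question)

    return max(starts, key=context_score)
```

In this sketch, `fuzzy_relabel` would be called only for long answers that are not found verbatim in the document, and `label_position` for triples that arrive without a position label.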
 
 
 


 
### Speeds, Sizes, Times
More information needed 

 
# Evaluation
 
 
## Testing Data, Factors & Metrics
 
### Testing Data
 
More information needed 
 
 
### Factors
More information needed
 
### Metrics
 
More information needed
 
 
## Results 

* The retrained models released in this repository improve substantially on reading comprehension, classification, and similar tasks<br/>
(several users have already placed in the **top 5** of competitions such as Dureader-2021 😁)

| Model / Dataset | Dureader-2021, F1-score (dev / leaderboard A) | tencentmedical, Accuracy (test-1) |
| --- | --- | --- |
| macbert-large (HIT pretrained language model) | 65.49 / 64.27 | 82.5 |
| roberta-wwm-ext-large (HIT pretrained language model) | 65.49 / 64.27 | 82.5 |
| macbert-large (ours) | 70.45 / **68.13** | **83.4** |
| roberta-wwm-ext-large (ours) | 68.91 / 66.91 | 83.1 |




 
# Model Examination
 
More information needed
 
# Environmental Impact
 
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
 
- **Hardware Type:** More information needed
- **Hours used:** More information needed
- **Cloud Provider:** More information needed
- **Compute Region:** More information needed
- **Carbon Emitted:** More information needed
 
# Technical Specifications [optional]
 
## Model Architecture and Objective

More information needed 
 
## Compute Infrastructure
 
More information needed 
 
### Hardware
 
 
More information needed
 
### Software
 
More information needed.
 
# Citation

 
**BibTeX:**
 
 
More information needed 
 
 
 
 
# Glossary [optional]
More information needed 
 
# More Information [optional]
The model authors also note in the [GitHub Repo](https://github.com/basketballandlearn/MRC_Competition_Dureader)
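> The code was run successfully before it was uploaded. There are only a few files, so if you run into errors, the cause is probably an incorrect code path or a missing package. Work through them step by step, or open an issue.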
 

 
# Model Card Authors [optional]
 
Luhua-rain in collaboration with Ezi Ozoani and the Hugging Face team


# Model Card Contact
 
The model authors also note in the [GitHub Repo](https://github.com/basketballandlearn/MRC_Competition_Dureader)
> **Collaboration:** for the related training data, models trained on more data, or teaming up for competitions, get in touch by email (luhua98@foxmail.com).

 
# How to Get Started with the Model
 
Use the code below to get started with the model.
 
<details>
<summary> Click to expand </summary>

```python
# ----- Usage -----
from transformers import AutoTokenizer, BertTokenizer, AutoModelForQuestionAnswering

model_name = "chinese_pretrain_mrc_roberta_wwm_ext_large"  # or "chinese_pretrain_mrc_macbert_large"

# Use in Transformers (loads from the Hugging Face Hub)
tokenizer = AutoTokenizer.from_pretrained(f"luhua/{model_name}")
model = AutoModelForQuestionAnswering.from_pretrained(f"luhua/{model_name}")

# Use locally: download the model and config files from https://huggingface.co/luhua,
# then uncomment the two lines below to load from ./{model_name} instead of the Hub.
# tokenizer = BertTokenizer.from_pretrained(f"./{model_name}")
# model = AutoModelForQuestionAnswering.from_pretrained(f"./{model_name}")
```
</details>
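
Once the model is loaded, a forward pass of an `AutoModelForQuestionAnswering` model returns `start_logits` and `end_logits`, one score per input token, and the answer is the token span that scores best under both. The sketch below shows one common way to pick that span. It is an illustration only: the helper name `decode_best_span` is hypothetical, the default `max_answer_len=64` is an assumption, and a production decoder would also exclude question and special tokens from the candidates.

```python
def decode_best_span(start_logits, end_logits, max_answer_len=64):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e]
    over token spans with s <= e and at most `max_answer_len` tokens.

    `start_logits` and `end_logits` are plain lists of floats, one per token.
    """
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


if __name__ == "__main__":
    # Toy logits for a 4-token input; real values would come from the model.
    print(decode_best_span([0.0, 2.0, 1.0, 0.0], [0.0, 1.0, 3.0, 0.0]))
```

With the real model, the lists would typically come from `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, and the answer text from `tokenizer.decode(inputs["input_ids"][0][start:end + 1])`, where `inputs = tokenizer(question, context, return_tensors="pt")` and `outputs = model(**inputs)`.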