---
license: apache-2.0
language:
  - zh
---

Model Card for Chinese MRC roberta_wwm_ext_large

Model Details

Model Description

A roberta_wwm_ext_large model trained on a large amount of Chinese MRC (machine reading comprehension) data; see the GitHub repo for details.

  • Developed by: luhua-rain
  • Shared by [Optional]: luhua-rain
  • Model type: Question Answering
  • Language(s) (NLP): Chinese
  • License: Apache 2.0
  • Parent Model: BERT
  • Resources for more information:

Uses

Direct Use

The model authors note in the GitHub Repo:

This MRC model can be used directly for open-domain question answering; a click-to-try demo is linked from the repo.

Downstream Use [Optional]

The model authors also note in the GitHub Repo:

Fine-tuning this model on downstream MRC / classification tasks improves results by more than 2 points / 1 point, respectively, compared with fine-tuning the plain pretrained language model directly.
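
A minimal sketch of what such downstream fine-tuning looks like with Transformers. The toy inputs and span positions below are illustrative placeholders, not the authors' training setup:

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_id = "luhua/chinese_pretrain_mrc_roberta_wwm_ext_large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)

# One toy (question, context) pair; real fine-tuning iterates over task data.
enc = tokenizer("示例问题?", "这是一段用于演示的示例文档。", return_tensors="pt")
# start/end positions index the answer span in the tokenized input (toy values here).
out = model(**enc, start_positions=torch.tensor([7]), end_positions=torch.tensor([9]))
out.loss.backward()  # standard span-extraction loss, ready for an optimizer step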

Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

Training Details

Training Data

The model authors also note in the GitHub Repo:

A large amount of Chinese MRC data collected from the web (including public MRC datasets as well as self-crawled web pages), covering domains such as medicine, education, entertainment, encyclopedias, the military, and law.

Training Procedure

Preprocessing

The model authors also note in the GitHub Repo:

Cleaning and filtering: examples are discarded if the context exceeds 1,024 characters, the question exceeds 64 characters, or HTML tags make up more than 30% of the page. Re-annotation: if an answer exceeds 64 characters and does not appear verbatim in the document, fuzzy matching is applied: the similarity (F1 score) between every candidate span and the answer is computed, and the most similar span is kept, provided its similarity exceeds a threshold of 0.8.

Position labeling: part of the collected data carries no position labels and exists only as (question, passage, answer) triples. For such data, answer positions are annotated by pattern matching:

  • If the answer span occurs several times in the passage, the occurrence whose surrounding context is most similar to the question is chosen as the gold answer (similarity is measured with F1, taking the 48 characters before and the 48 characters after the span as its context).
  • If the answer span occurs exactly once, it is taken as the gold answer by default.

A sliding window splits long documents into multiple overlapping sub-documents, so one document may yield several answerable sub-documents.

No-answer data construction: training on cross-domain data increases domain diversity and thus the model's ability to generalize, while introducing negative samples forces the model to encode as much data as possible and strengthens its ability to recognize hard samples:

  1.) For each question, randomly draw a context from the data and keep its title, as a negative sample (50%).
  2.) For each question, delete the sentence containing the answer from its positive passage and use the remainder as a negative sample (20%).
  3.) For each question, retrieve the ten highest-scoring documents with BM25, then sample one context according to its score as a negative sample; for non-entity answers, the top-scoring context is excluded (30%).
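
The F1-based fuzzy matching described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: it assumes character-level F1 and, as a simplification, only scores windows of the same length as the answer.

from collections import Counter

def char_f1(candidate: str, answer: str) -> float:
    """Character-overlap F1 between a candidate span and the gold answer."""
    common = Counter(candidate) & Counter(answer)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(answer)
    return 2 * precision * recall / (precision + recall)

def fuzzy_locate(context: str, answer: str, threshold: float = 0.8):
    """Slide a window of len(answer) over the context and keep the span with
    the highest F1 against the answer, if it clears the threshold."""
    n = len(answer)
    best_start, best_f1 = -1, 0.0
    for start in range(max(1, len(context) - n + 1)):
        f1 = char_f1(context[start:start + n], answer)
        if f1 > best_f1:
            best_start, best_f1 = start, f1
    if best_f1 >= threshold:
        return best_start, context[best_start:best_start + n]
    return None  # below the threshold: discard the example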

Speeds, Sizes, Times

More information needed

Evaluation

Testing Data, Factors & Metrics

Testing Data

More information needed

Factors

More information needed

Metrics

More information needed

Results

  • The re-trained models released by this repo bring large improvements on reading comprehension / classification tasks
    (several users have already reached top-5 in competitions such as Dureader-2021 😁)

| Model / Dataset | Dureader-2021 F1-score (dev / leaderboard A) | tencentmedical Accuracy (test-1) |
| --- | --- | --- |
| macbert-large (HIT pretrained LM) | 65.49 / 64.27 | 82.5 |
| roberta-wwm-ext-large (HIT pretrained LM) | 65.49 / 64.27 | 82.5 |
| macbert-large (ours) | 70.45 / 68.13 | 83.4 |
| roberta-wwm-ext-large (ours) | 68.91 / 66.91 | 83.1 |

Model Examination

More information needed

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: More information needed
  • Hours used: More information needed
  • Cloud Provider: More information needed
  • Compute Region: More information needed
  • Carbon Emitted: More information needed

Technical Specifications [optional]

Model Architecture and Objective

More information needed

Compute Infrastructure

More information needed

Hardware

More information needed

Software

More information needed

Citation

BibTeX:

More information needed

Glossary [optional]

More information needed

More Information [optional]

The model authors also note in the GitHub Repo:

The code ran end-to-end before it was uploaded. There are not many files, so errors are most likely caused by an incorrect code path, a missing package, or similar; work through them step by step, or open an issue.

Model Card Authors [optional]

Luhua-rain in collaboration with Ezi Ozoani and the Hugging Face team

Model Card Contact

The model authors also note in the GitHub Repo:

Collaboration: for the related training data, models trained on more data, or teaming up for competitions, contact the author by email (luhua98@foxmail.com)~

How to Get Started with the Model

Use the code below to get started with the model.

# ----- Usage -----
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, BertTokenizer

model_name = "chinese_pretrain_mrc_roberta_wwm_ext_large"  # or "chinese_pretrain_mrc_macbert_large"

# Use via Transformers and the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(f"luhua/{model_name}")
model = AutoModelForQuestionAnswering.from_pretrained(f"luhua/{model_name}")

# Use locally (download the model and config files from https://huggingface.co/luhua)
tokenizer = BertTokenizer.from_pretrained(f"./{model_name}")
model = AutoModelForQuestionAnswering.from_pretrained(f"./{model_name}")
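
Once loaded, the model can be wrapped in a standard question-answering pipeline. A quick sanity-check sketch; the question and context below are made-up examples:

from transformers import pipeline

qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
result = qa(question="小明住在哪个城市?", context="小明和他的家人住在北京,他在一家书店工作。")
print(result)  # e.g. {'answer': '北京', 'score': ..., 'start': ..., 'end': ...}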