Commit f932297 (trueto)
1 Parent(s): 3a99bfc

update from trueto

Files changed:
- README.md +38 -0
- config.json +27 -0
- pytorch_model.bin +3 -0
- vocab.txt +0 -0
README.md
ADDED
@@ -0,0 +1,38 @@
# [medbert](https://github.com/trueto/medbert)
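This project open-sources the models from the master's thesis "BERT模型在中文临床自然语言处理中的应用探索与研究" (Exploration and Research on the Application of BERT Models in Chinese Clinical Natural Language Processing).

## Evaluation benchmarks
Four datasets were built: the Chinese electronic medical record named entity recognition dataset (CEMRNER), the Chinese medical text named entity recognition dataset (CMTNER), the Chinese medical question-question pair recognition dataset (CMedQQ), and the Chinese clinical text classification dataset (CCTC).

| **Dataset** | **Training set** | **Validation set** | **Test set** | **Task type** | **Corpus source** |
| ---- | ---- | ---- | ---- | ---- | :----: |
| CEMRNER | 965 | 138 | 276 | Named entity recognition | Yidu Cloud (医渡云) |
| CMTNER | 14000 | 2000 | 4000 | Named entity recognition | CHIP2020 |
| CMedQQ | 14000 | 2000 | 4000 | Sentence-pair recognition | Ping An Healthcare (平安医疗) |
| CCTC | 26837 | 3834 | 7669 | Sentence classification | CHIP2019 |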
15 |
+
|
16 |
+
## 开源模型
|
17 |
+
在6.5亿字符中文临床自然语言文本语料上基于BERT模型和Albert模型预训练获得了MedBERT和MedAlbert模型。
|
18 |
+
|
19 |
+
## 性能表现
|
20 |
+
在同等实验环境,相同训练参数和脚本下,各模型的性能表现
|
21 |
+
|
22 |
+
| **模型** | **CEMRNER** | **CMTNER** | **CMedQQ** | **CCTC** |
|
23 |
+
| :---- | :----: | :----: | :----: | :----: |
|
24 |
+
| [BERT](https://huggingface.co/bert-base-chinese) | 81.17% | 65.67% | 87.77% | 81.62% |
|
25 |
+
| [MC-BERT](https://github.com/alibaba-research/ChineseBLUE) | 80.93% | 66.15% | 89.04% | 80.65% |
|
26 |
+
| [PCL-BERT](https://code.ihub.org.cn/projects/1775) | 81.58% | 67.02% | 88.81% | 80.27% |
|
27 |
+
| MedBERT | 82.29% | 66.49% | 88.32% | **81.77%** |
|
28 |
+
|MedBERT-wwm| **82.60%** | 67.11% | 88.02% | 81.72% |
|
29 |
+
|MedBERT-kd | 82.58% | **67.27%** | **89.34%** | 80.73% |
|
30 |
+
|- | - | - | - | - |
|
31 |
+
| [Albert](https://huggingface.co/voidful/albert_chinese_base) | 79.98% | 62.42% | 86.81% | 79.83% |
|
32 |
+
| MedAlbert | 81.03% | 63.81% | 87.56% | 80.05% |
|
33 |
+
|MedAlbert-wwm| **81.28%** | **64.12%** | **87.71%** | **80.46%** |
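
To make the comparison in the BERT group easier to read, the sketch below recomputes per-task differences against the BERT baseline and the best model per task, using only the scores from the table. The helper names `delta_vs` and `best_per_task` are hypothetical and exist only for this illustration; the table does not name the metric, so the values are treated simply as percentages.

```python
# Scores (%) copied from the BERT group of the performance table.
TASKS = ("CEMRNER", "CMTNER", "CMedQQ", "CCTC")
SCORES = {
    "BERT": (81.17, 65.67, 87.77, 81.62),
    "MC-BERT": (80.93, 66.15, 89.04, 80.65),
    "PCL-BERT": (81.58, 67.02, 88.81, 80.27),
    "MedBERT": (82.29, 66.49, 88.32, 81.77),
    "MedBERT-wwm": (82.60, 67.11, 88.02, 81.72),
    "MedBERT-kd": (82.58, 67.27, 89.34, 80.73),
}


def delta_vs(baseline, model):
    """Percentage-point difference of `model` over `baseline`, per task."""
    pairs = zip(TASKS, SCORES[baseline], SCORES[model])
    return {task: round(m - b, 2) for task, b, m in pairs}


def best_per_task():
    """Name of the highest-scoring model for each task."""
    return {
        task: max(SCORES, key=lambda name: SCORES[name][i])
        for i, task in enumerate(TASKS)
    }


if __name__ == "__main__":
    for name in ("MedBERT", "MedBERT-wwm", "MedBERT-kd"):
        print(name, delta_vs("BERT", name))
    print(best_per_task())
```

The result of `best_per_task()` agrees with the bold entries in the BERT group. Note that `delta_vs("BERT", "MedBERT-kd")` is negative on CCTC, so the gains are not uniform across tasks.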

## Citation
```
杨飞洪,王序文,李姣.BERT模型在中文临床自然语言处理中的应用探索与研究[EB/OL].https://github.com/trueto/medbert, 2021-03.
```
config.json
ADDED
@@ -0,0 +1,27 @@
{
  "_name_or_path": "/home/yfh/bertology_models/bert-base-chinese/",
  "architectures": [
    "BertForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 21128
}
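
As a rough sanity check on this configuration, the sketch below estimates the number of weights in a BERT encoder from the size fields alone. It assumes the standard BERT layout (token, position, and segment embeddings; per-layer self-attention and feed-forward blocks, each followed by layer normalization; a pooler) and leaves out the pre-training heads. The function name `bert_encoder_params` and the trimmed `CONFIG_JSON` string are illustrative only.

```python
import json

# Only the size-related fields of the config.json above are needed here.
CONFIG_JSON = """
{
  "hidden_size": 768,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 21128
}
"""


def bert_encoder_params(cfg):
    """Estimate encoder weights, assuming the standard BERT layout."""
    h, inter = cfg["hidden_size"], cfg["intermediate_size"]
    layer_norm = 2 * h  # scale and bias
    rows = cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    embeddings = rows * h + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (h * h + h) + layer_norm
    feed_forward = (h * inter + inter) + (inter * h + h) + layer_norm
    pooler = h * h + h
    return embeddings + cfg["num_hidden_layers"] * (attention + feed_forward) + pooler


if __name__ == "__main__":
    cfg = json.loads(CONFIG_JSON)
    n = bert_encoder_params(cfg)
    print("head size:", cfg["hidden_size"] // cfg["num_attention_heads"])
    print("parameters:", n)
    print("bytes at 4 bytes per parameter (fp32 assumed):", 4 * n)
```

Under these assumptions the estimate is 102,267,648 parameters, or 409,070,592 bytes in 32-bit floats, which is slightly below the 409166089 bytes recorded for `pytorch_model.bin` in this commit. The match is only suggestive, since the checkpoint's actual contents and precision are not shown here.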
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6dc11cb223ddc6c810554709c2ae99540864749ef7fdfa49e4824c77b2c0618d
size 409166089
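
The three lines above are a Git LFS pointer, not the weights themselves: they record the hash algorithm and digest (`oid`) and the byte size of the real file. The sketch below shows one way a downloaded file could be checked against such a pointer using only the Python standard library; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of Git LFS or of this project.

```python
import hashlib

# Pointer text copied from the pytorch_model.bin entry of this commit.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:6dc11cb223ddc6c810554709c2ae99540864749ef7fdfa49e4824c77b2c0618d
size 409166089
"""


def parse_lfs_pointer(text):
    """Split 'key value' lines into a dict with the oid broken into parts."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and digest."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large checkpoints do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


if __name__ == "__main__":
    info = parse_lfs_pointer(POINTER)
    print(info["algo"], info["size"], "bytes")
```

A call such as `matches_pointer("pytorch_model.bin", parse_lfs_pointer(POINTER))` would then confirm that a local download is complete and unmodified; the local path here is an assumption.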
vocab.txt
ADDED
The diff for this file is too large to render.