Commit 280fc36 by trueto
1 Parent(s): d4688db

update from trueto

Files changed (4)
  1. README.md +38 -0
  2. config.json +27 -0
  3. pytorch_model.bin +3 -0
  4. vocab.txt +0 -0
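README.md ADDED
@@ -0,0 +1,38 @@
+ # [medbert](https://github.com/trueto/medbert)
+ This project open-sources the models from the master's thesis "Exploration and Research on the Application of BERT Models in Chinese Clinical Natural Language Processing" (BERT模型在中文临床自然语言处理中的应用探索与研究).
+
+ ## Evaluation benchmarks
+ The following datasets were constructed: the Chinese Electronic Medical Record Named Entity Recognition dataset (CEMRNER), the Chinese Medical Text Named Entity Recognition dataset (CMTNER),
+
+ the Chinese Medical Question-Question Identification dataset (CMedQQ), and the Chinese Clinical Text Classification dataset (CCTC).
+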
+ | **Dataset** | **Training set** | **Validation set** | **Test set** | **Task type** | **Corpus source** |
+ | ---- | ---- | ---- |---- |---- |:----:|
+ | CEMRNER | 965 | 138 | 276 | Named entity recognition | Yidu Cloud (医渡云) |
+ | CMTNER | 14000 | 2000 | 4000 | Named entity recognition | CHIP2020 |
+ | CMedQQ | 14000 | 2000 | 4000 | Sentence-pair identification | Ping An Medical (平安医疗) |
+ | CCTC | 26837 | 3834 | 7669 | Sentence classification | CHIP2019 |
+
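The split sizes in the table above can be checked with a little arithmetic. The sketch below is not part of the commit; it only recomputes each split's share of its dataset from the numbers in the table. The names `SPLITS` and `split_shares` are made up for this illustration.

```python
# Split sizes copied from the README table: (training, validation, test).
SPLITS = {
    "CEMRNER": (965, 138, 276),
    "CMTNER": (14000, 2000, 4000),
    "CMedQQ": (14000, 2000, 4000),
    "CCTC": (26837, 3834, 7669),
}


def split_shares(train, dev, test):
    """Return each split's share of the whole dataset, in percent (1 decimal)."""
    total = train + dev + test
    return tuple(round(100 * n / total, 1) for n in (train, dev, test))


# Hypothetical helper table: dataset name -> (train %, validation %, test %).
SHARES = {name: split_shares(*sizes) for name, sizes in SPLITS.items()}

for name, shares in SHARES.items():
    print(name, sum(SPLITS[name]), shares)
```

Rounded to one decimal place, every dataset comes out as a 70/10/20 split, which suggests (but does not prove) that all four were partitioned at a 7:1:2 ratio.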
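+ ## Open-source models
+ MedBERT and MedAlbert were obtained by pre-training the BERT and Albert models on a corpus of Chinese clinical natural-language text containing 650 million characters.
+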
+ ## Performance
+ Performance of each model in the same experimental environment, with the same training parameters and scripts:
+
+ | **Model** | **CEMRNER** | **CMTNER** | **CMedQQ** | **CCTC** |
+ | :---- | :----: | :----: | :----: | :----: |
+ | [BERT](https://huggingface.co/bert-base-chinese) | 81.17% | 65.67% | 87.77% | 81.62% |
+ | [MC-BERT](https://github.com/alibaba-research/ChineseBLUE) | 80.93% | 66.15% | 89.04% | 80.65% |
+ | [PCL-BERT](https://code.ihub.org.cn/projects/1775) | 81.58% | 67.02% | 88.81% | 80.27% |
+ | MedBERT | 82.29% | 66.49% | 88.32% | **81.77%** |
+ |MedBERT-wwm| **82.60%** | 67.11% | 88.02% | 81.72% |
+ |MedBERT-kd | 82.58% | **67.27%** | **89.34%** | 80.73% |
+ |- | - | - | - | - |
+ | [Albert](https://huggingface.co/voidful/albert_chinese_base) | 79.98% | 62.42% | 86.81% | 79.83% |
+ | MedAlbert | 81.03% | 63.81% | 87.56% | 80.05% |
+ |MedAlbert-wwm| **81.28%** | **64.12%** | **87.71%** | **80.46%** |
+
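To read the table above as gains over a baseline, the sketch below subtracts the general-domain model's score from each medical model's score, in percentage points. It is not part of the commit. Pairing each model with the first row of its table group (BERT or Albert) is an assumption about how the table is meant to be read, and the README does not say which metric the percentages report. The names `SCORES`, `BASELINE`, and `gains` are made up for this illustration.

```python
TASKS = ("CEMRNER", "CMTNER", "CMedQQ", "CCTC")

# Scores in percent, copied from the README table.
SCORES = {
    "BERT": (81.17, 65.67, 87.77, 81.62),
    "MedBERT": (82.29, 66.49, 88.32, 81.77),
    "MedBERT-wwm": (82.60, 67.11, 88.02, 81.72),
    "MedBERT-kd": (82.58, 67.27, 89.34, 80.73),
    "Albert": (79.98, 62.42, 86.81, 79.83),
    "MedAlbert": (81.03, 63.81, 87.56, 80.05),
    "MedAlbert-wwm": (81.28, 64.12, 87.71, 80.46),
}

# Assumed pairing: each medical model against the general-domain model
# that heads its group in the table.
BASELINE = {
    "MedBERT": "BERT",
    "MedBERT-wwm": "BERT",
    "MedBERT-kd": "BERT",
    "MedAlbert": "Albert",
    "MedAlbert-wwm": "Albert",
}


def gains(model):
    """Return the per-task difference to the baseline, in percentage points."""
    base = SCORES[BASELINE[model]]
    return {
        task: round(score - ref, 2)
        for task, score, ref in zip(TASKS, SCORES[model], base)
    }


for model in BASELINE:
    print(model, gains(model))
```

On these numbers the gains are not uniform: MedBERT-kd, for example, is 1.57 points above BERT on CMedQQ but 0.89 points below it on CCTC.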
+ ## Citation
+ ```
+ Yang Feihong, Wang Xuwen, Li Jiao. Exploration and Research on the Application of BERT Models in Chinese Clinical Natural Language Processing [EB/OL]. https://github.com/trueto/medbert, 2021-03.
+ ```
config.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "attention_probs_dropout_prob": 0,
+   "bos_token_id": 2,
+   "classifier_dropout_prob": 0.1,
+   "down_scale_factor": 1,
+   "embedding_size": 128,
+   "eos_token_id": 3,
+   "gap_size": 0,
+   "hidden_act": "relu",
+   "hidden_dropout_prob": 0,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "inner_group_num": 1,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "layers_to_keep": [],
+   "max_position_embeddings": 512,
+   "model_type": "albert",
+   "net_structure_type": 0,
+   "num_attention_heads": 12,
+   "num_hidden_groups": 1,
+   "num_hidden_layers": 12,
+   "num_memory_blocks": 0,
+   "pad_token_id": 0,
+   "type_vocab_size": 2,
+   "vocab_size": 21128
+ }
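The config above describes an ALBERT-style encoder: a small factorized embedding (`embedding_size` 128 projected up to `hidden_size` 768) and a single group of layer weights reused across all 12 layers. The sketch below is not part of the commit. It estimates the encoder's parameter count from those fields, assuming the standard ALBERT layout (embeddings with LayerNorm, one projection, one shared attention plus feed-forward block, and a pooler) and 4-byte float32 weights. The function name `count_params` is made up for this illustration.

```python
import json

# Size-relevant fields copied from config.json.
CONFIG_JSON = """
{
  "embedding_size": 128,
  "hidden_size": 768,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 21128
}
"""


def count_params(cfg, shared=True):
    """Estimate encoder parameters, assuming the standard ALBERT layout."""
    v, e, h = cfg["vocab_size"], cfg["embedding_size"], cfg["hidden_size"]
    i = cfg["intermediate_size"]
    p, t = cfg["max_position_embeddings"], cfg["type_vocab_size"]

    embeddings = v * e + p * e + t * e + 2 * e   # word, position, type, LayerNorm
    projection = e * h + h                       # embedding_size -> hidden_size
    attention = 4 * (h * h + h) + 2 * h          # Q, K, V, output + LayerNorm
    ffn = (h * i + i) + (i * h + h) + 2 * h      # two dense layers + LayerNorm
    pooler = h * h + h

    # With sharing, only one copy per group is stored; without it, one per layer.
    if shared:
        copies = cfg["num_hidden_groups"] * cfg["inner_group_num"]
    else:
        copies = cfg["num_hidden_layers"]
    return embeddings + projection + copies * (attention + ffn) + pooler


CFG = json.loads(CONFIG_JSON)
SHARED = count_params(CFG, shared=True)
UNSHARED = count_params(CFG, shared=False)
print(f"shared layers:   {SHARED:,} parameters, {SHARED * 4:,} bytes as float32")
print(f"unshared layers: {UNSHARED:,} parameters")
```

Under these assumptions the shared-weight encoder has about 10.5 million parameters, or roughly 42.2 MB as float32. That is slightly below the 42,695,704 bytes recorded for `pytorch_model.bin` in this commit; the difference could be a pre-training head and serialization overhead, but that is a guess rather than something the commit states.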
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a71a754d798b68323aa5c0f1c72e12b80945a8ca15bb66ac5706bc01d0a6430b
+ size 42695704
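The three lines above are a Git LFS pointer, not the weights themselves: the pointer records the SHA-256 digest and byte size of the real file. The sketch below is not part of the commit. It parses the pointer and checks a downloaded file against it using only the standard library. The function names `parse_lfs_pointer` and `matches_pointer` are made up for this illustration.

```python
import hashlib

# Pointer contents copied from the diff.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:a71a754d798b68323aa5c0f1c72e12b80945a8ca15bb66ac5706bc01d0a6430b
size 42695704
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and digest."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as fh:
        # Read in chunks so large weight files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


POINTER = parse_lfs_pointer(POINTER_TEXT)
print(POINTER["algo"], POINTER["size"])
```

After downloading the weights, `matches_pointer("pytorch_model.bin", POINTER)` should return `True` for an intact copy; the local path is whatever location the file was saved to.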
vocab.txt ADDED
The diff for this file is too large to render.
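Together, `config.json`, `pytorch_model.bin`, and `vocab.txt` are what the Hugging Face `transformers` library needs to load a checkpoint from a local directory. The sketch below is not part of the commit and is only one plausible way to use these files. It assumes that `transformers` and `torch` are installed and that the three files were downloaded into one directory. Because the config declares `"model_type": "albert"` while the vocabulary is a WordPiece-style `vocab.txt`, it pairs `AlbertModel` with `BertTokenizer`. The commit does not say which of the README's models this checkpoint is, and the function names and the directory path are hypothetical.

```python
def load_medalbert(model_dir):
    """Load tokenizer and encoder from a local directory (hypothetical helper).

    `model_dir` is assumed to contain config.json, pytorch_model.bin and vocab.txt.
    """
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AlbertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_dir)
    model = AlbertModel.from_pretrained(model_dir)
    model.eval()  # inference mode
    return tokenizer, model


def embed(text, tokenizer, model):
    """Return the hidden state at the [CLS] position for one piece of text."""
    import torch

    # 512 matches max_position_embeddings in config.json.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0]
```

A call such as `tokenizer, model = load_medalbert("./medalbert")` followed by `embed("患者主诉头痛三天", tokenizer, model)` would then yield one vector of length `hidden_size` (768 in this config); the path and the sample sentence are placeholders.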