---

license: apache-2.0
language: zh
tags:
- Token Classification
metrics:
- precision
- recall
- f1
- accuracy
---


## Model description

This model is a fine-tuned version of MacBERT for spell checking in medical application scenarios. We fine-tuned the Chinese MacBERT base model on a 300M dataset drawn from 60K+ authorized medical articles. To construct training data, we randomly corrupted 30% of the sentences in these articles by replacing characters with visually or phonologically similar ones. The fine-tuned model achieves 96% accuracy on our test set.
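
For illustration, the corruption step can be sketched as below. This is a minimal sketch under our own assumptions: the `CONFUSION` table and the `corrupt` helper are hypothetical names, and the actual confusion sets and sampling procedure used for training are not published with this model.

```python
import random

# Hypothetical confusion table: each character maps to visually or
# phonologically similar characters. The real confusion sets used for
# training are not released.
CONFUSION = {
    "硝": ["肖", "销"],  # same pronunciation (xiao)
    "酯": ["脂"],        # similar appearance and sound
}

def corrupt(sentence: str, prob: float = 0.3) -> str:
    """With probability `prob`, swap one character of `sentence` for a
    visually or phonologically similar one from the confusion table."""
    if random.random() >= prob:
        return sentence
    candidates = [i for i, ch in enumerate(sentence) if ch in CONFUSION]
    if not candidates:
        return sentence
    i = random.choice(candidates)
    return sentence[:i] + random.choice(CONFUSION[sentence[i]]) + sentence[i + 1:]
```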

## Intended uses & limitations

You can use this model directly with a pipeline for token classification:
```python
>>> from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

>>> hub_model_id = "9pinus/macbert-base-chinese-medical-collation"

>>> model = AutoModelForTokenClassification.from_pretrained(hub_model_id)
>>> tokenizer = AutoTokenizer.from_pretrained(hub_model_id)
>>> classifier = pipeline('ner', model=model, tokenizer=tokenizer)

>>> # The input contains a typo: '肖' in '甲肖唑' should be '硝' (甲硝唑, metronidazole).
>>> result = classifier("如果病情较重,可适当口服甲肖唑片、环酯红霉素片等药物进行抗感染镇痛。")
>>> for item in result:
...     if item['entity'] == 1:
...         print(item)
{'entity': 1, 'score': 0.58127016, 'index': 14, 'word': '肖', 'start': 13, 'end': 14}
```
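
The pipeline only flags suspicious characters; turning its output into a marked-up string is left to the caller. Below is a minimal post-processing sketch; `mark_suspects` and the 0.5 score threshold are our own illustrative choices, not part of the model's API.

```python
def mark_suspects(text: str, predictions: list, threshold: float = 0.5) -> str:
    """Bracket every character the pipeline labels as misspelled
    (entity 1) with a score above `threshold`."""
    flagged = {p['start'] for p in predictions
               if p['entity'] == 1 and p['score'] >= threshold}
    return ''.join(f'[{ch}]' if i in flagged else ch
                   for i, ch in enumerate(text))

# Example: mark_suspects(sentence, result) brackets the flagged
# character, e.g. '...口服甲[肖]唑片...'.
```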

### Framework versions

- Transformers 4.15.0
- Pytorch 1.10.1+cu113
- Datasets 1.17.0
- Tokenizers 0.10.3