Qishuai committed on
Commit
0bed7e4
1 Parent(s): 5b22bdb

Create README.md

  1. README.md +35 -0
README.md ADDED
@@ -0,0 +1,35 @@
+ # Punctuator for Simplified Chinese
+
+ This model is a fine-tuned `DistilBertForTokenClassification` that adds punctuation to plain text in Simplified Chinese. It is fine-tuned from a distilled version of `bert-base-chinese`.
+
+ ## Usage
+
+ ```python
+ from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast
+
+ # Load the fine-tuned punctuation model and its matching tokenizer from the Hugging Face Hub
+ model = DistilBertForTokenClassification.from_pretrained("Qishuai/distilbert_punctuator_zh")
+ tokenizer = DistilBertTokenizerFast.from_pretrained("Qishuai/distilbert_punctuator_zh")
+ ```
+
+ ## Model Overview
+
+ ### Training data
+ The training data includes the following dataset:
+
+ - News articles from People's Daily, 2014. [Reference](https://github.com/InsaneLife/ChineseNLPCorpus)
+
+ ### Model Performance
+ - Validated on the MSRA training dataset. [Reference](https://github.com/InsaneLife/ChineseNLPCorpus/tree/master/NER/MSRA)
+ - Metrics report:
+ | | precision | recall | f1-score | support |
+ |:----------------:|:---------:|:------:|:--------:|:-------:|
+ | C_COMMA | 0.67 | 0.59 | 0.63 | 91566 |
+ | C_DUNHAO | 0.50 | 0.37 | 0.42 | 21013 |
+ | C_EXLAMATIONMARK | 0.23 | 0.06 | 0.09 | 399 |
+ | C_PERIOD | 0.84 | 0.99 | 0.91 | 44258 |
+ | C_QUESTIONMARK | 0.00 | 1.00 | 0.00 | 0 |
+ | micro avg | 0.71 | 0.67 | 0.69 | 157236 |
+ | macro avg | 0.45 | 0.60 | 0.41 | 157236 |
+ | weighted avg | 0.69 | 0.67 | 0.68 | 157236 |
+
+
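
The Usage snippet in this README loads the model and tokenizer but stops before inference. The following is a minimal sketch of how per-character predictions might be turned back into punctuated text; it is not part of the commit. It rests on several assumptions: that the checkpoint's `id2label` uses the label names shown in the metrics table, that any other label (for example a hypothetical `O`) means "no punctuation", that each label refers to the mark following its character, and that the label-to-mark mapping below is the intended one. `apply_labels` and `predict_labels` are illustrative helper names, not functions provided by the model repository.

```python
from typing import Dict, List

# Assumed mapping from the label names in the metrics table to punctuation marks.
# Any label not listed here (e.g. a hypothetical "O") is treated as "no punctuation".
LABEL_TO_MARK: Dict[str, str] = {
    "C_COMMA": "，",
    "C_DUNHAO": "、",
    "C_EXLAMATIONMARK": "！",
    "C_PERIOD": "。",
    "C_QUESTIONMARK": "？",
}


def apply_labels(chars: List[str], labels: List[str]) -> str:
    """Append the mark for each character's label; unknown labels add nothing."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    return "".join(ch + LABEL_TO_MARK.get(lab, "") for ch, lab in zip(chars, labels))


def predict_labels(text: str, model_name: str = "Qishuai/distilbert_punctuator_zh") -> List[str]:
    """Run the checkpoint on unpunctuated text and return one label per character."""
    # Imported lazily so apply_labels works without torch/transformers installed.
    import torch
    from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast

    tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
    model = DistilBertForTokenClassification.from_pretrained(model_name)
    model.eval()

    # Treat every character as one "word" so predictions can be aligned via word_ids().
    enc = tokenizer(list(text), is_split_into_words=True, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits[0]
    pred_ids = logits.argmax(dim=-1).tolist()

    # Characters that produce no token (or fall past the truncation limit) keep "O".
    labels = ["O"] * len(text)
    for pos, word_idx in enumerate(enc.word_ids(0)):
        if word_idx is not None:
            labels[word_idx] = model.config.id2label[pred_ids[pos]]
    return labels


# Hand-written labels, so this line runs without downloading the model.
print(apply_labels(list("你好世界"), ["O", "C_COMMA", "O", "C_PERIOD"]))  # 你好，世界。
```

With the model available, `apply_labels(list(text), predict_labels(text))` would combine the two steps. Keeping the merge step separate from inference makes it easy to check the label mapping on its own before trusting the model's output.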