# How the model was trained

This model is based on BERTurk: https://huggingface.co/dbmdz/bert-base-turkish-cased

## Dataset

The training dataset is WikiANN.

* The WikiANN dataset (Pan et al., 2017) provides NER annotations for PER, ORG and LOC. It was constructed from the linked entities in Wikipedia pages for 282 different languages, including Turkish.

Paper: https://www.aclweb.org/anthology/P17-1178.pdf
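
If you just want to look at the data, it can also be loaded through the Hugging Face datasets library; a minimal sketch, assuming the hub dataset id "wikiann" with the "tr" configuration (this is not part of the original training recipe, which downloads pre-split files below):

```
# Hypothetical sketch: peek at the Turkish split of WikiANN via the `datasets` library.
# The dataset id "wikiann" and config "tr" are assumptions about the public hub,
# not the files used for training below.
from datasets import load_dataset

wikiann_tr = load_dataset("wikiann", "tr")       # splits: train / validation / test
print(wikiann_tr["train"][0]["tokens"])          # word-level tokens
print(wikiann_tr["train"][0]["ner_tags"])        # integer tags covering PER/ORG/LOC spans
```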

Thanks to @stefan-it, I downloaded the data as follows:

```
mkdir tr-data
cd tr-data

for file in train.txt dev.txt test.txt labels.txt
do
  wget https://schweter.eu/storage/turkish-bert-wikiann/$file
done
```
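
Before fine-tuning, it can be worth sanity-checking the downloaded files. A minimal sketch, assuming the usual one-token-per-line "token label" format with blank lines between sentences (an assumption about these files, not something stated in the original recipe):

```
# Hypothetical sanity check: count label frequencies in the training file.
# Assumes one "token label" pair per line and blank lines between sentences.
from collections import Counter

label_counts = Counter()
with open("tr-data/train.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:                                  # skip sentence separators
            _, label = line.rsplit(" ", 1)
            label_counts[label] += 1

print(label_counts)                               # counts for O, B-PER, I-PER, B-ORG, ...
```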

## Fine-tuning the BERT model

The base BERT model is dbmdz/bert-base-turkish-cased. Set the following environment variables:

```
export MAX_LENGTH=128
export BERT_MODEL=dbmdz/bert-base-turkish-cased
export OUTPUT_DIR=tr-new-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=625
export SEED=1
```

Then I ran the following NER training command (run_ner.py can be found in the Transformers GitHub repository):

```
python3 run_ner.py --data_dir ./tr-data \
--model_type bert \
--labels ./tr-data/labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR-$SEED \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict \
--fp16
```

If you don't have a GPU-enabled machine, drop the last --fp16 flag.
Finally, you can find your trained model and its performance metrics under the tr-new-model-1 folder.
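
After training, the checkpoint can be loaded straight from that output directory for a quick smoke test. A minimal sketch, assuming the directory name tr-new-model-1 produced by OUTPUT_DIR=tr-new-model and SEED=1 above:

```
# Minimal sketch: load the fine-tuned checkpoint from the local output directory.
# "tr-new-model-1" follows from OUTPUT_DIR=tr-new-model and SEED=1 used above.
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("tr-new-model-1")
tokenizer = AutoTokenizer.from_pretrained("tr-new-model-1")
ner = pipeline("ner", model=model, tokenizer=tokenizer)
print(ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı."))
```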

## Some Results

### Performance for the WikiANN dataset

```
cat tr-new-model-1/eval_results.txt
cat tr-new-model-1/test_results.txt

*Eval Results:*

precision = 0.916400580551524
recall = 0.9342309684101502
f1 = 0.9252298787412536
loss = 0.11335893666411284

*Test Results:*

precision = 0.9192058759362955
recall = 0.9303010230367262
f1 = 0.9247201697271198
loss = 0.11182546521618497
```
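
These precision/recall/F1 numbers are the entity-level scores written out by the training script's evaluation step; to my understanding it relies on the seqeval package, so a roughly equivalent computation looks like the sketch below (the tag sequences here are made-up placeholders, not real model output):

```
# Illustrative only: entity-level metrics in the style reported above.
# The tag sequences are invented placeholders for demonstration.
from seqeval.metrics import precision_score, recall_score, f1_score

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]

print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
```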

### Performance with another dataset, available at the link below

https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt

```
savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat eval_results.txt
precision = 0.9461980692049029
recall = 0.959309358847465
f1 = 0.9527086063783312
loss = 0.037054269206847804

savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat test_results.txt
precision = 0.9458370635631155
recall = 0.9588201928530913
f1 = 0.952284378344882
loss = 0.035431676572445225
```

# Usage

You should install the transformers library first:

```
pip install transformers
```

Then open a Python environment and run the following code:

```
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")
ner = pipeline('ner', model=model, tokenizer=tokenizer)
ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")

# Expected output:
# [{'word': 'Mustafa', 'score': 0.9938516616821289, 'entity': 'B-PER'}, {'word': 'Kemal', 'score': 0.9881671071052551, 'entity': 'I-PER'}, {'word': 'Atatürk', 'score': 0.9957979321479797, 'entity': 'I-PER'}, {'word': 'Samsun', 'score': 0.9059973359107971, 'entity': 'B-LOC'}]
```
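
The pipeline above returns one prediction per (sub)word token. If you want whole entities instead, the pipeline can merge the pieces; a small sketch, assuming a recent transformers version where the aggregation_strategy argument is available (older releases used grouped_entities=True):

```
# Hedged sketch: merge token-level predictions into whole entities.
# aggregation_strategy exists in recent transformers releases; on older versions
# the equivalent option was grouped_entities=True.
from transformers import pipeline

ner_grouped = pipeline('ner', model="savasy/bert-base-turkish-ner-cased", aggregation_strategy="simple")
print(ner_grouped("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı."))
# Expected shape: one dict per entity, e.g. entity_group 'PER' for "Mustafa Kemal Atatürk"
# and 'LOC' for "Samsun" (scores will vary).
```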