# How the model was trained

This model is based on BERTurk:
https://huggingface.co/dbmdz/bert-base-turkish-cased

## Dataset

The training dataset is WikiANN.

* The WikiANN dataset (Pan et al., 2017) provides NER annotations for PER, ORG, and LOC. It was constructed using the linked entities in Wikipedia pages for 282 different languages, including Turkish:

https://www.aclweb.org/anthology/P17-1178.pdf

Thanks to @stefan-it, I downloaded the data from the link as follows:

```
mkdir tr-data
cd tr-data

for file in train.txt dev.txt test.txt labels.txt
do
  wget https://schweter.eu/storage/turkish-bert-wikiann/$file
done
```
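
For reference, run_ner.py expects CoNLL-style files: one token and its tag per line, separated by a space, with a blank line between sentences. A hypothetical snippet (illustrative tokens, not taken from the actual WikiANN files) looks like this:

```
Mustafa B-PER
Kemal I-PER
Atatürk I-PER
Samsun B-LOC
'a O
gitti O
```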

## Fine-tuning the BERT model

The base BERT model is dbmdz/bert-base-turkish-cased. Set the following environment variables:

```
export MAX_LENGTH=128
export BERT_MODEL=dbmdz/bert-base-turkish-cased
export OUTPUT_DIR=tr-new-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=625
export SEED=1
```

Then I ran the following NER training script, run_ner.py (you can find it in the examples of the transformers GitHub repo):

```
python3 run_ner.py --data_dir ./tr-data \
--model_type bert \
--labels ./tr-data/labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR-$SEED \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict \
--fp16
```

If you don't have a GPU-enabled machine, skip the last --fp16 parameter.
Finally, you can find your trained model and its performance metrics under the tr-new-model-1 folder (i.e., $OUTPUT_DIR-$SEED).
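
As a quick sanity check, you can load the fine-tuned model back from the output directory (a minimal sketch; "tr-new-model-1" assumes OUTPUT_DIR=tr-new-model and SEED=1 as configured above):

```
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load the fine-tuned checkpoint written by run_ner.py
model = AutoModelForTokenClassification.from_pretrained("tr-new-model-1")
tokenizer = AutoTokenizer.from_pretrained("tr-new-model-1")
```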

## Some Results

### Performance for the WikiANN dataset

```
cat tr-new-model-1/eval_results.txt
cat tr-new-model-1/test_results.txt
```

*Eval results:*

```
precision = 0.916400580551524
recall = 0.9342309684101502
f1 = 0.9252298787412536
loss = 0.11335893666411284
```

*Test results:*

```
precision = 0.9192058759362955
recall = 0.9303010230367262
f1 = 0.9247201697271198
loss = 0.11182546521618497
```

### Performance on another dataset (linked below)

https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt

```
savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat eval_results.txt
precision = 0.9461980692049029
recall = 0.959309358847465
f1 = 0.9527086063783312
loss = 0.037054269206847804

savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat test_results.txt
precision = 0.9458370635631155
recall = 0.9588201928530913
f1 = 0.952284378344882
loss = 0.035431676572445225
```
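
For reference, the numbers above are entity-level metrics of the kind the transformers NER example computes with the seqeval package. A minimal sketch of how such scores are produced (the label sequences here are made up for illustration):

```
from seqeval.metrics import f1_score, precision_score, recall_score

# One list of tags per sentence; gold vs. predicted (illustrative only)
y_true = [["B-PER", "I-PER", "I-PER", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "B-LOC", "O"]]

print(precision_score(y_true, y_pred))  # entity-level precision
print(recall_score(y_true, y_pred))     # entity-level recall
print(f1_score(y_true, y_pred))         # entity-level F1
```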

# Usage

Install the transformers library first:

```
pip install transformers
```

Then open a Python environment and run the following code:

```
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")
ner = pipeline('ner', model=model, tokenizer=tokenizer)
ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")

# Output:
# [{'word': 'Mustafa', 'score': 0.9938516616821289, 'entity': 'B-PER'}, {'word': 'Kemal', 'score': 0.9881671071052551, 'entity': 'I-PER'}, {'word': 'Atatürk', 'score': 0.9957979321479797, 'entity': 'I-PER'}, {'word': 'Samsun', 'score': 0.9059973359107971, 'entity': 'B-LOC'}]
```
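
Note that the pipeline returns one prediction per (sub)word token. If you prefer one prediction per entity span, newer transformers versions support an aggregation option; a hedged sketch (aggregation_strategy is available in transformers 4.x, while older releases used grouped_entities=True instead — check your installed version):

```
from transformers import pipeline

# Merge consecutive tokens of the same entity into single spans.
# aggregation_strategy="simple" assumes a transformers 4.x install.
ner_grouped = pipeline(
    "ner",
    model="savasy/bert-base-turkish-ner-cased",
    aggregation_strategy="simple",
)
ner_grouped("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")
```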