---
language: ar
license: apache-2.0
datasets:
- AQMAR
- ANERcorp
thumbnail: https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/resolveuid/a6f82e0d7fa446a59c902cac4cafa9cb/@@images/image/preview
tags:
- flair
- Text Classification
- token-classification
- sequence-tagger-model
metrics:
- f1
widget:
- text: "لائحة «الوطنية للصحافة».. خطوة جديدة في طريق «الحصار»"
---
# Arabic NER Model using Flair Embeddings
Training was conducted over 94 epochs with a batch size of 32, using a learning rate that decayed linearly from 0.225 down to a floor of 2e-05, over stacked GloVe and Flair forward and backward embeddings.
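For reference, here is a minimal training sketch of this setup with the Flair API. The corpus path, column layout, and the `ar-forward`/`ar-backward` embedding IDs are illustrative assumptions, not the exact training script; the sketch uses the trainer's default annealing down to the stated 2e-05 floor.
```python
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# CoNLL-style column corpus built from AQMAR + ANERcorp (path and layout assumed)
corpus = ColumnCorpus('data/', {0: 'text', 1: 'ner'})
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

# GloVe word embeddings stacked with forward/backward Flair character LMs
embeddings = StackedEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('ar-forward'),   # stand-in ID for the forward Arabic LM
    FlairEmbeddings('ar-backward'),  # stand-in ID for the backward Arabic LM
])

tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type='ner')

trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/arabic-ner',
              learning_rate=0.225,       # initial learning rate
              min_learning_rate=2e-05,   # floor the rate decays down to
              mini_batch_size=32,
              max_epochs=94)
```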
## Original Datasets:
- [AQMAR](http://www.cs.cmu.edu/~ark/ArabicNER/)
- [ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp)
## Results:
- F1-score (micro) 0.8666
- F1-score (macro) 0.8488
| Class | True Positives | False Positives | False Negatives | Precision | Recall | class-F1 |
|------|-----|----|----|-----------|--------|----------|
| LOC | 539 | 51 | 68 | 0.9136 | 0.8880 | 0.9006 |
| MISC | 408 | 57 | 89 | 0.8774 | 0.8209 | 0.8482 |
| ORG | 167 | 43 | 64 | 0.7952 | 0.7229 | 0.7574 |
| PER | 501 | 65 | 60 | 0.8852 | 0.8930 | 0.8891 |
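The reported scores follow directly from the counts in the table above; a quick sanity check in plain Python (no assumptions beyond the table itself):
```python
# Recompute precision, recall and F1 from the TP/FP/FN counts above.
counts = {'LOC': (539, 51, 68), 'MISC': (408, 57, 89),
          'ORG': (167, 43, 64), 'PER': (501, 65, 60)}

f1_per_class = {}
for cls, (tp, fp, fn) in counts.items():
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_per_class[cls] = 2 * precision * recall / (precision + recall)
    print(f'{cls}: P={precision:.4f} R={recall:.4f} F1={f1_per_class[cls]:.4f}')

# micro-F1 pools all counts; macro-F1 averages the per-class F1 scores
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro_f1 = 2 * total_tp / (2 * total_tp + total_fp + total_fn)
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)
print(f'micro-F1 = {micro_f1:.4f}')  # 0.8666
print(f'macro-F1 = {macro_f1:.4f}')  # 0.8488
```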
---
# Usage
```python
from flair.data import Sentence
from flair.models import SequenceTagger
from icecream import ic

# load an English NER model for comparison and the Arabic multi-NER model
tagger = SequenceTagger.load('julien-c/flair-ner')
arTagger = SequenceTagger.load('megantosh/flair-arabic-multi-ner')

sentence = Sentence('George Washington went to Washington .')
arSentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية بالقاهرة .')

# predict NER tags
tagger.predict(sentence)
arTagger.predict(arSentence)

# print sentences with predicted tags
ic(sentence.to_tagged_string())
ic(arSentence.to_tagged_string())

# inspect recognized entity spans with their confidence scores
for entity in sentence.get_spans('ner'):
    ic(entity)
for entity in arSentence.get_spans('ner'):
    ic(entity)

# dictionary view of the predictions
ic(arSentence.to_dict(tag_type='ner'))
ic(sentence.to_dict(tag_type='ner'))
```
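`icecream`'s `ic()` is used here only for labelled debug output; plain `print()` calls work just as well.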
# Example
```bash
2021-07-07 14:30:59,649 loading file /Users/mega/.flair/models/flair-ner/f22eb997f66ae2eacad974121069abaefca5fe85fce71b49e527420ff45b9283.941c7c30b38aef8d8a4eb5c1b6dd7fe8583ff723fef457382589ad6a4e859cfc
2021-07-07 14:31:04,654 loading file /Users/mega/.flair/models/flair-arabic-multi-ner/c7af7ddef4fdcc681fcbe1f37719348afd2862b12aa1cfd4f3b93bd2d77282c7.242d030cb106124f7f9f6a88fb9af8e390f581d42eeca013367a86d585ee6dd6
ic| sentence.to_tagged_string(): 'George <B-PER> Washington <E-PER> went to Washington <S-LOC> .'
ic| arSentence.to_tagged_string(): 'عمرو <B-PER> عادلي <I-PER> أستاذ للاقتصاد السياسي المساعد في الجامعة <B-ORG> الأمريكية <I-ORG> بالقاهرة <B-LOC> .'
ic| entity: <PER-span (1,2): "George Washington">
ic| entity: <LOC-span (5): "Washington">
ic| entity: <PER-span (1,2): "عمرو عادلي">
ic| entity: <ORG-span (8,9): "الجامعة الأمريكية">
ic| entity: <LOC-span (10): "بالقاهرة">
ic| arSentence.to_dict(tag_type='ner'):
    {"text": "عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية بالقاهرة .",
     "labels": [],
     "entities": [{"text": "عمرو عادلي",
                   "start_pos": 0,
                   "end_pos": 10,
                   "labels": [PER (0.9826)]},
                  {"text": "الجامعة الأمريكية",
                   "start_pos": 45,
                   "end_pos": 62,
                   "labels": [ORG (0.7679)]},
                  {"text": "بالقاهرة",
                   "start_pos": 64,
                   "end_pos": 72,
                   "labels": [LOC (0.8079)]}]}
ic| sentence.to_dict(tag_type='ner'):
    {"text": "George Washington went to Washington .",
     "labels": [],
     "entities": [{"text": "George Washington",
                   "start_pos": 0,
                   "end_pos": 17,
                   "labels": [PER (0.9968)]},
                  {"text": "Washington",
                   "start_pos": 26,
                   "end_pos": 36,
                   "labels": [LOC (0.9994)]}]}
```
# Model Configuration
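The architecture below is the module tree PyTorch prints for the loaded tagger (e.g. via `print(arTagger)`):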
```python
SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('glove')
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
    (list_embedding_2): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4196, out_features=4196, bias=True)
  (rnn): LSTM(4196, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=15, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor): None
)
```
Due to some formatting errors, your code may render like [this screenshot](https://ibb.co/ky20Lnq) (from a run on 2020-10-27 12:05:47,801).
# Citation
*If you use this model in your work, please consider citing it:*
```latex
@unpublished{MMHU21,
author = "M. Megahed",
title = "Sequence Labeling Architectures in Diglossia",
note = "In preparation",
}
``` |