Arabic NER Model using Flair Embeddings

Training was conducted over 94 epochs, using a linear decaying learning rate of 2e-05, starting from 0.225 and a batch size of 32 with GloVe and Flair forward and backward embeddings.

Original Datasets:

Results:

  • F1-score (micro) 0.8666
  • F1-score (macro) 0.8488
True Posititves False Positives False Negatives Precision Recall class-F1
LOC 539 51 68 0.9136 0.8880 0.9006
MISC 408 57 89 0.8774 0.8209 0.8482
ORG 167 43 64 0.7952 0.7229 0.7574
PER 501 65 60 0.8852 0.8930 0.8891

Usage

from flair.data import Sentence
from flair.models import SequenceTagger
import pyarabic.araby as araby
from flair.tokenization import JapaneseTokenizer
from icecream import ic

tagger = SequenceTagger.load("julien-c/flair-ner")
arTagger = SequenceTagger.load('megantosh/flair-arabic-multi-ner')

sentence = Sentence('George Washington went to Washington .')
arSentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية  بالقاهرة .')


# predict NER tags
tagger.predict(sentence)
arTagger.predict(arSentence)

# print sentence with predicted tags
ic(sentence.to_tagged_string)
ic(arSentence.to_tagged_string)

Example

2021-07-07 14:30:59,649 loading file /Users/mega/.flair/models/flair-ner/f22eb997f66ae2eacad974121069abaefca5fe85fce71b49e527420ff45b9283.941c7c30b38aef8d8a4eb5c1b6dd7fe8583ff723fef457382589ad6a4e859cfc
2021-07-07 14:31:04,654 loading file /Users/mega/.flair/models/flair-arabic-multi-ner/c7af7ddef4fdcc681fcbe1f37719348afd2862b12aa1cfd4f3b93bd2d77282c7.242d030cb106124f7f9f6a88fb9af8e390f581d42eeca013367a86d585ee6dd6
ic| sentence.to_tagged_string: <bound method Sentence.to_tagged_string of Sentence: "George Washington went to Washington ."   [− Tokens: 6  − Token-Labels: "George <B-PER> Washington <E-PER> went to Washington <S-LOC> ."]>
ic| arSentence.to_tagged_string: <bound method Sentence.to_tagged_string of Sentence: "عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية بالقاهرة ."   [− Tokens: 11  − Token-Labels: "عمرو <B-PER> عادلي <I-PER> أستاذ للاقتصاد السياسي المساعد في الجامعة <B-ORG> الأمريكية <I-ORG> بالقاهرة <B-LOC> ."]>
ic| entity: <PER-span (1,2): "George Washington">
ic| entity: <LOC-span (5): "Washington">
ic| entity: <PER-span (1,2): "عمرو عادلي">
ic| entity: <ORG-span (8,9): "الجامعة الأمريكية">
ic| entity: <LOC-span (10): "بالقاهرة">
ic| sentence.to_dict(tag_type='ner'): 
{"text":"عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية  بالقاهرة .",
"labels":[],
{"entities":[{{{
               "text":"عمرو عادلي",
               "start_pos":0,
               "end_pos":10,
               "labels":[PER (0.9826)]},
            {"text":"الجامعة الأمريكية",
               "start_pos":45,
               "end_pos":62,
               "labels":[ORG (0.7679)]},
            {"text":"بالقاهرة",
               "start_pos":64,
               "end_pos":72,
               "labels":[LOC (0.8079)]}]}
"text":"George Washington went to Washington .",
"labels":[],
"entities":[{
           {"text":"George Washington",
            "start_pos":0,
            "end_pos":17,
            "labels":[PER (0.9968)]},
           {"text":"Washington""start_pos":26,
            "end_pos":36,
            "labels":[LOC (0.9994)]}}]}

Model Configuration

SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('glove')
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
    (list_embedding_2): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4196, out_features=4196, bias=True)
  (rnn): LSTM(4196, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=15, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor) None

Due to some formatting errors, your code might appear like this, attempted on 2020-10-27 12:05:47,801

Citation

if you use this model in your work, please consider citing this work:

@unpublished{MMHU21
author = "M. Megahed",
title = "Sequence Labeling Architectures in Diglossia",
note = "In preparation",
}
Downloads last month
47
Hosted inference API
Token Classification
This model can be loaded on the Inference API on-demand.