Bangla-Person-Name-Extractor

This repository contains the implementation of a Bangla person name extractor, a model that extracts person-name entities from a given sentence. We approached this as a token classification task, i.e. tagging each token as either part of a person's name or not. We finetuned the BanglaBERT model for this binary classification task using a custom-prepared dataset. The model is deployed on Hugging Face for easier access and use.

How to use it?

This notebook contains the inference template for running the model on a sentence.

You can also run inference directly with the following code snippet; just change the sentence.

from transformers import AutoTokenizer, AutoModelForTokenClassification  # pip install transformers==4.30.2
from normalizer import normalize  # pip install git+https://github.com/csebuetnlp/normalizer
import torch  # pip install torch
import numpy as np  # pip install numpy==1.23.5


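# Load the finetuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForTokenClassification.from_pretrained("MBMMurad/BanglaBERT_Person_Name_Extractor")
tokenizer = AutoTokenizer.from_pretrained("MBMMurad/BanglaBERT_Person_Name_Extractor")

def inference_fn(sentence):
    sentence = normalize(sentence)
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():  # inference only, so no gradients are needed
        outputs = model(inputs).logits
    # Pick the most likely class per token and drop the special tokens at both ends.
    predictions = torch.argmax(outputs[0], dim=1)[1:-1].numpy()
    idxs = np.where(predictions == 1)  # class 1 marks a person-name token

    return np.array(tokens)[idxs]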

sentence = "আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম।"
pred = inference_fn(sentence)
print(f"Input Sentence : {sentence}")
print(f"Person Name Entities : {pred}")

sentence = "ইঞ্জিনিয়ার্স ইনস্টিটিউশন চট্টগ্রামের সাবেক সভাপতি প্রকৌশলী দেলোয়ার হোসেন মজুমদার প্রথম আলোকে বলেন, 'সংকট নিরসনে বর্তমান খালগুলোকে পূর্ণ প্রবাহে ফিরিয়ে আনার পাশাপাশি নতুন তিনটি খাল খনন জরুরি।'"
pred = inference_fn(sentence)
print(f"Input Sentence : {sentence}")
print(f"Person Name Entities : {pred}")


sentence = "দলীয় নেতারা তাঁর বাসভবনে যেতে চাইলে আটক হন।"
pred = inference_fn(sentence)
print(f"Input Sentence : {sentence}")
print(f"Person Name Entities : {pred}")

Output:

Input Sentence : আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম।
Person Name Entities : ['আব্দুর' 'রহিম']


Input Sentence : ইঞ্জিনিয়ার্স ইনস্টিটিউশন চট্টগ্রামের সাবেক সভাপতি প্রকৌশলী দেলোয়ার হোসেন মজুমদার প্রথম আলোকে বলেন, 'সংকট নিরসনে বর্তমান খালগুলোকে পূর্ণ প্রবাহে ফিরিয়ে আনার পাশাপাশি নতুন তিনটি খাল খনন জরুরি।'
Person Name Entities : ['দেলোয়ার' 'হোসেন' 'মজুমদার']


Input Sentence : দলীয় নেতারা তাঁর বাসভবনে যেতে চাইলে আটক হন।
Person Name Entities : []
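The snippet above returns the matching tokens individually. If whole names are more convenient, the tokens and their per-token predictions can be merged afterwards. The following is a minimal sketch, not part of the repository: `group_person_names` is a hypothetical helper, and it assumes the tokenizer marks subword continuations with a WordPiece-style `##` prefix and that you keep the full `tokens` and `predictions` lists from `inference_fn` instead of only the filtered tokens.

```python
def group_person_names(tokens, predictions, continuation_prefix="##"):
    """Merge per-token 0/1 predictions into whole-name strings.

    Hypothetical helper: assumes WordPiece-style continuation tokens
    (prefix "##") and one prediction per token.
    """
    names = []
    current = []  # words of the name currently being built
    for token, label in zip(tokens, predictions):
        if label != 1:
            # A non-name token closes the name in progress, if any.
            if current:
                names.append(" ".join(current))
                current = []
            continue
        if token.startswith(continuation_prefix):
            piece = token[len(continuation_prefix):]
            if current:
                current[-1] += piece  # glue the subword onto the previous word
            else:
                current.append(piece)
        else:
            current.append(token)
    if current:
        names.append(" ".join(current))
    return names


# Hand-written tokens and labels for illustration, not real tokenizer output.
print(group_person_names(["আব্দুর", "রহিম", "নামের"], [1, 1, 0]))
# → ['আব্দুর রহিম']
```

Consecutive positive tokens are treated as one name, so two different people mentioned back to back would be merged; a tagging scheme with explicit name boundaries would be needed to separate them.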

Datasets

We used two datasets to train and evaluate our pipeline.

The annotation formats of the two datasets differed considerably, so we had to preprocess both before merging them. Please refer to this notebook for preparing the dataset as required.
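To illustrate the kind of preprocessing involved, here is a minimal sketch of collapsing entity tags into the binary person / not-person labels used for training. It is an assumption-laden example, not the notebook's code: it assumes one source uses BIO-style tags such as `B-PER` and `I-PER`, and `to_binary_labels` and `merge_datasets` are hypothetical names. The actual formats and steps are in the notebook.

```python
def to_binary_labels(tags, person_entity="PER"):
    """Map each tag to 1 if it marks a person token, else 0.

    Assumes BIO-style tags ("B-PER", "I-PER", "O", ...); this format
    is an assumption for illustration only.
    """
    labels = []
    for tag in tags:
        # Strip a "B-" / "I-" style prefix if present; "O" stays "O".
        entity = tag.split("-", 1)[-1]
        labels.append(1 if entity == person_entity else 0)
    return labels


def merge_datasets(*datasets):
    """Concatenate datasets given as lists of (tokens, labels) pairs."""
    merged = []
    for dataset in datasets:
        merged.extend(dataset)
    return merged


print(to_binary_labels(["B-PER", "I-PER", "O", "B-LOC"]))
# → [1, 1, 0, 0]
```

Once every source is reduced to the same `(tokens, labels)` shape, merging is a plain concatenation.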

Training and Evaluation

We treated this problem as a token classification task, so finetuning the BanglaBERT model was a natural fit. BanglaBERT is an ELECTRA discriminator model pretrained with the Replaced Token Detection (RTD) objective. Models finetuned from this checkpoint achieve state-of-the-art results on many NLP tasks in Bengali. We mainly finetuned two BanglaBERT checkpoints: BanglaBERT and BanglaBERT small.

BanglaBERT performed better than BanglaBERT small (83% vs. 79% F1 score on the test set). Please refer to this notebook to see the training process.
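For reference, an F1 score for the person-name class can be computed from binary labels as sketched below. This README does not say whether the reported scores are token-level or entity-level, so treat this token-level version as an assumption; `binary_f1` is a hypothetical helper, and the notebooks hold the actual evaluation code.

```python
def binary_f1(y_true, y_pred):
    """Token-level F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration: 2 true positives, 1 false positive, 1 false negative.
score = binary_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(round(score, 3))
# → 0.667
```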

Quantitative results

Please refer to this notebook to see the evaluation process.