Bangla NER — Named Entity Recognition for Bengali

A fine-tuned token classification model for Bengali (Bangla) Named Entity Recognition using the BIO tagging scheme. Built on top of csebuetnlp/banglabert (ELECTRA-based).


Entity Types

| Tag | Description | Example |
| --- | --- | --- |
| PER | Person names | একেএম শহীদুল হক |
| LOC | Locations, cities, countries | বাংলাদেশ, ঢাকা |
| ORG | Organizations, companies | টুইটার, রিয়াল মাদ্রিদ |
| POL | Political entities / parties | আওয়ামী লীগ |
| DATE | Calendar dates | সোমবার, ২০২৪ সালে |
| TIME | Times of day | সকাল ৮টায় |
| EVENT | Named events | রোহিঙ্গা সঙ্কট |
| CRIME | Crime-related entities | হত্যা মামলা |
| TITLE | Titles, designations | মহাপরিদর্শক |
| NUM | Numbers, quantities | ৯৩ শতাংশ |
| SYMBOL | Symbols, currencies | ৳, % |
| CONSTITUENCY | Electoral constituencies | ঢাকা-১৮ |
| INST | Institutions | তথ্য অধিদপ্তর |

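All tags follow the BIO format: B- marks the first token of an entity, I- marks each following token of the same entity, and O marks tokens outside any entity. For example, the three-word name একেএম শহীদুল হক is tagged B-PER, I-PER, I-PER.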
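
To make the scheme concrete, here is a minimal sketch of turning a BIO tag sequence back into entity spans. The helper name `bio_to_spans` is hypothetical, the tags in the example are written by hand rather than produced by the model, and treating a stray I- tag as the start of a new entity is one lenient choice among several.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    current_type, start = None, None
    for i, tag in enumerate(tags):
        prefix, _, entity = tag.partition("-")
        # An I- tag of the same type extends the span that is already open.
        if prefix == "I" and entity == current_type:
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, start, i))
            current_type, start = None, None
        # B- opens a new span; a stray I- is also treated as a start (lenient).
        if prefix in ("B", "I") and entity:
            current_type, start = entity, i
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


# Hand-written tags for illustration only, not model output.
words = ["একেএম", "শহীদুল", "হক", "বাংলাদেশে", "এসেছেন"]
tags = ["B-PER", "I-PER", "I-PER", "B-LOC", "O"]

for entity, start, end in bio_to_spans(tags):
    print(entity, " ".join(words[start:end]))
```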


Training Details

| Parameter | Value |
| --- | --- |
| Base model | csebuetnlp/banglabert |
| Architecture | ELECTRA (discriminator) |
| Task | Token Classification (NER) |
| Dataset size | 22,144 sentences |
| Train split | 85% (18,822) |
| Validation split | 7.5% (1,661) |
| Test split | 7.5% (1,661) |
| Max sequence length | 256 tokens |
| Batch size | 16 |
| Epochs | 8 (early stopping, patience=2) |
| Best epoch | 7 |
| Learning rate | 2e-5 |
| LR scheduler | Linear with warmup |
| Warmup steps | 10% of total steps |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Mixed precision | fp16 |
| Framework | PyTorch + HuggingFace Transformers |
| Hardware | NVIDIA GeForce RTX 4070 Ti SUPER (16 GB) |
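
The split sizes and the warmup length follow from the figures listed here. The sketch below works through that arithmetic under assumptions the model card does not state: splits rounded to the nearest sentence, one optimizer step per batch (no gradient accumulation, a single GPU), and the warmup count rounded up. The variable names are illustrative.

```python
import math

total_sentences = 22_144
val_fraction = test_fraction = 0.075
batch_size = 16
epochs = 8
warmup_fraction = 0.10

# Assumption: validation and test sizes are rounded, train takes the remainder.
n_val = round(total_sentences * val_fraction)
n_test = round(total_sentences * test_fraction)
n_train = total_sentences - n_val - n_test

# Assumption: one optimizer step per batch, with the last partial batch kept.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_fraction)

print(n_train, n_val, n_test)                      # 18822 1661 1661
print(steps_per_epoch, total_steps, warmup_steps)  # 1177 9416 942
```

The first line reproduces the split sizes reported for this model. The step counts are derived values, not reported ones, and would change with gradient accumulation or multi-GPU training.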

Test Set Results (Overall)

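| Metric | Score |
| --- | --- |
| F1 | 74.93% |
| Precision | 75.82% |
| Recall | 74.06% |
| Token Accuracy | 93.41% |

F1, precision, and recall are the micro averages from the per-entity results, expressed as percentages.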

Per-Entity Results (Test Set)

| Entity | Precision | Recall | F1 | Support |
| --- | --- | --- | --- | --- |
| CONSTITUENCY | 0.8333 | 0.7500 | 0.7895 | 20 |
| CRIME | 0.9489 | 0.9489 | 0.9489 | 137 |
| DATE | 0.7730 | 0.7552 | 0.7640 | 478 |
| EVENT | 0.6827 | 0.6514 | 0.6667 | 109 |
| INST | 0.7119 | 0.7636 | 0.7368 | 55 |
| LOC | 0.7451 | 0.7245 | 0.7347 | 795 |
| NUM | 0.6949 | 0.8913 | 0.7810 | 46 |
| ORG | 0.5617 | 0.5686 | 0.5652 | 408 |
| PER | 0.7654 | 0.7260 | 0.7452 | 719 |
| POL | 0.8182 | 0.8333 | 0.8257 | 54 |
| SYMBOL | 1.0000 | 0.8750 | 0.9333 | 8 |
| TIME | 0.9839 | 0.8472 | 0.9104 | 144 |
| TITLE | 0.9532 | 0.9645 | 0.9588 | 169 |
| micro avg | 0.7582 | 0.7406 | 0.7493 | 3142 |
| macro avg | 0.8056 | 0.7923 | 0.7969 | 3142 |
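
The two average rows can be checked against the per-entity rows. The sketch below recomputes them from the rounded figures reported here: macro F1 as the unweighted mean of the per-entity F1 scores, and micro F1 as the harmonic mean of the micro precision and recall. The dictionary layout and the helper name `f1_score` are illustrative choices, and this is not the evaluation script used for the model.

```python
# (precision, recall, f1, support) per entity, copied from the reported results.
per_entity = {
    "CONSTITUENCY": (0.8333, 0.7500, 0.7895, 20),
    "CRIME": (0.9489, 0.9489, 0.9489, 137),
    "DATE": (0.7730, 0.7552, 0.7640, 478),
    "EVENT": (0.6827, 0.6514, 0.6667, 109),
    "INST": (0.7119, 0.7636, 0.7368, 55),
    "LOC": (0.7451, 0.7245, 0.7347, 795),
    "NUM": (0.6949, 0.8913, 0.7810, 46),
    "ORG": (0.5617, 0.5686, 0.5652, 408),
    "PER": (0.7654, 0.7260, 0.7452, 719),
    "POL": (0.8182, 0.8333, 0.8257, 54),
    "SYMBOL": (1.0000, 0.8750, 0.9333, 8),
    "TIME": (0.9839, 0.8472, 0.9104, 144),
    "TITLE": (0.9532, 0.9645, 0.9588, 169),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Macro average: every entity type counts equally, whatever its support.
macro_f1 = sum(f1 for _, _, f1, _ in per_entity.values()) / len(per_entity)

# Micro average: computed from the pooled precision and recall.
micro_f1 = f1_score(0.7582, 0.7406)

total_support = sum(support for *_, support in per_entity.values())

print(round(macro_f1, 4), round(micro_f1, 4), total_support)  # 0.7969 0.7493 3142
```

Macro F1 is higher than micro F1 here because small but well-recognized classes such as SYMBOL, TITLE, and CRIME weigh as much as the large, weaker ORG class in the macro average, while the micro average is dominated by the high-support classes.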

Usage

With pipeline (recommended)

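from transformers import pipeline

# aggregation_strategy="simple" groups adjacent tokens of the same entity into one span.
ner = pipeline(
    "ner",
    model="arafatfahim/BanglaTag",
    aggregation_strategy="simple",
)

text = "একেএম শহীদুল হক বাংলাদেশে কক্সবাজার এলাকায় সোমবার সংবাদ সম্মেলন করেন"
# Each result is a dict with the keys entity_group, score, word, start, and end.
print(ner(text))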
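
With an aggregation strategy set, the pipeline returns a list of dicts, one per entity span. A minimal sketch of post-processing that list follows: it drops low-confidence spans and groups the rest by entity type. The helper `group_entities`, the 0.5 threshold, and the sample list are all illustrative; the sample is written by hand in the pipeline's output shape (with `start` and `end` omitted) and is not real model output.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(results: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Collect entity strings by type, skipping spans below min_score."""
    grouped = defaultdict(list)
    for item in results:
        if item["score"] >= min_score:
            grouped[item["entity_group"]].append(item["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output; the scores are placeholders.
sample = [
    {"entity_group": "PER", "score": 0.90, "word": "একেএম শহীদুল হক"},
    {"entity_group": "LOC", "score": 0.90, "word": "কক্সবাজার"},
    {"entity_group": "DATE", "score": 0.30, "word": "সোমবার"},
]

print(group_entities(sample))
```

In practice `group_entities(ner(text))` would be called on the real pipeline output, and the threshold would be tuned on held-out data rather than fixed at 0.5.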

Manual inference

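from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "arafatfahim/BanglaTag"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

# The input is already split into words; the tokenizer splits each word into subwords.
tokens = ["একেএম", "শহীদুল", "হক", "বাংলাদেশে", "এসেছেন"]
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Highest-scoring label id for every subword token of the single input sequence.
predictions = logits.argmax(-1)[0]
# Maps each subword to its word index (None for special tokens); needs a fast tokenizer.
word_ids = inputs.word_ids()

# Print one label per word, taken from the word's first subword.
prev = None
for word_id, pred_id in zip(word_ids, predictions):
    if word_id is None or word_id == prev:
        continue
    print(f"{tokens[word_id]:20s}{model.config.id2label[pred_id.item()]}")
    prev = word_id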
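
Training used a maximum sequence length of 256 tokens, so longer inputs fall outside what the model saw during fine-tuning. One option, sketched below, is to split a long word list into consecutive chunks that each fit the budget and run inference per chunk. The helper `chunk_words` is hypothetical, `reserved=2` assumes the tokenizer adds two special tokens per sequence, and the demo uses a stub counter (word length) in place of a real tokenizer so that it runs without downloading the model.

```python
from typing import Callable, List


def chunk_words(
    words: List[str],
    count_subwords: Callable[[str], int],
    max_tokens: int = 256,
    reserved: int = 2,
) -> List[List[str]]:
    """Split words into consecutive chunks whose subword total fits the budget."""
    budget = max_tokens - reserved  # leave room for special tokens
    chunks, current, used = [], [], 0
    for word in words:
        n = count_subwords(word)
        # Start a new chunk when adding this word would exceed the budget.
        if current and used + n > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(word)
        used += n
    if current:
        chunks.append(current)
    return chunks


# Stub counter for the demo: pretend each character is one subword.
demo = chunk_words(["aa", "bbb", "c", "dddd"], count_subwords=len, max_tokens=7)
print(demo)  # [['aa', 'bbb'], ['c', 'dddd']]
```

With the real tokenizer, the counter could be `lambda w: len(tokenizer.tokenize(w))`. A single word longer than the budget still ends up in its own oversized chunk, and a chunk boundary can cut an entity in half, so splitting at sentence boundaries first is usually the safer choice.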

Citation

If you use this model, please cite:

@misc{bangla-ner-2026,
  title  = {Bangla NER: Fine-tuned BanglaBERT for Bengali Named Entity Recognition},
  year   = {2026},
  url    = {https://huggingface.co/arafatfahim/BanglaTag}
}