Bangla NER — Named Entity Recognition for Bengali

A token classification model for Bengali (Bangla) Named Entity Recognition, fine-tuned from csebuetnlp/banglabert (an ELECTRA-based model) and trained with the BIO tagging scheme.

Entity Types

Tag	Description	Example
`PER`	Person names	একেএম শহীদুল হক
`LOC`	Locations, cities, countries	বাংলাদেশ, ঢাকা
`ORG`	Organizations, companies	টুইটার, রিয়াল মাদ্রিদ
`POL`	Political entities / parties	আওয়ামী লীগ
`DATE`	Calendar dates	সোমবার, ২০২৪ সালে
`TIME`	Times of day	সকাল ৮টায়
`EVENT`	Named events	রোহিঙ্গা সঙ্কট
`CRIME`	Crime-related entities	হত্যা মামলা
`TITLE`	Titles, designations	মহাপরিদর্শক
`NUM`	Numbers, quantities	৯৩ শতাংশ
`SYMBOL`	Symbols, currencies	৳, %
`CONSTITUENCY`	Electoral constituencies	ঢাকা-১৮
`INST`	Institutions	তথ্য অধিদপ্তর

All tags follow BIO format: B- (beginning), I- (inside), O (outside).
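
As a rough illustration of what the BIO scheme implies for the label space, the sketch below expands the 13 entity types from the table into B-/I- tags plus O, which gives 27 labels. The ordering and the names `ENTITY_TYPES`, `label2id`, and `id2label` are assumptions made for this example; the authoritative mapping is whatever `model.config.id2label` contains.

```python
# Entity types copied from the table above.
ENTITY_TYPES = [
    "PER", "LOC", "ORG", "POL", "DATE", "TIME", "EVENT",
    "CRIME", "TITLE", "NUM", "SYMBOL", "CONSTITUENCY", "INST",
]

# "O" first, then a B-/I- pair per entity type.
# This ordering is an assumption, not read from the model config.
labels = ["O"]
for entity in ENTITY_TYPES:
    labels.append(f"B-{entity}")  # first word of an entity
    labels.append(f"I-{entity}")  # any following word of the same entity

label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(labels))
print(labels[:5])
```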

Training Details

Parameter	Value
Base model	`csebuetnlp/banglabert`
Architecture	ELECTRA (discriminator)
Task	Token Classification (NER)
Dataset size	22,144 sentences
Train split	85% (18,822)
Validation split	7.5% (1,661)
Test split	7.5% (1,661)
Max sequence length	256 tokens
Batch size	16
Epochs	8 (early stopping, patience=2)
Best epoch	7
Learning rate	2e-5
LR scheduler	Linear with warmup
Warmup steps	10% of total steps
Weight decay	0.01
Optimizer	AdamW
Mixed precision	fp16
Framework	PyTorch + HuggingFace Transformers
Hardware	NVIDIA GeForce RTX 4070 Ti SUPER (16 GB)
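
To make the schedule numbers in the table concrete, here is a small sketch that derives the step counts and the learning rate at a given step. It assumes one optimizer step per batch (no gradient accumulation), a single device, that the last partial batch is kept, and that the schedule is planned over all 8 epochs; none of these are stated in the table. The function name `lr_at` is made up for this example.

```python
import math

# Values copied from the training table.
TRAIN_SENTENCES = 18_822
BATCH_SIZE = 16
EPOCHS = 8
BASE_LR = 2e-5
WARMUP_FRACTION = 0.10

# Assumption: one optimizer step per batch, last partial batch kept.
steps_per_epoch = math.ceil(TRAIN_SENTENCES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_FRACTION)


def lr_at(step):
    """Linear warmup from 0 to BASE_LR, then linear decay back to 0."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / (total_steps - warmup_steps)


print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the run has 1,177 steps per epoch and 9,416 planned steps, of which roughly the first 942 are warmup.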

Test Set Results (Overall)

Metric	Score
F1 (micro avg)	74.93%
Precision (micro avg)	75.82%
Recall (micro avg)	74.06%
Token Accuracy	93.41%

Per-Entity Results (Test Set)

Entity	Precision	Recall	F1	Support
CONSTITUENCY	0.8333	0.7500	0.7895	20
CRIME	0.9489	0.9489	0.9489	137
DATE	0.7730	0.7552	0.7640	478
EVENT	0.6827	0.6514	0.6667	109
INST	0.7119	0.7636	0.7368	55
LOC	0.7451	0.7245	0.7347	795
NUM	0.6949	0.8913	0.7810	46
ORG	0.5617	0.5686	0.5652	408
PER	0.7654	0.7260	0.7452	719
POL	0.8182	0.8333	0.8257	54
SYMBOL	1.0000	0.8750	0.9333	8
TIME	0.9839	0.8472	0.9104	144
TITLE	0.9532	0.9645	0.9588	169
micro avg	0.7582	0.7406	0.7493	3142
macro avg	0.8056	0.7923	0.7969	3142
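
The two average rows differ because macro averaging weights every entity type equally, while micro averaging weights every entity mention equally, so frequent types such as LOC and PER dominate it. The sketch below recomputes both from the table rows as a sanity check. The names `ROWS` and `f1_score` are made up for this example, and because the inputs are already rounded to four decimals, recomputed values can differ from the table in the last digit.

```python
# (precision, recall, f1, support) copied from the per-entity table.
ROWS = {
    "CONSTITUENCY": (0.8333, 0.7500, 0.7895, 20),
    "CRIME": (0.9489, 0.9489, 0.9489, 137),
    "DATE": (0.7730, 0.7552, 0.7640, 478),
    "EVENT": (0.6827, 0.6514, 0.6667, 109),
    "INST": (0.7119, 0.7636, 0.7368, 55),
    "LOC": (0.7451, 0.7245, 0.7347, 795),
    "NUM": (0.6949, 0.8913, 0.7810, 46),
    "ORG": (0.5617, 0.5686, 0.5652, 408),
    "PER": (0.7654, 0.7260, 0.7452, 719),
    "POL": (0.8182, 0.8333, 0.8257, 54),
    "SYMBOL": (1.0000, 0.8750, 0.9333, 8),
    "TIME": (0.9839, 0.8472, 0.9104, 144),
    "TITLE": (0.9532, 0.9645, 0.9588, 169),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


n = len(ROWS)
# Macro average: unweighted mean over entity types.
macro_precision = sum(p for p, _, _, _ in ROWS.values()) / n
macro_recall = sum(r for _, r, _, _ in ROWS.values()) / n
macro_f1 = sum(f for _, _, f, _ in ROWS.values()) / n

# Micro recall: support-weighted mean of per-type recall,
# i.e. correctly found mentions divided by all gold mentions.
total_support = sum(s for _, _, _, s in ROWS.values())
micro_recall = sum(r * s for _, r, _, s in ROWS.values()) / total_support

print(round(macro_precision, 4), round(macro_recall, 4), round(macro_f1, 4))
print(total_support, round(micro_recall, 4))
print(round(f1_score(0.7582, 0.7406), 4))  # micro F1 from micro P and R
```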

Usage

With pipeline (recommended)

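from transformers import pipeline

# Downloads the model and tokenizer from the Hugging Face Hub on first use.
ner = pipeline(
    "ner",
    model="arafatfahim/BanglaTag",
    # Group adjacent tokens of the same entity into a single result.
    aggregation_strategy="simple",
)

text = "একেএম শহীদুল হক বাংলাদেশে কক্সবাজার এলাকায় সোমবার সংবাদ সম্মেলন করেন"
# Each result is a dict with the entity type, confidence score, and character offsets.
print(ner(text))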
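
With an aggregation strategy set, the pipeline returns a list of dicts carrying `entity_group`, `score`, `word`, `start`, and `end`. A possible way to post-process that list is sketched below: drop low-confidence results and group the rest by entity type. The helpers `group_entities` and `make_entity`, the 0.5 threshold, and the sample entries (including their scores) are hand-written stand-ins for illustration, not actual output from this model.

```python
from collections import defaultdict


def group_entities(text, entities, min_score=0.5):
    """Group pipeline-style entity dicts by label, dropping low-confidence ones."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] < min_score:
            continue
        # Slice the original text by character offsets instead of using
        # "word", which is rebuilt from subword tokens.
        surface = text[entity["start"]:entity["end"]]
        grouped[entity["entity_group"]].append(surface)
    return dict(grouped)


def make_entity(text, surface, label, score):
    """Build a hand-written stand-in for one pipeline result (not model output)."""
    start = text.index(surface)
    return {
        "entity_group": label,
        "score": score,
        "word": surface,
        "start": start,
        "end": start + len(surface),
    }


text = "একেএম শহীদুল হক বাংলাদেশে কক্সবাজার এলাকায় সোমবার সংবাদ সম্মেলন করেন"
sample = [
    make_entity(text, "একেএম শহীদুল হক", "PER", 0.98),
    make_entity(text, "কক্সবাজার", "LOC", 0.91),
    make_entity(text, "সোমবার", "DATE", 0.42),  # below the threshold, dropped
]

grouped = group_entities(text, sample, min_score=0.5)
print(grouped)
```

In real use, `sample` would be replaced by the return value of `ner(text)`.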

Manual inference

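from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "arafatfahim/BanglaTag"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

# The input is already split into words; the tokenizer splits them further into subword tokens.
tokens = ["একেএম", "শহীদুল", "হক", "বাংলাদেশে", "এসেছেন"]
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predictions = logits.argmax(-1)[0]  # highest-scoring label id per subword token
word_ids = inputs.word_ids()  # maps each subword token back to its word index

# Print one label per word, taken from the word's first subword token.
prev = None
for word_id, pred_id in zip(word_ids, predictions):
    # Skip special tokens (word_id is None) and non-initial subwords.
    if word_id is None or word_id == prev:
        continue
    print(f"{tokens[word_id]:20s} → {model.config.id2label[pred_id.item()]}")
    prev = word_id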
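
The loop above prints one BIO tag per word. To turn those tags into entity spans, something like the following sketch can be used. The function name `bio_to_spans` is made up for this example, the tag sequence is hand-written rather than model output, and treating an I- tag with no matching B- tag as the start of a new span is one common convention, not necessarily the one used in this model's evaluation.

```python
def bio_to_spans(tokens, tags):
    """Merge word-level BIO tags into (label, text) entity spans."""
    spans = []
    current_label, current_words = None, []

    def flush():
        # Close the span being built, if any.
        if current_words:
            spans.append((current_label, " ".join(current_words)))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # B- always opens a span; a stray I- is treated as an opener too.
            flush()
            current_label, current_words = label, [token]
        elif prefix == "I":
            current_words.append(token)
        else:
            # "O": outside any entity.
            flush()
            current_label, current_words = None, []
    flush()
    return spans


tokens = ["একেএম", "শহীদুল", "হক", "বাংলাদেশে", "এসেছেন"]
tags = ["B-PER", "I-PER", "I-PER", "B-LOC", "O"]  # illustrative, not model output

spans = bio_to_spans(tokens, tags)
print(spans)
```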

Citation

If you use this model, please cite:

@misc{bangla-ner-2026,
  title  = {Bangla NER: Fine-tuned BanglaBERT for Bengali Named Entity Recognition},
  year   = {2026},
  url    = {https://huggingface.co/arafatfahim/BanglaTag}
}
