BanglaBERT Crime Tagging — Multi-Task Model
A multi-task model fine-tuned on Bangla news articles for automated crime analysis. Built on top of csebuetnlp/banglabert (a Bangla ELECTRA-based encoder), it jointly predicts six structured outputs from a single article (headline + body).
Model Architecture
BanglaBERT (ELECTRA encoder, hidden=768)
│
├── [CLS] token → Dropout(0.1)
│     ├── crime_head     Linear(768 → 2)    is_crime
│     ├── event_head     Linear(768 → 49)   event type
│     ├── direct_head    Linear(768 → 2)    is_direct
│     └── origin_head    Linear(768 → 3)    origin
│
└── All tokens → Dropout(0.1)
      ├── loc_ner_head   Linear(768 → 3)    location BIO tags
      └── time_ner_head  Linear(768 → 3)    time-span BIO tags
Input: [CLS] headline [SEP] article content [SEP], max 256 tokens.
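As an illustration of the input layout, the sketch below assembles a headline/content pair into the `[CLS] headline [SEP] article content [SEP]` form with a 256-token cap. It works on pre-split token lists rather than the real tokenizer, and the helper name `build_pair_input` is hypothetical. The truncation policy (trim the longer segment one token at a time) is an assumption about what `truncation=True` does for sentence pairs, not something this model card states.

```python
from typing import List, Tuple

MAX_LEN = 256  # maximum sequence length stated above


def build_pair_input(headline: List[str], content: List[str],
                     max_len: int = MAX_LEN) -> Tuple[List[str], List[int]]:
    """Lay out a pair as [CLS] headline [SEP] content [SEP] (illustrative only)."""
    headline, content = list(headline), list(content)
    budget = max_len - 3  # reserve room for [CLS] and the two [SEP] tokens
    # Assumed policy: drop one token at a time from whichever segment is longer.
    while len(headline) + len(content) > budget:
        if len(headline) > len(content):
            headline.pop()
        else:
            content.pop()
    tokens = ["[CLS]", *headline, "[SEP]", *content, "[SEP]"]
    # Segment ids: 0 covers [CLS], headline and the first [SEP]; 1 covers the rest.
    segment_ids = [0] * (len(headline) + 2) + [1] * (len(content) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input(["h1", "h2"], ["c1", "c2", "c3"])
print(tokens)
print(segment_ids)
```

The segment ids are what later lets a decoder tell whether a token offset refers to the headline or to the article body.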
Output Heads
| Head | Task type | Output |
|---|---|---|
| is_crime | Binary classification | True / False |
| event | Multi-class (49 classes) | event type string or null |
| is_direct | Binary classification | True / False |
| origin | 3-class classification | local / international / null |
| loc_ner | Token classification (BIO) | O, B-LOC, I-LOC |
| time_ner | Token classification (BIO) | O, B-TIME, I-TIME |
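To make the BIO outputs concrete, here is a minimal sketch of how a sequence of tag ids can be turned into spans. It assumes the id order 0 = O, 1 = B, 2 = I (the order in which the tags are listed above) and works on whole words for readability, whereas the model predicts one tag per subword token. The function name `bio_to_spans` is hypothetical.

```python
from typing import List

ID2TAG = {0: "O", 1: "B", 2: "I"}  # assumed id order: O, B-*, I-*


def bio_to_spans(tokens: List[str], tag_ids: List[int]) -> List[str]:
    """Group tokens into spans: B opens a span, I extends it, O closes it."""
    spans: List[str] = []
    current: List[str] = []
    for token, tag_id in zip(tokens, tag_ids):
        tag = ID2TAG[tag_id]
        if tag == "B":
            if current:  # a new B closes any span that is still open
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:  # O, or an I with no open span, ends the current span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


words = ["Clash", "in", "New", "Market", "and", "Mirpur"]
tag_ids = [0, 0, 1, 2, 0, 1]
print(bio_to_spans(words, tag_ids))  # ['New Market', 'Mirpur']
```

An `I` tag that does not follow a `B` is dropped here, which is one common convention. Other decoders treat such a tag as the start of a new span.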
Training Details
| Parameter | Value |
|---|---|
| Base model | csebuetnlp/banglabert |
| Max sequence length | 256 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Loss (event head) | Focal loss (γ=2.0) |
| Loss (other heads) | Class-weighted cross-entropy |
| Best val avg-F1 | N/A |
| Early-stop epoch | N/A |
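For readers unfamiliar with focal loss, the sketch below shows the standard per-example form, FL = -(1 - p_t)^γ · log(p_t), using γ = 2.0 from the table. It is written with the standard library only so the arithmetic is easy to follow. The actual training code is not part of this card, so details such as an additional class weight (alpha) or the reduction over a batch are not known and are left out.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Plain cross-entropy for one example: -log(p_t)."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits: Sequence[float], target: int, gamma: float = 2.0) -> float:
    """Focal loss for one example: cross-entropy scaled by (1 - p_t) ** gamma."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


logits = [4.0, 0.0, 0.0]  # the model strongly favours class 0
for target in (0, 1):     # class 0 is an easy example, class 1 a hard one
    ce = cross_entropy(logits, target)
    fl = focal_loss(logits, target)
    print(f"target={target}  CE={ce:.4f}  focal={fl:.6f}  ratio={fl / ce:.4f}")
```

The ratio column shows the intended effect: the confidently correct example is scaled down to a small fraction of its cross-entropy, while the misclassified one keeps almost all of it, so rare or hard event classes contribute more to the gradient.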
Evaluation Results (Test Set)
Overall average macro-F1 across all heads: 0.8413
is_crime (Binary)
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| not-crime | — | — | — | — |
| crime | — | — | — | — |
origin (3-class)
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| none | 0.90 | 0.82 | 0.86 | 907 |
| international | 0.80 | 0.91 | 0.85 | 338 |
| local | 0.94 | 0.95 | 0.94 | 1873 |
| macro avg | 0.88 | 0.89 | 0.88 | 3118 |
| weighted avg | 0.91 | 0.91 | 0.91 | 3118 |
Accuracy: 0.91
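The difference between the macro and weighted averages can be checked directly from the per-class rows. The short calculation below recomputes both F1 averages for the origin head from the rounded values in the table. It is a worked example only, not the evaluation script.

```python
# Per-class (F1, support) pairs copied from the origin table above.
rows = {
    "none": (0.86, 907),
    "international": (0.85, 338),
    "local": (0.94, 1873),
}

total_support = sum(support for _, support in rows.values())

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in rows.values()) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))  # 3118 0.88 0.91
```

Because `local` is both the largest class and the best-scoring one, the weighted average lands above the macro average.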
event (49-class) — macro-F1: 0.7003 | weighted-F1: 0.7810
| Event | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| none | 0.91 | 0.79 | 0.85 | 724 |
| armed_attack | 0.60 | 0.60 | 0.60 | 40 |
| arms_trafficking | 0.75 | 0.55 | 0.63 | 11 |
| arrest | 0.80 | 0.84 | 0.82 | 187 |
| arson | 0.79 | 0.93 | 0.86 | 29 |
| assault | 0.80 | 0.81 | 0.80 | 122 |
| attempted_murder | 1.00 | 0.80 | 0.89 | 5 |
| blockade | 0.80 | 0.88 | 0.84 | 32 |
| bribery | 0.43 | 0.75 | 0.55 | 4 |
| burglary | 0.50 | 0.50 | 0.50 | 2 |
| child_abuse | 0.83 | 0.71 | 0.77 | 7 |
| corruption | 0.74 | 0.82 | 0.78 | 131 |
| cybercrime | 0.71 | 0.75 | 0.73 | 61 |
| data_breach | 0.71 | 0.71 | 0.71 | 7 |
| drug_trafficking | 0.71 | 0.77 | 0.74 | 31 |
| fraud | 0.76 | 0.73 | 0.75 | 119 |
| gang_crime | 0.80 | 0.50 | 0.62 | 16 |
| hacking | 0.25 | 0.20 | 0.22 | 5 |
| human_chain | 0.62 | 0.89 | 0.73 | 9 |
| human_trafficking | 0.88 | 0.70 | 0.78 | 20 |
| identity_theft | 0.00 | 0.00 | 0.00 | 0 |
| kidnapping | 0.86 | 0.86 | 0.86 | 21 |
| legal_proceedings | 0.80 | 0.79 | 0.79 | 200 |
| looting | 0.62 | 0.57 | 0.59 | 14 |
| movement | 0.23 | 1.00 | 0.38 | 3 |
| murder | 0.84 | 0.81 | 0.82 | 228 |
| online_scam | 0.29 | 0.25 | 0.27 | 8 |
| organized_crime | 0.00 | 0.00 | 0.00 | 0 |
| other_crime | 0.44 | 0.55 | 0.49 | 128 |
| phishing | 0.83 | 1.00 | 0.91 | 5 |
| police_action | 0.66 | 0.68 | 0.67 | 164 |
| procession | 0.75 | 0.67 | 0.71 | 9 |
| protest_unrest | 0.79 | 0.79 | 0.79 | 276 |
| raid | 0.83 | 0.91 | 0.87 | 22 |
| rally | 0.67 | 0.62 | 0.65 | 16 |
| ransomware | 1.00 | 0.50 | 0.67 | 2 |
| rape | 0.80 | 0.96 | 0.87 | 47 |
| riot | 0.41 | 0.47 | 0.44 | 15 |
| robbery | 0.80 | 0.88 | 0.84 | 58 |
| sexual_harassment | 0.70 | 0.83 | 0.76 | 52 |
| shooting | 0.65 | 0.85 | 0.74 | 41 |
| sit_in | 0.00 | 0.00 | 0.00 | 0 |
| smuggling | 0.83 | 0.94 | 0.88 | 36 |
| snatching | 0.85 | 0.85 | 0.85 | 33 |
| stabbing | 0.77 | 0.86 | 0.81 | 28 |
| strike | 0.92 | 0.96 | 0.94 | 25 |
| terrorism | 1.00 | 0.40 | 0.57 | 5 |
| theft | 0.78 | 0.84 | 0.81 | 45 |
| vandalism | 0.85 | 0.75 | 0.79 | 75 |
| macro avg | 0.68 | 0.69 | 0.67 | 3118 |
| weighted avg | 0.79 | 0.78 | 0.78 | 3118 |
Accuracy: 0.78
loc_ner (Location BIO) — macro-F1: 0.7743
| Tag | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| O | 1.00 | 0.99 | 0.99 | 164,441 |
| B-LOC | 0.61 | 0.90 | 0.72 | 1,691 |
| I-LOC | 0.49 | 0.79 | 0.60 | 661 |
| macro avg | 0.70 | 0.89 | 0.77 | 166,793 |
| weighted avg | 0.99 | 0.99 | 0.99 | 166,793 |
Accuracy: 0.99
time_ner (Time-span BIO) — macro-F1: 0.8499
| Tag | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| O | 1.00 | 1.00 | 1.00 | 164,825 |
| B-TIME | 0.67 | 0.90 | 0.77 | 795 |
| I-TIME | 0.69 | 0.90 | 0.78 | 1,173 |
| macro avg | 0.79 | 0.93 | 0.85 | 166,793 |
| weighted avg | 1.00 | 0.99 | 0.99 | 166,793 |
Accuracy: 0.99
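The 0.99 token accuracy of both NER heads should be read alongside the class balance: almost every token is tagged `O`. The snippet below uses the support counts from the two tables to compute the accuracy of a trivial baseline that predicts `O` everywhere. The baseline itself is illustrative and not part of the reported evaluation.

```python
from typing import Dict

# Token-level support counts copied from the loc_ner and time_ner tables.
support: Dict[str, Dict[str, int]] = {
    "loc_ner": {"O": 164_441, "B-LOC": 1_691, "I-LOC": 661},
    "time_ner": {"O": 164_825, "B-TIME": 795, "I-TIME": 1_173},
}


def always_o_accuracy(counts: Dict[str, int]) -> float:
    """Accuracy of a baseline that tags every token as O."""
    return counts["O"] / sum(counts.values())


for head, counts in support.items():
    total = sum(counts.values())
    print(f"{head}: {total} tokens, always-O accuracy = {always_o_accuracy(counts):.4f}")
```

Since such a baseline already scores above 0.98, the macro-F1 and the B-/I- rows are the more informative numbers for these heads.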
Event Labels (49 classes)
[
"none",
"armed_attack",
"arms_trafficking",
"arrest",
"arson",
"assault",
"attempted_murder",
"blockade",
"bribery",
"burglary",
"child_abuse",
"corruption",
"cybercrime",
"data_breach",
"drug_trafficking",
"fraud",
"gang_crime",
"hacking",
"human_chain",
"human_trafficking",
"identity_theft",
"kidnapping",
"legal_proceedings",
"looting",
"movement",
"murder",
"online_scam",
"organized_crime",
"other_crime",
"phishing",
"police_action",
"procession",
"protest_unrest",
"raid",
"rally",
"ransomware",
"rape",
"riot",
"robbery",
"sexual_harassment",
"shooting",
"sit_in",
"smuggling",
"snatching",
"stabbing",
"strike",
"terrorism",
"theft",
"vandalism"
]
Origin Labels
[
"none",
"international",
"local"
]
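Both label lists put "none" at index 0, and the output table reports that case as null. A minimal sketch of the index-to-label mapping, using the origin labels above, could look like this (the helper name `index_to_label` is hypothetical):

```python
from typing import List, Optional

ORIGIN_LABELS = ["none", "international", "local"]


def index_to_label(index: int, labels: List[str]) -> Optional[str]:
    """Map an argmax index to its label, treating index 0 ("none") as None."""
    if not 0 <= index < len(labels):
        raise IndexError(f"index {index} is outside the label list")
    return None if index == 0 else labels[index]


print([index_to_label(i, ORIGIN_LABELS) for i in range(len(ORIGIN_LABELS))])
# [None, 'international', 'local']
```

The same helper applies to the event head with the 49-entry event list.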
Usage
Installation
pip install torch transformers huggingface_hub
Download model files
from huggingface_hub import hf_hub_download
# Download the checkpoint (contains all head weights + label metadata)
ckpt_path = hf_hub_download(repo_id="arafatfahim/crime-event-detection", filename="checkpoint.pt")
Full inference example
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from huggingface_hub import hf_hub_download
REPO_ID = "arafatfahim/crime-event-detection"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MAX_LEN = 256
# ── 1. Recreate the model class ──────────────────────────────────────────────
class BanglaBertMultiTask(nn.Module):
    def __init__(self, bert, num_events, num_origins):
        super().__init__()
        self.bert = bert
        hidden = self.bert.config.hidden_size
        self.dropout = nn.Dropout(0.1)
        self.crime_head = nn.Linear(hidden, 2)
        self.event_head = nn.Linear(hidden, num_events)
        self.direct_head = nn.Linear(hidden, 2)
        self.origin_head = nn.Linear(hidden, num_origins)
        self.loc_ner_head = nn.Linear(hidden, 3)
        self.time_ner_head = nn.Linear(hidden, 3)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        seq_out = self.dropout(out.last_hidden_state)
        cls = seq_out[:, 0, :]
        return {
            "is_crime" : self.crime_head(cls),
            "event" : self.event_head(cls),
            "is_direct": self.direct_head(cls),
            "origin" : self.origin_head(cls),
            "loc_ner" : self.loc_ner_head(seq_out),
            "time_ner" : self.time_ner_head(seq_out),
        }
# ── 2. Load checkpoint & tokenizer ──────────────────────────────────────────
ckpt_path = hf_hub_download(repo_id=REPO_ID, filename="checkpoint.pt")
ckpt = torch.load(ckpt_path, map_location=DEVICE)
event_labels = ckpt["event_labels"] # list of str
origin_labels = ckpt["origin_labels"] # ["none", "international", "local"]
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
bert = AutoModel.from_pretrained(REPO_ID)
model = BanglaBertMultiTask(bert, ckpt["num_events"], ckpt["num_origins"]).to(DEVICE)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
# ── 3. Predict ───────────────────────────────────────────────────────────────
headline = "ঢাকায় ছিনতাইয়ের ঘটনায় যুবক গ্রেপ্তার"
content = "রাতে একটি মোটরসাইকেল থামিয়ে যাত্রীর মোবাইল ও টাকা ছিনিয়ে নেয় দুর্বৃত্তরা।"
enc = tokenizer(
    headline, content,
    max_length=MAX_LEN,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
    return_offsets_mapping=True,
    return_token_type_ids=True,
)
offset_mapping = enc.pop("offset_mapping").squeeze(0).tolist()
token_type_ids = enc["token_type_ids"].squeeze(0).tolist()
with torch.no_grad():
    logits = model(enc["input_ids"].to(DEVICE), enc["attention_mask"].to(DEVICE))
# Classification heads
is_crime = bool(logits["is_crime"].argmax(dim=-1).item())
is_direct = bool(logits["is_direct"].argmax(dim=-1).item())
event_idx = logits["event"].argmax(dim=-1).item()
event = event_labels[event_idx] if event_idx != 0 else None
event_conf = F.softmax(logits["event"], dim=-1).squeeze()[event_idx].item()
origin_idx = logits["origin"].argmax(dim=-1).item()
origin = origin_labels[origin_idx] if origin_idx != 0 else None
# NER heads — decode BIO spans from token predictions
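def decode_bio(preds, offsets, type_ids, texts):
    # preds: tag id per token (0 = O, 1 = B, 2 = I)
    # offsets: (start, end) character offsets; (0, 0) marks special/padding tokens
    # type_ids: 0 for headline tokens, 1 for content tokens
    spans, current = [], []
    for pred, (s, e), tid in zip(preds, offsets, type_ids):
        text = texts[tid] if tid < len(texts) else ""
        if pred == 1:
            # B tag: close any open span and start a new one
            if current:
                spans.append("".join(current))
            current = [] if (s == 0 and e == 0) else [text[s:e]]
        elif pred == 2 and current and not (s == 0 and e == 0):
            # I tag: extend the open span
            current.append(text[s:e])
        else:
            # O tag (or stray I tag): close the open span
            if current:
                spans.append("".join(current))
                current = []
    if current:
        spans.append("".join(current))
    return list(dict.fromkeys(spans))  # deduplicate, preserve order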
loc_preds = logits["loc_ner"].squeeze(0).argmax(dim=-1).tolist()
time_preds = logits["time_ner"].squeeze(0).argmax(dim=-1).tolist()
locations = decode_bio(loc_preds, offset_mapping, token_type_ids, [headline, content])
time_spans = decode_bio(time_preds, offset_mapping, token_type_ids, [headline, content])
print({
    "is_crime" : is_crime,
    "event" : event,
    "event_conf" : round(event_conf, 4),
    "is_direct" : is_direct,
    "origin" : origin,
    "locations" : locations,
    "event_occurred": time_spans[0] if time_spans else None,
})
Expected output structure
{
"is_crime" : true,
"event" : "theft",
"event_conf" : 0.9132,
"is_direct" : true,
"origin" : "local",
"locations" : ["ঢাকা"],
"event_occurred": null
}
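The structure above is written as JSON, whereas printing a Python dict shows `True` and `None` rather than `true` and `null`. If the JSON form is what you need, a small sketch using the standard `json` module is shown below. The values are copied from the example structure and are not real model output.

```python
import json

# Same fields as the example structure, expressed as a Python dict.
result = {
    "is_crime": True,
    "event": "theft",
    "event_conf": 0.9132,
    "is_direct": True,
    "origin": "local",
    "locations": ["ঢাকা"],
    "event_occurred": None,
}

# ensure_ascii=False keeps Bangla text readable instead of \uXXXX escapes.
as_json = json.dumps(result, ensure_ascii=False, indent=2)
print(as_json)
```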
Files in this repository
| File | Description |
|---|---|
| config.json | BERT encoder config (ELECTRA architecture) |
| model.safetensors | BERT encoder weights |
| tokenizer_config.json / tokenizer.json | Tokenizer files |
| checkpoint.pt | Full model weights (all 6 heads) + label metadata |
Note: checkpoint.pt is required to restore the classification/NER heads. The config.json + model.safetensors files only contain the shared BERT encoder.
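Because the heads and label metadata live only in the checkpoint, it can be useful to sanity-check the loaded dictionary before building the model. The sketch below checks for the keys that the usage example reads (`model_state_dict`, `event_labels`, `origin_labels`, `num_events`, `num_origins`). The helper `check_checkpoint` is hypothetical, and the expectation that each label list has as many entries as its count field is an assumption.

```python
from typing import Any, Dict, List

# Keys read by the usage example when restoring the model.
REQUIRED_KEYS = (
    "model_state_dict",
    "event_labels",
    "origin_labels",
    "num_events",
    "num_origins",
)


def check_checkpoint(ckpt: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a loaded checkpoint dict (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in ckpt]
    if problems:
        return problems
    # Assumed invariant: label lists match the head sizes stored alongside them.
    if len(ckpt["event_labels"]) != ckpt["num_events"]:
        problems.append("event_labels length does not match num_events")
    if len(ckpt["origin_labels"]) != ckpt["num_origins"]:
        problems.append("origin_labels length does not match num_origins")
    return problems


# Stand-in dict for demonstration; a real one would come from torch.load(...).
fake_ckpt = {
    "model_state_dict": {},
    "event_labels": ["none", "arrest"],
    "origin_labels": ["none", "international", "local"],
    "num_events": 2,
    "num_origins": 3,
}
print(check_checkpoint(fake_ckpt))  # []
```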