BanglaBERT Crime Tagging — Multi-Task Model

A multi-task model fine-tuned on Bangla news articles for automated crime analysis. Built on top of csebuetnlp/banglabert (a Bangla ELECTRA-based encoder), it jointly predicts six structured outputs from a single article (headline + body).

Model Architecture

BanglaBERT (ELECTRA encoder, hidden=768)
│
├── [CLS] token → Dropout(0.1)
│   ├── crime_head    Linear(768 → 2)      is_crime
│   ├── event_head    Linear(768 → 49)    event type
│   ├── direct_head   Linear(768 → 2)      is_direct
│   └── origin_head   Linear(768 → 3)      origin
│
└── All tokens → Dropout(0.1)
    ├── loc_ner_head   Linear(768 → 3)     location BIO tags
    └── time_ner_head  Linear(768 → 3)     time-span BIO tags

Input: [CLS] headline [SEP] article content [SEP], max 256 tokens.
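
To make the head layout concrete, here is a minimal sketch that attaches the six linear heads to a stand-in encoder and reports their output shapes. The `encoder` below is a hypothetical placeholder (a single embedding layer with an arbitrary vocabulary of 100 ids), used only so the example runs without downloading weights; the real model uses the BanglaBERT encoder.

```python
import torch
import torch.nn as nn

HIDDEN, NUM_EVENTS, NUM_ORIGINS, MAX_LEN = 768, 49, 3, 256

# Hypothetical stand-in for the encoder: token ids -> 768-d vectors.
encoder = nn.Embedding(100, HIDDEN)
dropout = nn.Dropout(0.1)

# Sentence-level heads read the [CLS] vector (position 0).
cls_heads = nn.ModuleDict({
    "is_crime":  nn.Linear(HIDDEN, 2),
    "event":     nn.Linear(HIDDEN, NUM_EVENTS),
    "is_direct": nn.Linear(HIDDEN, 2),
    "origin":    nn.Linear(HIDDEN, NUM_ORIGINS),
})
# Token-level heads read every position.
token_heads = nn.ModuleDict({
    "loc_ner":  nn.Linear(HIDDEN, 3),
    "time_ner": nn.Linear(HIDDEN, 3),
})

input_ids = torch.randint(0, 100, (2, MAX_LEN))   # batch of 2 articles
hidden = dropout(encoder(input_ids))              # (2, 256, 768)
cls = hidden[:, 0, :]                             # (2, 768)

shapes = {name: tuple(head(cls).shape) for name, head in cls_heads.items()}
shapes.update({name: tuple(head(hidden).shape) for name, head in token_heads.items()})
for name, shape in shapes.items():
    print(f"{name:10s} {shape}")
```

The four classification heads produce one logit vector per article, while the two NER heads produce one logit vector per token.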

Output Heads

Head	Task type	Output
`is_crime`	Binary classification	`True` / `False`
`event`	Multi-class (49 classes)	event type string or `null`
`is_direct`	Binary classification	`True` / `False`
`origin`	3-class classification	`local` / `international` / `null`
`loc_ner`	Token classification (BIO)	`O`, `B-LOC`, `I-LOC`
`time_ner`	Token classification (BIO)	`O`, `B-TIME`, `I-TIME`
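
As an illustration of how BIO tags map to spans, the sketch below groups a `B-` tag and the `I-` tags that follow it into one span. The sentence and tags are invented for the example and use whole English words for readability; the model itself tags subword tokens of Bangla text. `bio_to_spans` is a hypothetical helper, not part of the repository.

```python
def bio_to_spans(words, tags):
    """Group words tagged B-X / I-X into (label, text) spans."""
    spans, current, label = [], [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one.
            if current:
                spans.append((label, " ".join(current)))
            current, label = [word], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag extends the open span of the same type.
            current.append(word)
        else:
            # O (or a stray I- tag) closes any open span.
            if current:
                spans.append((label, " ".join(current)))
            current, label = [], None
    if current:
        spans.append((label, " ".join(current)))
    return spans


words = ["Police", "arrested", "two", "men", "in", "Cox's", "Bazar", "on", "Friday", "night"]
loc_tags = ["O", "O", "O", "O", "O", "B-LOC", "I-LOC", "O", "O", "O"]
time_tags = ["O"] * 8 + ["B-TIME", "I-TIME"]

print(bio_to_spans(words, loc_tags))    # [('LOC', "Cox's Bazar")]
print(bio_to_spans(words, time_tags))   # [('TIME', 'Friday night')]
```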

Training Details

Parameter	Value
Base model	`csebuetnlp/banglabert`
Max sequence length	256
Batch size	16
Learning rate	2e-5
Warmup ratio	0.1
Weight decay	0.01
Loss (event head)	Focal loss (γ=2.0)
Loss (other heads)	Class-weighted cross-entropy
Best val avg-F1	`N/A`
Early-stop epoch	`N/A`
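
For reference, a minimal sketch of multi-class focal loss with γ=2.0 follows. The table above states only the γ value; whether a per-class α weight or a different reduction was used in training is not stated, so this version omits α and averages over the batch. `focal_loss` is a hypothetical helper, not the training code.

```python
import torch
import torch.nn.functional as F


def focal_loss(logits, targets, gamma=2.0):
    """Mean over the batch of (1 - p_t)^gamma * (-log p_t)."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability assigned to the true class of each example.
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    # Confident, correct predictions (pt near 1) are down-weighted.
    return (-((1.0 - pt) ** gamma) * log_pt).mean()


torch.manual_seed(0)
logits = torch.randn(4, 49)               # 4 articles, 49 event classes
targets = torch.tensor([0, 5, 12, 48])    # arbitrary class indices

ce = F.cross_entropy(logits, targets)
fl = focal_loss(logits, targets, gamma=2.0)
print(f"cross-entropy={ce.item():.4f}  focal={fl.item():.4f}")
```

With `gamma=0.0` the expression reduces to plain cross-entropy, which is a quick sanity check for any focal-loss implementation.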

Evaluation Results (Test Set)

Overall average macro-F1 across all heads: 0.8413

is_crime (Binary)

	Precision	Recall	F1	Support
not-crime	—	—	—	—
crime	—	—	—	—

origin (3-class)

Class	Precision	Recall	F1	Support
none	0.90	0.82	0.86	907
international	0.80	0.91	0.85	338
local	0.94	0.95	0.94	1873
macro avg	0.88	0.89	0.88	3118
weighted avg	0.91	0.91	0.91	3118

Accuracy: 0.91
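
The macro and weighted averages in the table can be approximately reproduced from the per-class rows. The sketch below recomputes F1 from the rounded precision and recall values, so the results agree with the table only to two decimal places.

```python
# (precision, recall, support) copied from the origin table above.
rows = {
    "none":          (0.90, 0.82, 907),
    "international": (0.80, 0.91, 338),
    "local":         (0.94, 0.95, 1873),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


f1_by_class = {name: f1(p, r) for name, (p, r, _) in rows.items()}
total = sum(n for _, _, n in rows.values())

# Macro: every class counts equally. Weighted: classes count by support.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)
weighted_f1 = sum(f1_by_class[name] * n for name, (_, _, n) in rows.items()) / total

print(f"macro-F1    = {macro_f1:.2f}")
print(f"weighted-F1 = {weighted_f1:.2f}")
print(f"support     = {total}")
```

The weighted average is higher than the macro average because `local`, the largest class, also has the highest F1.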

event (49-class) — macro-F1: 0.7003 | weighted-F1: 0.7810

Event	Precision	Recall	F1	Support
none	0.91	0.79	0.85	724
armed_attack	0.60	0.60	0.60	40
arms_trafficking	0.75	0.55	0.63	11
arrest	0.80	0.84	0.82	187
arson	0.79	0.93	0.86	29
assault	0.80	0.81	0.80	122
attempted_murder	1.00	0.80	0.89	5
blockade	0.80	0.88	0.84	32
bribery	0.43	0.75	0.55	4
burglary	0.50	0.50	0.50	2
child_abuse	0.83	0.71	0.77	7
corruption	0.74	0.82	0.78	131
cybercrime	0.71	0.75	0.73	61
data_breach	0.71	0.71	0.71	7
drug_trafficking	0.71	0.77	0.74	31
fraud	0.76	0.73	0.75	119
gang_crime	0.80	0.50	0.62	16
hacking	0.25	0.20	0.22	5
human_chain	0.62	0.89	0.73	9
human_trafficking	0.88	0.70	0.78	20
identity_theft	0.00	0.00	0.00	0
kidnapping	0.86	0.86	0.86	21
legal_proceedings	0.80	0.79	0.79	200
looting	0.62	0.57	0.59	14
movement	0.23	1.00	0.38	3
murder	0.84	0.81	0.82	228
online_scam	0.29	0.25	0.27	8
organized_crime	0.00	0.00	0.00	0
other_crime	0.44	0.55	0.49	128
phishing	0.83	1.00	0.91	5
police_action	0.66	0.68	0.67	164
procession	0.75	0.67	0.71	9
protest_unrest	0.79	0.79	0.79	276
raid	0.83	0.91	0.87	22
rally	0.67	0.62	0.65	16
ransomware	1.00	0.50	0.67	2
rape	0.80	0.96	0.87	47
riot	0.41	0.47	0.44	15
robbery	0.80	0.88	0.84	58
sexual_harassment	0.70	0.83	0.76	52
shooting	0.65	0.85	0.74	41
sit_in	0.00	0.00	0.00	0
smuggling	0.83	0.94	0.88	36
snatching	0.85	0.85	0.85	33
stabbing	0.77	0.86	0.81	28
strike	0.92	0.96	0.94	25
terrorism	1.00	0.40	0.57	5
theft	0.78	0.84	0.81	45
vandalism	0.85	0.75	0.79	75
macro avg	0.68	0.69	0.67	3118
weighted avg	0.79	0.78	0.78	3118

Accuracy: 0.78

loc_ner (Location BIO) — macro-F1: 0.7743

Tag	Precision	Recall	F1	Support
O	1.00	0.99	0.99	164,441
B-LOC	0.61	0.90	0.72	1,691
I-LOC	0.49	0.79	0.60	661
macro avg	0.70	0.89	0.77	166,793
weighted avg	0.99	0.99	0.99	166,793

Accuracy: 0.99

time_ner (Time-span BIO) — macro-F1: 0.8499

Tag	Precision	Recall	F1	Support
O	1.00	1.00	1.00	164,825
B-TIME	0.67	0.90	0.77	795
I-TIME	0.69	0.90	0.78	1,173
macro avg	0.79	0.93	0.85	166,793
weighted avg	1.00	0.99	0.99	166,793

Accuracy: 0.99
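
When reading the two NER tables, note that `O` tokens dominate the support counts, so token accuracy is high almost by construction. The short calculation below uses the support values from the tables to show the accuracy a trivial tagger would reach by predicting `O` everywhere; it is an illustration of the class imbalance, not an additional evaluation.

```python
# Support counts copied from the loc_ner and time_ner tables above.
support = {
    "loc_ner":  {"O": 164_441, "B-LOC": 1_691, "I-LOC": 661},
    "time_ner": {"O": 164_825, "B-TIME": 795, "I-TIME": 1_173},
}

baseline = {}
for head, counts in support.items():
    total = sum(counts.values())
    # Accuracy of a tagger that labels every token as O.
    baseline[head] = counts["O"] / total
    print(f"{head}: all-O accuracy = {baseline[head]:.4f} over {total:,} tokens")
```

This is why the per-tag F1 scores and the macro-F1 are more informative than accuracy for these two heads.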

Event Labels (49 classes)

[
  "none",
  "armed_attack",
  "arms_trafficking",
  "arrest",
  "arson",
  "assault",
  "attempted_murder",
  "blockade",
  "bribery",
  "burglary",
  "child_abuse",
  "corruption",
  "cybercrime",
  "data_breach",
  "drug_trafficking",
  "fraud",
  "gang_crime",
  "hacking",
  "human_chain",
  "human_trafficking",
  "identity_theft",
  "kidnapping",
  "legal_proceedings",
  "looting",
  "movement",
  "murder",
  "online_scam",
  "organized_crime",
  "other_crime",
  "phishing",
  "police_action",
  "procession",
  "protest_unrest",
  "raid",
  "rally",
  "ransomware",
  "rape",
  "riot",
  "robbery",
  "sexual_harassment",
  "shooting",
  "sit_in",
  "smuggling",
  "snatching",
  "stabbing",
  "strike",
  "terrorism",
  "theft",
  "vandalism"
]

Origin Labels

[
  "none",
  "international",
  "local"
]
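
Both label lists place `"none"` at index 0, and the inference example further down reports that index as `null`. A minimal sketch of that lookup, using a hypothetical helper named `id_to_label`:

```python
origin_labels = ["none", "international", "local"]


def id_to_label(idx, labels):
    """Map a predicted class index to its label, or None for index 0 ('none')."""
    return None if idx == 0 else labels[idx]


# Reverse lookup, useful when encoding gold labels.
label2id = {name: i for i, name in enumerate(origin_labels)}

print(id_to_label(0, origin_labels))   # None
print(id_to_label(2, origin_labels))   # local
print(label2id["international"])       # 1
```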

Usage

Installation

pip install torch transformers huggingface_hub

Download model files

from huggingface_hub import hf_hub_download

# Download the checkpoint (contains all head weights + label metadata)
ckpt_path = hf_hub_download(repo_id="arafatfahim/crime-event-detection", filename="checkpoint.pt")

Full inference example

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from huggingface_hub import hf_hub_download

REPO_ID = "arafatfahim/crime-event-detection"
DEVICE  = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MAX_LEN = 256


# ── 1. Recreate the model class ──────────────────────────────────────────────
class BanglaBertMultiTask(nn.Module):
    def __init__(self, bert, num_events, num_origins):
        super().__init__()
        self.bert    = bert
        hidden       = self.bert.config.hidden_size
        self.dropout = nn.Dropout(0.1)
        self.crime_head    = nn.Linear(hidden, 2)
        self.event_head    = nn.Linear(hidden, num_events)
        self.direct_head   = nn.Linear(hidden, 2)
        self.origin_head   = nn.Linear(hidden, num_origins)
        self.loc_ner_head  = nn.Linear(hidden, 3)
        self.time_ner_head = nn.Linear(hidden, 3)

    def forward(self, input_ids, attention_mask):
        out     = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        seq_out = self.dropout(out.last_hidden_state)
        cls     = seq_out[:, 0, :]
        return {
            "is_crime" : self.crime_head(cls),
            "event"    : self.event_head(cls),
            "is_direct": self.direct_head(cls),
            "origin"   : self.origin_head(cls),
            "loc_ner"  : self.loc_ner_head(seq_out),
            "time_ner" : self.time_ner_head(seq_out),
        }


# ── 2. Load checkpoint & tokenizer ──────────────────────────────────────────
ckpt_path = hf_hub_download(repo_id=REPO_ID, filename="checkpoint.pt")
ckpt      = torch.load(ckpt_path, map_location=DEVICE)

event_labels  = ckpt["event_labels"]   # list of str
origin_labels = ckpt["origin_labels"]  # ["none", "international", "local"]

tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
bert      = AutoModel.from_pretrained(REPO_ID)

model = BanglaBertMultiTask(bert, ckpt["num_events"], ckpt["num_origins"]).to(DEVICE)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()


# ── 3. Predict ───────────────────────────────────────────────────────────────
headline = "ঢাকায় ছিনতাইয়ের ঘটনায় যুবক গ্রেপ্তার"
content  = "রাতে একটি মোটরসাইকেল থামিয়ে যাত্রীর মোবাইল ও টাকা ছিনিয়ে নেয় দুর্বৃত্তরা।"

enc = tokenizer(
    headline, content,
    max_length=MAX_LEN,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
    return_offsets_mapping=True,
    return_token_type_ids=True,
)
offset_mapping = enc.pop("offset_mapping").squeeze(0).tolist()
token_type_ids = enc["token_type_ids"].squeeze(0).tolist()

with torch.no_grad():
    logits = model(enc["input_ids"].to(DEVICE), enc["attention_mask"].to(DEVICE))

# Classification heads
is_crime  = bool(logits["is_crime"].argmax(dim=-1).item())
is_direct = bool(logits["is_direct"].argmax(dim=-1).item())

event_idx  = logits["event"].argmax(dim=-1).item()
event      = event_labels[event_idx] if event_idx != 0 else None
event_conf = F.softmax(logits["event"], dim=-1).squeeze()[event_idx].item()

origin_idx = logits["origin"].argmax(dim=-1).item()
origin     = origin_labels[origin_idx] if origin_idx != 0 else None

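# NER heads: decode BIO spans (0 = O, 1 = B, 2 = I) from token predictions.
# Each span is sliced from the original text by character offsets, so spaces
# between the words of a multi-word span are preserved.
def decode_bio(preds, offsets, type_ids, texts):
    spans, current = [], None   # current = [segment id, start char, end char]
    for pred, (s, e), tid in zip(preds, offsets, type_ids):
        special = (s == 0 and e == 0)   # [CLS], [SEP], and padding have empty offsets
        if pred == 2 and current is not None and not special and tid == current[0]:
            current[2] = e              # I tag: extend the open span
            continue
        if current is not None:         # any other tag closes the open span
            spans.append(texts[current[0]][current[1]:current[2]])
            current = None
        if pred == 1 and not special and tid < len(texts):
            current = [tid, s, e]       # B tag: open a new span
    if current is not None:
        spans.append(texts[current[0]][current[1]:current[2]])
    return list(dict.fromkeys(spans))   # deduplicate, preserve order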

loc_preds  = logits["loc_ner"].squeeze(0).argmax(dim=-1).tolist()
time_preds = logits["time_ner"].squeeze(0).argmax(dim=-1).tolist()
locations  = decode_bio(loc_preds,  offset_mapping, token_type_ids, [headline, content])
time_spans = decode_bio(time_preds, offset_mapping, token_type_ids, [headline, content])

print({
    "is_crime"      : is_crime,
    "event"         : event,
    "event_conf"    : round(event_conf, 4),
    "is_direct"     : is_direct,
    "origin"        : origin,
    "locations"     : locations,
    "event_occurred": time_spans[0] if time_spans else None,
})

Expected output structure

{
  "is_crime"      : true,
  "event"         : "theft",
  "event_conf"    : 0.9132,
  "is_direct"     : true,
  "origin"        : "local",
  "locations"     : ["ঢাকা"],
  "event_occurred": null
}
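
If more than the single best event is useful, the softmax scores can be ranked to obtain the top-k candidates. The sketch below is one way to do this under the same conventions as the inference example; `top_k_events` is a hypothetical helper, and the four-label list and logits are made up so the block runs without the checkpoint.

```python
import torch
import torch.nn.functional as F


def top_k_events(event_logits, labels, k=3):
    """Return the k most probable (label, probability) pairs for one article."""
    probs = F.softmax(event_logits, dim=-1).squeeze(0)
    conf, idx = probs.topk(min(k, len(labels)))
    return [(labels[i], round(c, 4)) for i, c in zip(idx.tolist(), conf.tolist())]


# Shortened, made-up label list and logits; the real list has 49 entries.
demo_labels = ["none", "theft", "snatching", "robbery"]
demo_logits = torch.tensor([[0.1, 1.0, 3.0, 2.0]])   # shape (1, num_labels)

top2 = top_k_events(demo_logits, demo_labels, k=2)
print(top2)
```

With the real model, the call would be `top_k_events(logits["event"], event_labels)`.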

Files in this repository

File	Description
`config.json`	BERT encoder config (ELECTRA architecture)
`model.safetensors`	BERT encoder weights
`tokenizer_config.json` / `tokenizer.json`	Tokenizer files
`checkpoint.pt`	Full model weights (all 6 heads) + label metadata

Note: checkpoint.pt is required to restore the classification and NER heads. The config.json and model.safetensors files contain only the shared BERT encoder.
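
To see which parameters belong to the encoder and which to the heads, the state-dict keys can be grouped by their first name component. The sketch below assumes the keys follow the attribute names of the `BanglaBertMultiTask` class shown in the usage section (`bert`, `crime_head`, and so on); the key list here is a short illustrative sample, not read from the real checkpoint, and `split_keys` is a hypothetical helper.

```python
HEAD_NAMES = (
    "crime_head", "event_head", "direct_head",
    "origin_head", "loc_ner_head", "time_ner_head",
)


def split_keys(state_dict_keys):
    """Separate encoder parameter names from head parameter names."""
    encoder, heads = [], {}
    for key in state_dict_keys:
        prefix = key.split(".", 1)[0]
        if prefix in HEAD_NAMES:
            heads.setdefault(prefix, []).append(key)
        else:
            encoder.append(key)
    return encoder, heads


# Illustrative sample of key names (assumed, not read from checkpoint.pt).
sample_keys = [
    "bert.embeddings.word_embeddings.weight",
    "crime_head.weight", "crime_head.bias",
    "event_head.weight", "event_head.bias",
    "loc_ner_head.weight", "loc_ner_head.bias",
]

encoder_keys, head_keys = split_keys(sample_keys)
print(len(encoder_keys), "encoder tensors;", sorted(head_keys), "heads found")
```

With the real checkpoint loaded as `ckpt`, the input would be `ckpt["model_state_dict"].keys()`.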
