TinyBERT (4L/312D) — Indian Address Parser

A full fine-tune of huawei-noah/TinyBERT_General_4L_312D that parses raw, unstructured Indian address strings into 13 structured fields via token classification (BIO tagging). It is the smallest and fastest of this project's three models: ~14M params, vs. ~77M for flan-t5-small and ~596M for Qwen3-0.6B, yet it scores within ~2-4 points of both on mean per-field accuracy (see Evaluation below).

Input:  "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
Output: {"houseNumber": "FLAT NO.32", "houseName": "UTTARA TOWERS", "poi": null,
         "street": "MG ROAD", "subsubLocality": null, "subLocality": null, "locality": null,
         "village": null, "subDistrict": null, "district": "Kamrup", "city": "GUWAHATI",
         "state": "AS", "pincode": "781029"}

Unlike the other two models (which generate a JSON string), this model predicts one BIO tag per token — it always produces a well-formed field dict; there's no "invalid JSON" failure mode to handle downstream.

Usage

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

repo = "gagan1985/tinybert-4l-312d-indian-address-parser"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForTokenClassification.from_pretrained(repo)

address = "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
enc = tokenizer(address, return_tensors="pt", return_offsets_mapping=True, truncation=True, max_length=160)
offsets = enc.pop("offset_mapping")[0].tolist()
with torch.no_grad():
    logits = model(**enc).logits
pred_ids = logits[0].argmax(-1).tolist()

# reconstruct fields from raw text spans (see inference_tinybert.py's extract_fields
# for the full implementation) — never tokenizer.decode, which would lose casing
# and introduce WordPiece "##"-continuation artifacts

Or use the included inference_tinybert.py / evaluate_tinybert.py (download the repo, then python inference_tinybert.py --model . "<address>"). Both are standalone — only transformers+torch required.
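
The reconstruction step that the usage snippet leaves as a comment could look roughly like the sketch below. This is not the repo's extract_fields: the name extract_fields_sketch, the B-<field>/I-<field> tag spelling, and the rule of keeping only the first span per field are assumptions made for illustration. The point it shows is that values are sliced out of the raw text by character offset, so casing survives and no "##" pieces appear.

```python
from typing import Dict, List, Optional, Tuple

FIELDS = [
    "houseNumber", "houseName", "poi", "street", "subsubLocality", "subLocality",
    "locality", "village", "subDistrict", "district", "city", "state", "pincode",
]


def extract_fields_sketch(
    text: str,
    offsets: List[Tuple[int, int]],
    tags: List[str],
) -> Dict[str, Optional[str]]:
    """Merge per-token BIO tags into one raw-text span per field (hypothetical helper)."""
    spans: Dict[str, List[int]] = {}
    current: Optional[str] = None  # field whose span is currently open
    for (start, end), tag in zip(offsets, tags):
        # Special tokens such as [CLS]/[SEP] carry an empty (0, 0) offset.
        if start == end or tag == "O":
            current = None
            continue
        prefix, field = tag.split("-", 1)
        if prefix == "I" and field == current:
            spans[field][1] = end          # extend the open span
        elif field not in spans:
            spans[field] = [start, end]    # open the first span for this field
            current = field
        else:
            current = None                 # assumption: later repeats are ignored
    return {f: text[spans[f][0]:spans[f][1]] if f in spans else None for f in FIELDS}


# Hand-written offsets and tags standing in for real tokenizer/model output.
demo_text = "FLAT NO.32, MG ROAD 781029"
demo_offsets = [(0, 0), (0, 4), (5, 7), (7, 8), (8, 10), (10, 11),
                (12, 14), (15, 19), (20, 26), (0, 0)]
demo_tags = ["O", "B-houseNumber", "I-houseNumber", "I-houseNumber", "I-houseNumber",
             "O", "B-street", "I-street", "B-pincode", "O"]
demo = extract_fields_sketch(demo_text, demo_offsets, demo_tags)
```

With the variables from the usage snippet, the tag strings would come from the model config, e.g. tags = [model.config.id2label[i] for i in pred_ids], assuming the repo's config.json carries label names in this form.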

Also available

gagan1985/flan-t5-small-indian-address-parser — encoder-decoder, ~77M params, higher accuracy
gagan1985/qwen3-0.6b-indian-address-parser — causal LM + LoRA, ~596M params, the most accurate of the three
pip install indian-address-parser — all three models behind one interface (AddressParser(backend="tinybert")), source and benchmarks on GitHub: innerkorehq/indian-address-parser

Datasets

gagan1985/indian-addresses-gold — the gold-labeled training data behind this model
gagan1985/indian-addresses-raw — the 4.37M-record raw, unlabeled corpus this gold set was drawn from

Fields

houseNumber, houseName, poi, street, subsubLocality, subLocality,
locality, village, subDistrict, district, city, state, pincode

Training data

4,110 train / 228 val / 228 test (same split as the t5/qwen models — deduplicated gold-labeled records)
Gold labels are always copied verbatim from the raw address text — never paraphrased or normalized. A field is null in gold whenever the source text simply doesn't mention it.
BIO conversion: this project's gold labels are JSON field->value pairs, not token-level tags. Since gold values are (almost always) verbatim substrings of the raw text, they convert cleanly back to character spans, then to per-token BIO labels via the tokenizer's offset mapping. This has a measured ceiling: 96.68% of gold field values round-trip exactly through spans->BIO->reconstruction on the training set. The remaining 3.32% falls into two known, unavoidable cases, neither of which is a training bug: (1) two different fields share the same value and the text contains too few distinct occurrences to give each its own span, and (2) a token straddles a span boundary on an already-documented data artifact (glued/duplicated substrings from the source MCA records, e.g. "PEDANANDIPALLE AGRAHARAMPEDANANDIPALLE AGRAHARAM"). See bio_convert.py on GitHub for the exact algorithm.
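
A minimal sketch of the value -> character span -> BIO label conversion described above, together with the round-trip check. It is a simplification and not the algorithm in bio_convert.py: the helper names are hypothetical, a regex split stands in for the tokenizer's offset mapping, and each value is matched at its first occurrence only.

```python
import re
from typing import Dict, List, Optional, Tuple

Offsets = List[Tuple[int, int]]


def simple_offsets(text: str) -> Offsets:
    """Stand-in for a tokenizer offset mapping: words and punctuation marks."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


def gold_to_bio_sketch(text: str, gold: Dict[str, Optional[str]], offsets: Offsets) -> List[str]:
    """Label every token that lies fully inside a gold value's character span."""
    labels = ["O"] * len(offsets)
    for field, value in gold.items():
        if not value:
            continue                      # null field: nothing to label
        start = text.find(value)          # assumption: first occurrence only
        if start < 0:
            continue                      # not a verbatim substring
        end = start + len(value)
        first = True
        for i, (tok_start, tok_end) in enumerate(offsets):
            inside = start <= tok_start and tok_end <= end
            if inside and labels[i] == "O":   # never overwrite another field
                labels[i] = ("B-" if first else "I-") + field
                first = False
    return labels


def reconstruct(text: str, labels: List[str], offsets: Offsets, field: str) -> Optional[str]:
    """Rebuild a field value from its labeled tokens, or None if it has none."""
    idx = [i for i, lab in enumerate(labels) if lab[2:] == field]
    if not idx:
        return None
    return text[offsets[idx[0]][0]:offsets[idx[-1]][1]]


text = "FLAT NO.32, MG ROAD GUWAHATI AS 781029"
gold = {"houseNumber": "FLAT NO.32", "street": "MG ROAD", "city": "GUWAHATI",
        "state": "AS", "pincode": "781029", "village": None}
offsets = simple_offsets(text)
labels = gold_to_bio_sketch(text, gold, offsets)
round_trip = {f: reconstruct(text, labels, offsets, f) == v for f, v in gold.items() if v}

# Case (1) from the text: two fields share a value that occurs only once.
shared_gold = {"city": "GUWAHATI", "district": "GUWAHATI"}
shared_labels = gold_to_bio_sketch(text, shared_gold, offsets)
district_back = reconstruct(text, shared_labels, offsets, "district")
```

In the first example every non-null field round-trips. In the shared-value example the single occurrence is claimed by city, so district gets no tokens and cannot be reconstructed, which is the kind of loss the 96.68% ceiling accounts for.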

Training config

Parameter	Value
Base model	huawei-noah/TinyBERT_General_4L_312D (~14M params, 4 layers, 312 hidden)
Fine-tuning	Full fine-tune, token classification head
Labels	27 (`O` + `B-`/`I-` × 13 fields)
Epochs	10
Batch size	32
Learning rate	5e-5
Training time	~2.2 minutes on Apple Silicon (MPS) — the tiny model size makes this by far the cheapest of the three to train
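
For reference, the 27-label count in the table follows from one O tag plus a B-/I- pair for each of the 13 fields. The sketch below builds such a label list; the ordering and the resulting ids are assumptions, and the authoritative mapping is whatever id2label the repo's config.json ships.

```python
FIELDS = [
    "houseNumber", "houseName", "poi", "street", "subsubLocality", "subLocality",
    "locality", "village", "subDistrict", "district", "city", "state", "pincode",
]

# "O" first, then B-/I- for each field (illustrative order, not the repo's).
LABELS = ["O"] + [f"{prefix}-{field}" for field in FIELDS for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}
```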

Evaluation (228 held-out test samples)

Overall exact match (all present fields): 10.1%
Mean per-field accuracy: 78.8% (vs. 80.6% for flan-t5-small, 82.4% for Qwen3-0.6B — a 2-4 point gap for a model 5-40x smaller)

Field	Accuracy	Recall	Gold presence
pincode	99.6%	99.6%	100.0%
district	92.5%	92.7%	65.8%
subDistrict	88.2%	4.3%	10.1%
houseNumber	86.8%	79.8%	52.2%
state	83.8%	83.6%	98.7%
houseName	83.3%	86.0%	43.9%
city	82.5%	77.9%	67.5%
poi	78.1%	18.6%	18.9%
subLocality	76.8%	0.0%	21.9%
village	70.6%	0.0%	22.4%
subsubLocality	70.6%	43.5%	27.2%
street	56.1%	49.6%	53.9%
locality	55.7%	30.5%	41.7%
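
One way to read the Accuracy and Recall columns is sketched below. The definitions are assumptions inferred from this card (accuracy counts a sample as correct when prediction and gold agree, including both being null; recall looks only at samples where gold is non-null), and evaluate_tinybert.py may differ in detail. The toy data is invented purely to show how a field can score high accuracy with 0% recall.

```python
from typing import Dict, List, Optional


def field_metrics(gold: List[Optional[str]], pred: List[Optional[str]]) -> Dict[str, float]:
    """Per-field accuracy, recall, and gold presence under the assumed definitions."""
    n = len(gold)
    correct = sum(g == p for g, p in zip(gold, pred))
    present = [(g, p) for g, p in zip(gold, pred) if g is not None]
    recalled = sum(g == p for g, p in present)
    return {
        "accuracy": correct / n,
        "recall": recalled / len(present) if present else float("nan"),
        "gold_presence": len(present) / n,
    }


# Toy example: the field is present in 2 of 10 gold records, and the
# (hypothetical) model predicts null every time.
toy_gold = ["RAMPUR", None, None, None, "KHERA", None, None, None, None, None]
toy_pred = [None] * 10
toy = field_metrics(toy_gold, toy_pred)
```

Here accuracy is 0.8 while recall is 0.0: every "correct" answer is a null that matches a null in gold, and none of the present values is extracted.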

Known limitations

village and subLocality have 0% recall despite reasonably high raw accuracy — the model defaults to null for these far more often than gold does, so its "correct" score comes mostly from gold also being null, not from successfully extracting the field when present. Same pattern as poi on the flan-t5-small model card, and a plausible consequence of this model's much smaller capacity (4 layers, 312 hidden) having less room for the rarer, more context-dependent fields.
Same locality/subLocality/subsubLocality/village conceptual overlap noted on the other two model cards applies here too.
As a BERT-style encoder, this model cannot see beyond 512 tokens, and its tokenizer (like other uncased WordPiece BERT tokenizers) lowercases input before tokenizing, so case information isn't used for classification. Original casing is still preserved in the output, since fields are extracted from the raw text by character offset, not from decoded tokens.

Administrative fields via pincode gazetteer (optional, additive)

Same non-invasive wrapper as the other two models; see gazetteer_lookup.py (included). It adds districtAdministrative/stateAdministrative/cityAdministrative as new, always-populated fields without touching this model's own verbatim district/state/city output. On the 228-sample test set: 99.6% administrative-field coverage, with zero impact on the field accuracy numbers above.

from gazetteer_lookup import load_gazetteer, add_administrative_fields

gazetteer = load_gazetteer("pincodes.csv")  # India Post format
result = add_administrative_fields(model_output, gazetteer)
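
To make the "additive" behaviour concrete, here is a toy sketch of what such a wrapper might do. It is not the code in gazetteer_lookup.py: the function name, the in-memory dict standing in for the CSV, the placeholder pincode and names, and the choice to fill None when a pincode is unknown are all assumptions.

```python
from typing import Dict, Optional

Fields = Dict[str, Optional[str]]

# Hypothetical gazetteer with placeholder data: pincode -> administrative names.
TOY_GAZETTEER: Dict[str, Dict[str, str]] = {
    "000000": {"district": "Example District", "state": "Example State", "city": "Example City"},
}


def add_administrative_fields_sketch(fields: Fields, gazetteer: Dict[str, Dict[str, str]]) -> Fields:
    """Return a copy of the parser output with three extra *Administrative keys."""
    out = dict(fields)                                  # the model's own fields stay as-is
    entry = gazetteer.get(fields.get("pincode") or "", {})
    for key in ("district", "state", "city"):
        out[key + "Administrative"] = entry.get(key)    # None if the pincode is unknown
    return out


parsed = {"district": "EXAMPLE DIST", "city": None, "state": "EX", "pincode": "000000"}
enriched = add_administrative_fields_sketch(parsed, TOY_GAZETTEER)
```

The verbatim district/state/city values are copied through unchanged and the lookup result lands in separate keys, which is why the wrapper cannot affect the field accuracy numbers.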

Files in this repo

model.safetensors, config.json — the fine-tuned model
tokenizer.json, tokenizer_config.json — tokenizer
inference_tinybert.py / evaluate_tinybert.py — standalone CLI scripts, only need transformers+torch
gazetteer_lookup.py — optional pincode → district/state/city lookup wrapper (needs a pincodes CSV, not included here — see the GitHub repo for the source used)

License

Apache 2.0, inherited from the base model (huawei-noah/TinyBERT_General_4L_312D).
