TinyBERT (4L/312D): Indian Address Parser

A full fine-tune of huawei-noah/TinyBERT_General_4L_312D that parses raw, unstructured Indian address strings into 13 structured fields via token classification (BIO tagging). It is the smallest and fastest of this project's three models: ~14M params, vs. ~77M for flan-t5-small and ~596M for Qwen3-0.6B, yet it scores within ~2-4 points of both on mean field accuracy (see below).

Input:  "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
Output: {"houseNumber": "FLAT NO.32", "houseName": "UTTARA TOWERS", "poi": null,
         "street": "MG ROAD", "subsubLocality": null, "subLocality": null, "locality": null,
         "village": null, "subDistrict": null, "district": "Kamrup", "city": "GUWAHATI",
         "state": "AS", "pincode": "781029"}

Unlike the other two models (which generate a JSON string), this model predicts one BIO tag per token, so it always produces a well-formed field dict; there is no "invalid JSON" failure mode to handle downstream.

Usage

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

repo = "gagan1985/tinybert-4l-312d-indian-address-parser"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForTokenClassification.from_pretrained(repo)

address = "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
enc = tokenizer(address, return_tensors="pt", return_offsets_mapping=True, truncation=True, max_length=160)
offsets = enc.pop("offset_mapping")[0].tolist()
with torch.no_grad():
    logits = model(**enc).logits
pred_ids = logits[0].argmax(-1).tolist()

# Reconstruct fields from raw text spans (see extract_fields in inference_tinybert.py
# for the full implementation). Never use tokenizer.decode, which would lose casing
# and introduce WordPiece "##"-continuation artifacts.

Or use the included inference_tinybert.py / evaluate_tinybert.py (download the repo, then run python inference_tinybert.py --model . "<address>"). Both are standalone and require only transformers and torch.
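The span reconstruction step mentioned in the usage comments can be sketched roughly as follows. This is a minimal illustration, not the repo's extract_fields: the function name spans_to_fields, the "first span wins" rule, and the lenient handling of a stray I- tag are all assumptions of this sketch. The offsets and tags are hand-written so the example runs without downloading the model.

```python
from typing import Dict, List, Optional, Tuple

FIELDS = ["houseNumber", "houseName", "poi", "street", "subsubLocality", "subLocality",
          "locality", "village", "subDistrict", "district", "city", "state", "pincode"]


def spans_to_fields(text: str,
                    offsets: List[Tuple[int, int]],
                    tags: List[str]) -> Dict[str, Optional[str]]:
    """Merge consecutive B-/I- tokens of one field into a slice of the raw text."""
    spans = []      # finished (field, start, end) character spans
    current = None  # span being built: [field, start, end]
    for (start, end), tag in zip(offsets, tags):
        if start == end:
            continue  # special tokens ([CLS], [SEP]) have empty offsets
        prefix, _, name = tag.partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current[0] != name))
        if starts_new:
            # Assumption: an I- tag with no matching open span opens a new one.
            if current is not None:
                spans.append(tuple(current))
            current = [name, start, end]
        elif prefix == "I":
            current[2] = end  # extend the open span to cover this token
        else:  # "O" closes any open span
            if current is not None:
                spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))

    fields: Dict[str, Optional[str]] = {name: None for name in FIELDS}
    for name, start, end in spans:
        if fields.get(name) is None:  # assumption: keep the first span per field
            fields[name] = text[start:end]  # raw slice, so casing is preserved
    return fields


text = "FLAT NO.32, MG ROAD GUWAHATI 781029"
offsets = [(0, 0), (0, 4), (5, 10), (10, 11), (12, 14), (15, 19), (20, 28), (29, 35), (0, 0)]
tags = ["O", "B-houseNumber", "I-houseNumber", "O", "B-street", "I-street",
        "B-city", "B-pincode", "O"]
result = spans_to_fields(text, offsets, tags)
print(result["houseNumber"], "|", result["street"], "|", result["city"], "|", result["pincode"])
```

With the usage snippet above, tags could be built as [model.config.id2label[i] for i in pred_ids], assuming the released config stores the label names.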

Fields

houseNumber, houseName, poi, street, subsubLocality, subLocality,
locality, village, subDistrict, district, city, state, pincode

Training data

  • 4,110 train / 228 val / 228 test (same split as the t5/qwen models; deduplicated gold-labeled records)
  • Gold labels are always copied verbatim from the raw address text, never paraphrased or normalized. A field is null in gold whenever the source text simply doesn't mention it.
  • BIO conversion: this project's gold labels are JSON field->value pairs, not token-level tags. Since gold values are (almost always) verbatim substrings of the raw text, they convert cleanly back to character spans, then to per-token BIO labels via the tokenizer's offset mapping. This has a measured ceiling: 96.68% of gold field values round-trip exactly through spans->BIO->reconstruction on the training set. The gap comes from two known, unavoidable cases: (1) two different fields sharing the same value with too few distinct occurrences in the text to assign each its own span, and (2) a token straddling a span boundary on an already-documented data artifact (glued/duplicated substrings from the source MCA records, e.g. "PEDANANDIPALLE AGRAHARAMPEDANANDIPALLE AGRAHARAM"). Neither is a training bug. See bio_convert.py on GitHub for the exact algorithm.
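To make the field->value to BIO conversion concrete, here is a minimal sketch under stated assumptions. It is not bio_convert.py: the function names, the left-to-right search for an unclaimed occurrence, and the rule that only tokens lying fully inside a span get labeled are all choices made for this illustration. Offsets are written by hand in place of a real tokenizer's offset mapping.

```python
from typing import Dict, List, Optional, Tuple


def find_unclaimed(text: str, value: str,
                   claimed: List[Tuple[int, int]]) -> Optional[Tuple[int, int]]:
    """Return the first occurrence of `value` that overlaps no claimed span."""
    pos = text.find(value)
    while pos != -1:
        end = pos + len(value)
        if all(end <= c_start or pos >= c_end for c_start, c_end in claimed):
            return pos, end
        pos = text.find(value, pos + 1)
    return None


def gold_to_bio(text: str,
                gold: Dict[str, Optional[str]],
                offsets: List[Tuple[int, int]]) -> List[str]:
    """Convert verbatim gold values to one BIO label per token."""
    labels = ["O"] * len(offsets)
    claimed: List[Tuple[int, int]] = []
    for field, value in gold.items():
        if not value:
            continue  # null field: nothing to label
        span = find_unclaimed(text, value, claimed)
        if span is None:
            continue  # not a verbatim substring, or every occurrence is taken
        claimed.append(span)
        first = True
        for i, (start, end) in enumerate(offsets):
            if start == end:
                continue  # special token
            if start >= span[0] and end <= span[1]:  # token fully inside the span
                labels[i] = ("B-" if first else "I-") + field
                first = False
    return labels


text = "FLAT NO.32, MG ROAD GUWAHATI 781029"
offsets = [(0, 0), (0, 4), (5, 10), (10, 11), (12, 14), (15, 19), (20, 28), (29, 35), (0, 0)]
gold = {"houseNumber": "FLAT NO.32", "poi": None, "street": "MG ROAD",
        "city": "GUWAHATI", "pincode": "781029"}
labels = gold_to_bio(text, gold, offsets)
print(labels)

# Case (1) from the text: two fields share a value but the text has one occurrence.
shared = gold_to_bio("PUNE", {"city": "PUNE", "district": "PUNE"}, [(0, 4)])
print(shared)
```

In the second call only one of the two fields can claim the single occurrence, which is the kind of loss the 96.68% round-trip ceiling accounts for.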

Training config

Parameter       Value
Base model      huawei-noah/TinyBERT_General_4L_312D (~14M params, 4 layers, 312 hidden)
Fine-tuning     Full fine-tune, token classification head
Labels          27 (O + B-/I- × 13 fields)
Epochs          10
Batch size      32
Learning rate   5e-5
Training time   ~2.2 minutes on Apple Silicon (MPS); the tiny model size makes this by far the cheapest of the three to train
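The label count and the rough size of the training run follow directly from the table. The sketch below reproduces that arithmetic. The label ordering is an assumption (the released config.json defines the real mapping), the hyperparameter keys simply borrow the transformers TrainingArguments names for readability, and the step count assumes no gradient accumulation and that the last partial batch is kept.

```python
import math

FIELDS = ["houseNumber", "houseName", "poi", "street", "subsubLocality", "subLocality",
          "locality", "village", "subDistrict", "district", "city", "state", "pincode"]

# One "O" label plus a B- and an I- label per field. Ordering is an assumption.
LABELS = ["O"] + [f"{prefix}-{field}" for field in FIELDS for prefix in ("B", "I")]
LABEL2ID = {label: idx for idx, label in enumerate(LABELS)}
ID2LABEL = {idx: label for label, idx in LABEL2ID.items()}

# Values from the table above.
HYPERPARAMS = {
    "num_train_epochs": 10,
    "per_device_train_batch_size": 32,
    "learning_rate": 5e-5,
}

TRAIN_SIZE = 4110  # training records, from the "Training data" section
steps_per_epoch = math.ceil(TRAIN_SIZE / HYPERPARAMS["per_device_train_batch_size"])
total_steps = steps_per_epoch * HYPERPARAMS["num_train_epochs"]

print(len(LABELS), steps_per_epoch, total_steps)  # 27 129 1290
```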

Evaluation (228 held-out test samples)

  • Overall exact match (all present fields): 10.1%
  • Mean per-field accuracy: 78.8% (vs. 80.6% for flan-t5-small, 82.4% for Qwen3-0.6B; a 2-4 point gap for a model roughly 5-40x smaller)

Field           Accuracy  Recall  Gold presence
pincode         99.6%     99.6%   100.0%
district        92.5%     92.7%   65.8%
subDistrict     88.2%     4.3%    10.1%
houseNumber     86.8%     79.8%   52.2%
state           83.8%     83.6%   98.7%
houseName       83.3%     86.0%   43.9%
city            82.5%     77.9%   67.5%
poi             78.1%     18.6%   18.9%
subLocality     76.8%     0.0%    21.9%
village         70.6%     0.0%    22.4%
subsubLocality  70.6%     43.5%   27.2%
street          56.1%     49.6%   53.9%
locality        55.7%     30.5%   41.7%
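One plausible reading of the three columns, sketched below, is that accuracy counts a null prediction against a null gold value as correct, recall is measured only over records where gold is present, and gold presence is the share of records with a non-null gold value. These definitions are an assumption of this sketch (evaluate_tinybert.py is the authoritative source), and field_metrics is a hypothetical helper. The toy data shows how a field can score high accuracy with zero recall.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def field_metrics(golds: List[Record], preds: List[Record],
                  field: str) -> Dict[str, Optional[float]]:
    """Accuracy over all records, recall over gold-present records only."""
    total = len(golds)
    correct = sum(g.get(field) == p.get(field) for g, p in zip(golds, preds))
    present = [(g, p) for g, p in zip(golds, preds) if g.get(field) is not None]
    recalled = sum(g[field] == p.get(field) for g, p in present)
    return {
        "accuracy": correct / total,
        "recall": recalled / len(present) if present else None,
        "gold_presence": len(present) / total,
    }


# Toy data: the model always predicts null for "village".
golds = [{"village": None}, {"village": None}, {"village": None}, {"village": "RAMPUR"}]
preds = [{"village": None}, {"village": None}, {"village": None}, {"village": None}]
metrics = field_metrics(golds, preds, "village")
print(metrics)  # {'accuracy': 0.75, 'recall': 0.0, 'gold_presence': 0.25}

# The reported mean is the plain average of the 13 accuracies in the table.
ACCURACIES = [99.6, 92.5, 88.2, 86.8, 83.8, 83.3, 82.5, 78.1, 76.8, 70.6, 70.6, 56.1, 55.7]
mean_accuracy = round(sum(ACCURACIES) / len(ACCURACIES), 1)
print(mean_accuracy)  # 78.8
```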

Known limitations

  • village and subLocality have 0% recall despite reasonably high raw accuracy: the model defaults to null for these far more often than gold does, so its "correct" score comes mostly from gold also being null, not from successfully extracting the field when present. subDistrict (4.3% recall) and poi (18.6% recall) show a milder version of the same pattern. This mirrors poi on the flan-t5-small model card, and is a plausible consequence of this model's much smaller capacity (4 layers, 312 hidden) having less room for the rarer, more context-dependent fields.
  • Same locality/subLocality/subsubLocality/village conceptual overlap noted on the other two model cards applies here too.
  • As a BERT-style encoder, this model cannot see beyond 512 tokens, and (like other uncased WordPiece BERT tokenizers) its tokenizer lowercases input before tokenizing. Case information is therefore not used for classification, though original casing is preserved in the output because fields are extracted from the raw text by character offset, not from decoded tokens.

Administrative fields via pincode gazetteer (optional, additive)

Same non-invasive wrapper as the other two models; see gazetteer_lookup.py (included). It adds districtAdministrative/stateAdministrative/cityAdministrative as new, always-populated fields without touching this model's own verbatim district/state/city output. On the 228-sample test set: 99.6% administrative-field coverage, with zero impact on the field accuracy numbers above.

from gazetteer_lookup import load_gazetteer, add_administrative_fields

gazetteer = load_gazetteer("pincodes.csv")  # India Post format
result = add_administrative_fields(model_output, gazetteer)
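For intuition only, an additive lookup of this kind might behave like the sketch below. It does not reproduce gazetteer_lookup.py: add_admin_fields_sketch, the in-memory dict keyed by pincode, the placeholder values, and the choice to emit None when a pincode is missing are all hypothetical.

```python
from typing import Dict, Optional

ADMIN_KEYS = ("district", "state", "city")


def add_admin_fields_sketch(parsed: Dict[str, Optional[str]],
                            gazetteer: Dict[str, Dict[str, str]]) -> Dict[str, Optional[str]]:
    """Return a copy of `parsed` with three extra *Administrative keys."""
    enriched = dict(parsed)  # copy, so the model's verbatim output is untouched
    row = gazetteer.get(parsed.get("pincode") or "", {})
    for key in ADMIN_KEYS:
        # Assumption: a pincode missing from the gazetteer yields None.
        enriched[f"{key}Administrative"] = row.get(key)
    return enriched


# Placeholder gazetteer row; real values would come from the pincodes CSV.
toy_gazetteer = {"781029": {"district": "DISTRICT_FROM_CSV",
                            "state": "STATE_FROM_CSV",
                            "city": "CITY_FROM_CSV"}}
parsed = {"district": "Kamrup", "city": "GUWAHATI", "state": "AS", "pincode": "781029"}
enriched = add_admin_fields_sketch(parsed, toy_gazetteer)
print(enriched["district"], "|", enriched["districtAdministrative"])
```

Because the new keys are added alongside the original ones, the accuracy numbers computed on district/state/city are unaffected by the lookup.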

Files in this repo

  • model.safetensors, config.json: the fine-tuned model
  • tokenizer.json, tokenizer_config.json: tokenizer
  • inference_tinybert.py / evaluate_tinybert.py: standalone CLI scripts, only need transformers+torch
  • gazetteer_lookup.py: optional pincode → district/state/city lookup wrapper (needs a pincodes CSV, not included here; see the GitHub repo for the source used)

License

Apache 2.0, inherited from the base model (huawei-noah/TinyBERT_General_4L_312D).
