TinyBERT (4L/312D): Indian Address Parser
A full fine-tune of huawei-noah/TinyBERT_General_4L_312D that parses raw, unstructured Indian address strings into 13 structured fields via token classification (BIO tagging). It is the smallest and fastest of this project's three models: ~14M params, vs. ~77M for flan-t5-small and ~596M for Qwen3-0.6B, yet it scores within ~2-4 points of both on mean field accuracy (see below).
Input: "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
Output: {"houseNumber": "FLAT NO.32", "houseName": "UTTARA TOWERS", "poi": null,
"street": "MG ROAD", "subsubLocality": null, "subLocality": null, "locality": null,
"village": null, "subDistrict": null, "district": "Kamrup", "city": "GUWAHATI",
"state": "AS", "pincode": "781029"}
Unlike the other two models (which generate a JSON string), this model predicts one BIO tag per token, so it always produces a well-formed field dict; there's no "invalid JSON" failure mode to handle downstream.
Usage
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
repo = "gagan1985/tinybert-4l-312d-indian-address-parser"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForTokenClassification.from_pretrained(repo)
address = "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
enc = tokenizer(address, return_tensors="pt", return_offsets_mapping=True, truncation=True, max_length=160)
offsets = enc.pop("offset_mapping")[0].tolist()
with torch.no_grad():
    logits = model(**enc).logits
pred_ids = logits[0].argmax(-1).tolist()
# Reconstruct fields from raw text spans (see extract_fields in inference_tinybert.py
# for the full implementation), never via tokenizer.decode, which would lose casing
# and introduce WordPiece "##"-continuation artifacts.
Or use the included inference_tinybert.py / evaluate_tinybert.py (download
the repo, then python inference_tinybert.py --model . "<address>"). Both are
standalone: only transformers+torch required.
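The span-reconstruction step mentioned in the usage comment can be sketched roughly as follows. This is a minimal illustration, not the repository's extract_fields: the helper name fields_from_bio, the first-span-wins rule, and the assumption that label names look like B-street / I-street are all hypothetical. The offsets and tags in the demo are hand-written, word-level stand-ins rather than real model output.

```python
from typing import Dict, List, Optional, Tuple

FIELDS = [
    "houseNumber", "houseName", "poi", "street", "subsubLocality", "subLocality",
    "locality", "village", "subDistrict", "district", "city", "state", "pincode",
]


def fields_from_bio(
    text: str, offsets: List[Tuple[int, int]], tags: List[str]
) -> Dict[str, Optional[str]]:
    """Slice field values out of the raw text using per-token BIO tags.

    Hypothetical helper: keeps the first span found for each field.
    """
    result: Dict[str, Optional[str]] = {name: None for name in FIELDS}
    spans = []       # finished spans as [field, char_start, char_end]
    current = None   # span currently being extended, or None

    for (tok_start, tok_end), tag in zip(offsets, tags):
        if tok_start == tok_end:
            continue  # special tokens such as [CLS]/[SEP] have empty offsets
        prefix, _, field = tag.partition("-")
        if prefix == "I" and current is not None and current[0] == field:
            current[2] = tok_end  # extend the open span
            continue
        if current is not None:
            spans.append(current)  # any other tag closes the open span
        # B- opens a span; a stray I- is treated as an opening tag too
        current = [field, tok_start, tok_end] if prefix in ("B", "I") else None
    if current is not None:
        spans.append(current)

    for field, start, end in spans:
        if field in result and result[field] is None:
            result[field] = text[start:end]  # raw-text slice keeps original casing
    return result


# Hand-written demo input (not model output).
demo_text = "FLAT NO.32, MG ROAD GUWAHATI 781029"
demo_offsets = [(0, 0), (0, 4), (5, 7), (7, 8), (8, 10), (10, 11),
                (12, 14), (15, 19), (20, 28), (29, 35), (0, 0)]
demo_tags = ["O", "B-houseNumber", "I-houseNumber", "I-houseNumber", "I-houseNumber",
             "O", "B-street", "I-street", "B-city", "B-pincode", "O"]
demo_fields = fields_from_bio(demo_text, demo_offsets, demo_tags)
```

With the usage snippet above, the tags would presumably come from `[model.config.id2label[i] for i in pred_ids]`, paired with `offsets`.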
Also available
- gagan1985/flan-t5-small-indian-address-parser: encoder-decoder, ~77M params, higher accuracy
- gagan1985/qwen3-0.6b-indian-address-parser: causal LM + LoRA, ~596M params, the most accurate of the three
pip install indian-address-parser: all three models behind one interface (AddressParser(backend="tinybert")); source and benchmarks on GitHub: innerkorehq/indian-address-parser
Datasets
- gagan1985/indian-addresses-gold: the gold-labeled training data behind this model
- gagan1985/indian-addresses-raw: the 4.37M-record raw, unlabeled corpus this gold set was drawn from
Fields
houseNumber, houseName, poi, street, subsubLocality, subLocality,
locality, village, subDistrict, district, city, state, pincode
Training data
- 4,110 train / 228 val / 228 test (same split as the t5/qwen models; deduplicated gold-labeled records)
- Gold labels are always copied verbatim from the raw address text, never paraphrased or normalized. A field is null in gold whenever the source text simply doesn't mention it.
- BIO conversion: this project's gold labels are JSON field->value pairs, not
token-level tags. Since gold values are (almost always) verbatim substrings
of the raw text, they convert cleanly back to character spans, then to
per-token BIO labels via the tokenizer's offset mapping. This has a measured
ceiling: 96.68% of gold field values round-trip exactly through
spans->BIO->reconstruction on the training set. The gap comes from two known,
unavoidable cases, neither of which is a training bug: (1) two different
fields sharing the same value with too few distinct occurrences in the text
to assign each its own span, and (2) a token straddling a span boundary on
an already-documented data artifact (glued/duplicated substrings from the
source MCA records, e.g. "PEDANANDIPALLE AGRAHARAMPEDANANDIPALLE AGRAHARAM").
See bio_convert.py on GitHub for the exact algorithm.
Training config
| Parameter | Value |
|---|---|
| Base model | huawei-noah/TinyBERT_General_4L_312D (~14M params, 4 layers, 312 hidden) |
| Fine-tuning | Full fine-tune, token classification head |
| Labels | 27 (O + B-/I- × 13 fields) |
| Epochs | 10 |
| Batch size | 32 |
| Learning rate | 5e-5 |
| Training time | ~2.2 minutes on Apple Silicon (MPS); the tiny model size makes this by far the cheapest of the three to train |
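The label count in the table follows from one O tag plus a B- and an I- tag for each of the 13 fields. A short sketch of building such a label set is shown here; the ordering of labels is an assumption for illustration, and the authoritative mapping is whatever id2label is stored in this model's config.json.

```python
FIELDS = [
    "houseNumber", "houseName", "poi", "street", "subsubLocality", "subLocality",
    "locality", "village", "subDistrict", "district", "city", "state", "pincode",
]

# "O" first, then B-/I- pairs per field (assumed ordering, for illustration only).
LABELS = ["O"] + [f"{prefix}-{field}" for field in FIELDS for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

print(len(LABELS))  # 1 + 2 * 13
```

Mappings like these are typically passed as id2label / label2id (together with num_labels) when loading a base checkpoint with AutoModelForTokenClassification.from_pretrained for fine-tuning.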
Evaluation (228 held-out test samples)
- Overall exact match (all present fields): 10.1%
- Mean per-field accuracy: 78.8% (vs. 80.6% for flan-t5-small, 82.4% for Qwen3-0.6B; a 2-4 point gap for a model 5-40x smaller)
| Field | Accuracy | Recall | Gold presence |
|---|---|---|---|
| pincode | 99.6% | 99.6% | 100.0% |
| district | 92.5% | 92.7% | 65.8% |
| subDistrict | 88.2% | 4.3% | 10.1% |
| houseNumber | 86.8% | 79.8% | 52.2% |
| state | 83.8% | 83.6% | 98.7% |
| houseName | 83.3% | 86.0% | 43.9% |
| city | 82.5% | 77.9% | 67.5% |
| poi | 78.1% | 18.6% | 18.9% |
| village | 70.6% | 0.0% | 22.4% |
| subsubLocality | 70.6% | 43.5% | 27.2% |
| subLocality | 76.8% | 0.0% | 21.9% |
| street | 56.1% | 49.6% | 53.9% |
| locality | 55.7% | 30.5% | 41.7% |
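As a quick arithmetic check, the reported mean per-field accuracy can be recomputed from the Accuracy column, assuming it is the unweighted average over the 13 fields (an assumption, though it does reproduce the reported figure):

```python
# Accuracy column of the table above, in percent.
accuracy = {
    "pincode": 99.6, "district": 92.5, "subDistrict": 88.2, "houseNumber": 86.8,
    "state": 83.8, "houseName": 83.3, "city": 82.5, "poi": 78.1, "village": 70.6,
    "subsubLocality": 70.6, "subLocality": 76.8, "street": 56.1, "locality": 55.7,
}

mean_accuracy = sum(accuracy.values()) / len(accuracy)
print(f"{mean_accuracy:.1f}%")  # 78.8%
```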
Known limitations
- village and subLocality have 0% recall despite reasonably high raw accuracy: the model defaults to null for these far more often than gold does, so its "correct" score comes mostly from gold also being null, not from successfully extracting the field when present. This is the same pattern as poi on the flan-t5-small model card, and a plausible consequence of this model's much smaller capacity (4 layers, 312 hidden) leaving less room for the rarer, more context-dependent fields.
- The same locality/subLocality/subsubLocality/village conceptual overlap noted on the other two model cards applies here too.
- As a BERT-style encoder, this model cannot see beyond 512 tokens, and its uncased WordPiece tokenizer lowercases input before tokenizing. Case information therefore isn't used for classification, though original casing is preserved in the output since fields are extracted from the raw text by character offset, not from decoded tokens.
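To see how a field can combine decent accuracy with 0% recall, here is a toy calculation. The metric definitions are assumptions made for illustration (accuracy counts null == null as correct; recall only looks at samples where gold is present) and may differ in detail from evaluate_tinybert.py. The sample values are made up.

```python
from typing import List, Optional


def field_accuracy(gold: List[Optional[str]], pred: List[Optional[str]]) -> float:
    """Share of samples where prediction equals gold, null == null included."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def field_recall(gold: List[Optional[str]], pred: List[Optional[str]]) -> float:
    """Share of gold-present samples where the prediction matches exactly."""
    present = [(g, p) for g, p in zip(gold, pred) if g is not None]
    if not present:
        return float("nan")
    return sum(g == p for g, p in present) / len(present)


# Ten made-up samples: the field is present in two of them, and the
# hypothetical model always predicts null.
toy_gold = ["RAMPUR", None, None, None, "KHEDA", None, None, None, None, None]
toy_pred = [None] * 10

toy_accuracy = field_accuracy(toy_gold, toy_pred)  # 8 of 10 samples agree
toy_recall = field_recall(toy_gold, toy_pred)      # 0 of 2 present values recovered
```

Under these definitions, a predictor that never emits a rarely present field still gets credit for every sample where gold is null, which is the effect described in the first limitation.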
Administrative fields via pincode gazetteer (optional, additive)
Same non-invasive wrapper as the other two models; see
gazetteer_lookup.py (included). Adds districtAdministrative/
stateAdministrative/cityAdministrative as new, always-populated fields
without touching this model's own verbatim district/state/city output.
On the 228-sample test set: 99.6% administrative-field coverage, with zero
impact on the field accuracy numbers above.
from gazetteer_lookup import load_gazetteer, add_administrative_fields
gazetteer = load_gazetteer("pincodes.csv") # India Post format
result = add_administrative_fields(model_output, gazetteer)
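Conceptually, an additive wrapper of this kind only needs a pincode-keyed lookup. The sketch below is not the code in gazetteer_lookup.py: the function name add_admin_fields_sketch, the in-memory dict layout, and the placeholder gazetteer values are all hypothetical, and the real CSV column names are not assumed here.

```python
from typing import Dict, Optional

# Hypothetical in-memory gazetteer: pincode -> administrative names (placeholder values).
Gazetteer = Dict[str, Dict[str, str]]


def add_admin_fields_sketch(
    parsed: Dict[str, Optional[str]], gazetteer: Gazetteer
) -> Dict[str, Optional[str]]:
    """Return a copy of `parsed` with *Administrative fields added.

    The model's own district/state/city values are left untouched; the new
    fields are None when the pincode is missing or not in the gazetteer.
    """
    enriched = dict(parsed)  # shallow copy, so the model output is not mutated
    pincode = (parsed.get("pincode") or "").strip()
    entry = gazetteer.get(pincode, {})
    for key in ("district", "state", "city"):
        enriched[f"{key}Administrative"] = entry.get(key)
    return enriched


toy_gazetteer: Gazetteer = {
    "781029": {"district": "Example District", "state": "Example State",
               "city": "Example City"},
}
toy_output = {"district": "Kamrup", "city": "GUWAHATI", "state": "AS",
              "pincode": "781029"}
toy_enriched = add_admin_fields_sketch(toy_output, toy_gazetteer)
toy_missing = add_admin_fields_sketch({"pincode": None}, toy_gazetteer)
```

Keeping the lookup in separate keys is what makes the wrapper additive: the verbatim fields and their accuracy numbers are unaffected by whatever the gazetteer returns.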
Files in this repo
- model.safetensors, config.json: the fine-tuned model
- tokenizer.json, tokenizer_config.json: tokenizer
- inference_tinybert.py / evaluate_tinybert.py: standalone CLI scripts, only need transformers+torch
- gazetteer_lookup.py: optional pincode -> district/state/city lookup wrapper (needs a pincodes CSV, not included here; see the GitHub repo for the source used)
License
Apache 2.0, inherited from the base model (huawei-noah/TinyBERT_General_4L_312D).