flan-t5-small — Indian Address Parser

A full fine-tune of google/flan-t5-small that parses raw, unstructured Indian address strings into 13 structured fields, output as JSON. This is a smaller, encoder-decoder alternative to gagan1985/qwen3-0.6b-indian-address-parser (~77M trainable params here vs. a 596M-param causal LM there) — useful where a compact, CPU-friendly model matters more than the last few points of accuracy.

Input:  "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
Output: {"houseNumber": "FLAT NO.32", "houseName": "UTTARA TOWERS", "poi": null,
         "street": "MG ROAD", "subsubLocality": null, "subLocality": null, "locality": null,
         "village": null, "subDistrict": null, "district": "Kamrup", "city": "GUWAHATI",
         "state": "AS", "pincode": "781029"}

Usage

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

repo = "gagan1985/flan-t5-small-indian-address-parser"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSeq2SeqLM.from_pretrained(repo)

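address = "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
# Prepend the task prefix the model was fine-tuned with, then tokenize.
inputs = tokenizer("parse indian address: " + address, return_tensors="pt")
# Greedy decoding: a single beam and no sampling, so output is deterministic.
out = model.generate(**inputs, max_length=200, num_beams=1, do_sample=False)
# Decode the generated token IDs back into the JSON string.
print(tokenizer.decode(out[0], skip_special_tokens=True))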

No chat template or system prompt — this is a plain seq2seq model, prompted with a short task prefix ("parse indian address: ") the way T5 models expect.

Or use the included inference_t5.py / evaluate_t5.py (download the repo, then python inference_t5.py --model . "<address>"). Both are standalone — only transformers+torch required.
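
The Usage steps above can also be wrapped in a small helper that returns a Python dict instead of a raw string. This is a minimal sketch, not the logic of the included inference_t5.py: the names build_prompt, to_record, and parse_address are hypothetical, and model and tokenizer are assumed to be the objects loaded in the Usage snippet.

```python
import json

TASK_PREFIX = "parse indian address: "  # task prefix from the Usage snippet


def build_prompt(address: str) -> str:
    """Prepend the task prefix to a raw address string."""
    return TASK_PREFIX + address


def to_record(decoded: str):
    """Parse decoded model text as JSON; return None unless it is a JSON object."""
    try:
        record = json.loads(decoded)
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) else None


def parse_address(address: str, model, tokenizer, max_length: int = 200):
    """Run greedy generation for one address and return a dict (or None)."""
    inputs = tokenizer(build_prompt(address), return_tensors="pt")
    out = model.generate(**inputs, max_length=max_length, num_beams=1, do_sample=False)
    return to_record(tokenizer.decode(out[0], skip_special_tokens=True))
```

Returning None on malformed output keeps the caller's error handling explicit, even though the reported JSON parse rate on the test set is 100%.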

Note on a load-time warning: on load you will see "...specifies to tie shared.weight to lm_head.weight, but both are present in the checkpoints with different values, so we will NOT tie them." This is expected and harmless. The two weight matrices are genuinely different: the base checkpoint's config claims tied embeddings, but this fine-tune's config.json overrides that with tie_word_embeddings: false. transformers loads both tensors correctly despite the noisy warning.

Also available

gagan1985/qwen3-0.6b-indian-address-parser — larger causal-LM alternative, LoRA fine-tuned, ~2 points higher mean field accuracy
pip install indian-address-parser: the Qwen3 model packaged as a Python library. The source code and a benchmark against shiprocket-ai/open-tinybert-indian-address-ner are on GitHub at innerkorehq/indian-address-parser

Datasets

gagan1985/indian-addresses-gold — the gold-labeled training data behind this model
gagan1985/indian-addresses-raw — the 4.37M-record raw, unlabeled corpus this gold set was drawn from

Fields

houseNumber, houseName, poi, street, subsubLocality, subLocality,
locality, village, subDistrict, district, city, state, pincode
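
Since every output is expected to carry exactly these 13 keys, a downstream consumer may want to check that before trusting a record. A minimal sketch, assuming the output has already been parsed into a dict; the helper name check_fields is hypothetical and not part of the repo's scripts.

```python
FIELDS = (
    "houseNumber", "houseName", "poi", "street", "subsubLocality", "subLocality",
    "locality", "village", "subDistrict", "district", "city", "state", "pincode",
)


def check_fields(record: dict) -> dict:
    """Report keys that are missing from, or unexpected in, a parsed record."""
    keys = set(record)
    return {
        "missing": [f for f in FIELDS if f not in keys],
        "unexpected": sorted(keys - set(FIELDS)),
    }
```

A record with all 13 keys, null or not, yields two empty lists.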

Training data

4,566 gold-labeled records after deduplication, split 4,110 train / 228 val / 228 test
Gold labels are always copied verbatim from the raw address text — never paraphrased or normalized. A field is null in gold whenever the source text simply doesn't mention it, not because the true value is unknown.
Sourced from two distinct raw formats: Indian MCA (Ministry of Corporate Affairs) company-registration addresses, and bank/business-correspondent branch addresses
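
The verbatim rule can be checked mechanically: every non-null value should occur as a substring of the raw address. The sketch below assumes exact, case-sensitive substring matching, which is an assumption about how strictly "verbatim" is meant; the function name non_verbatim_fields is hypothetical, and the record is a subset of the example at the top of this card.

```python
def non_verbatim_fields(address: str, record: dict) -> list:
    """Return the names of non-null fields whose value is not a substring of address."""
    return [
        name
        for name, value in record.items()
        if value is not None and str(value) not in address
    ]


address = "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
record = {"houseNumber": "FLAT NO.32", "street": "MG ROAD", "poi": None,
          "district": "Kamrup", "city": "GUWAHATI", "state": "AS", "pincode": "781029"}

print(non_verbatim_fields(address, record))                        # -> []
print(non_verbatim_fields(address, {**record, "state": "Assam"}))  # -> ['state']
```

The second call shows why normalized values such as a spelled-out state name would count as label errors under this rule.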

Training config

Parameter	Value
Base model	google/flan-t5-small (~77M params)
Fine-tuning	Full fine-tune (not LoRA: at this scale full fine-tuning is cheap, simpler, and typically matches or beats LoRA)
Epochs	8
Effective batch size	16 (per-device 4 × gradient accumulation 4)
Learning rate	3e-4
Gradient checkpointing	Enabled
Max input / target length	160 / 200 tokens
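
The batch-size row and the split sizes together fix the approximate number of optimizer updates. A worked sketch, assuming a single device and that a final partial batch still counts as one update (the exact count depends on the training loop, which is not described here); the variable names are illustrative.

```python
import math

train_records = 4110      # training split size from "Training data"
per_device_batch = 4
grad_accum_steps = 4
epochs = 8
num_devices = 1           # assumption: single-device training

effective_batch = per_device_batch * grad_accum_steps * num_devices
updates_per_epoch = math.ceil(train_records / effective_batch)
total_updates = updates_per_epoch * epochs

print(effective_batch, updates_per_epoch, total_updates)  # -> 16 257 2056
```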

Notable fixes required during training

Vocabulary gap: flan-t5-small's SentencePiece tokenizer has no token for { or }; both map to <unk>, so the base model cannot emit valid JSON out of the box. Fixed with tokenizer.add_tokens(["{", "}"]) followed by model.resize_token_embeddings(...), sized off the model's actual embedding row count rather than len(tokenizer). The two disagreed by 28 rows in the base checkpoint, so resizing to len(tokenizer) would have silently truncated real rows.
Tied-embeddings mismatch: the base checkpoint's config claims shared.weight and lm_head.weight are tied, but the actual tensors differ. Left uncorrected, safetensors serialization trusts the config flag, deduplicates the tensors on save, and randomizes them back apart on reload. Fixed by explicitly setting model.config.tie_word_embeddings = False before saving.
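
The two fixes can be sketched together as follows. This is one reading of "sized off the model's actual embedding row count", namely never shrinking the embedding matrix; it is an assumption, not the author's training script. The helper names are hypothetical, and model and tokenizer stand for the transformers objects loaded from the base checkpoint.

```python
def target_embedding_rows(current_rows: int, tokenizer_len: int) -> int:
    """Rows the embedding matrix needs: never fewer than it already has."""
    return max(current_rows, tokenizer_len)


def add_brace_tokens(model, tokenizer, tokens=("{", "}")) -> int:
    """Add the JSON brace tokens, then resize embeddings without dropping rows."""
    tokenizer.add_tokens(list(tokens))
    current_rows = model.get_input_embeddings().weight.shape[0]
    rows = target_embedding_rows(current_rows, len(tokenizer))
    if rows != current_rows:
        model.resize_token_embeddings(rows)
    # Second fix from the list above: mark the embeddings as untied before saving.
    model.config.tie_word_embeddings = False
    return rows
```

With an embedding matrix 28 rows larger than the tokenizer, adding two tokens leaves the matrix size unchanged under this rule, whereas resizing to len(tokenizer) would drop 26 rows.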

Evaluation (228 held-out test samples)

JSON parse rate: 100%
Overall exact match (all present fields): 12.3%
Mean per-field accuracy: 80.6%

Field	Accuracy	Recall	Gold presence
pincode	97.8%	97.8%	100.0%
state	94.3%	95.6%	98.7%
district	93.9%	94.7%	65.8%
city	89.0%	87.0%	67.5%
subDistrict	88.2%	13.0%	10.1%
houseNumber	87.3%	83.2%	52.2%
houseName	81.6%	85.0%	43.9%
poi	80.7%	0.0%	18.9%
subsubLocality	72.4%	30.6%	27.2%
subLocality	71.1%	14.0%	21.9%
street	68.9%	50.4%	53.9%
village	66.2%	35.3%	22.4%
locality	56.6%	27.4%	41.7%
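
The three columns can be read as follows. The definitions below are inferred from the poi discussion under "Known limitations" rather than taken from evaluate_t5.py, so treat them as an assumption: accuracy counts a null prediction against a null gold value as correct, recall is measured only over records where gold is non-null, and gold presence is the share of such records. The function name field_metrics is hypothetical.

```python
def field_metrics(preds: list, golds: list, field: str) -> dict:
    """Per-field accuracy, recall, and gold presence under exact match."""
    pairs = [(p.get(field), g.get(field)) for p, g in zip(preds, golds)]
    present = [(p, g) for p, g in pairs if g is not None]
    return {
        "accuracy": sum(p == g for p, g in pairs) / len(pairs),
        "recall": sum(p == g for p, g in present) / len(present) if present else None,
        "gold_presence": len(present) / len(pairs),
    }


# Toy data (not from the test set): a model that always predicts null for poi.
golds = [{"poi": None}, {"poi": None}, {"poi": None}, {"poi": "CITY MALL"}]
preds = [{"poi": None}] * 4
print(field_metrics(preds, golds, "poi"))
```

On this toy data the always-null model scores 0.75 accuracy with 0.0 recall at 0.25 gold presence, the same pattern the poi row shows.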

Known limitations

Lower ceiling than the Qwen3-0.6B model (~80.6% vs. ~82.4% mean field accuracy) — expected given ~8x fewer parameters; the gap is concentrated in the low-recall fields (poi, subDistrict, subLocality) where the larger model's extra capacity helps most with rare/ambiguous patterns.
Same locality/subLocality/subsubLocality/village conceptual overlap noted on the Qwen3 model card applies here too — these represent the same "named area, different granularity" concept, and the gold labels themselves are sometimes inconsistent about which bucket a place name belongs in.
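poi has 0% recall despite 80.7% accuracy: the model predicts null for this field far more often than gold does, so it is "correct" mostly because gold is also null, not because it extracts POIs when they are present. For comparison, always predicting null would score 81.1% accuracy on this field (100% minus the 18.9% gold presence).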

Administrative fields via pincode gazetteer (optional, additive)

gazetteer_lookup.py (included) can look up a pincode against an India Post pincode CSV and add three new fields (districtAdministrative, stateAdministrative, cityAdministrative), populated whenever the pincode is found, without touching the model's own verbatim district/state/city output. This is deliberately additive rather than a correction. Overwriting or filling the model's fields with gazetteer values was tried and measured to regress accuracy: gold labels are verbatim-from-text (see "Training data" above), while a pincode lookup returns a value even when the source text, and therefore the gold label, does not mention one.

from gazetteer_lookup import load_gazetteer, add_administrative_fields

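gazetteer = load_gazetteer("pincodes.csv")  # India Post format; the CSV is not included in this repo
# model_output: the model's output for one address, from the Usage step above
result = add_administrative_fields(model_output, gazetteer)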

On the 228-sample test set: 98.7% administrative-field coverage, with zero impact on the original field accuracy numbers above (verified by re-running evaluation with and without the wrapper).
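
The additive pattern itself is simple to illustrate. The sketch below is not the code in gazetteer_lookup.py: the function names, the CSV column names (pincode, district, state, city), and the sample row are all made up for illustration. It adds the three administrative keys to a copy of the record and leaves the model's own fields untouched.

```python
import csv
import io

# Placeholder CSV; a real India Post file will have different columns and values.
SAMPLE_CSV = (
    "pincode,district,state,city\n"
    "999999,EXAMPLE DISTRICT,EXAMPLE STATE,EXAMPLE CITY\n"
)


def load_pincode_table(handle) -> dict:
    """Map pincode -> CSV row from a file-like object with a header line."""
    return {row["pincode"]: row for row in csv.DictReader(handle)}


def with_administrative_fields(record: dict, table: dict) -> dict:
    """Return a copy of record with three added keys; original keys are unchanged."""
    row = table.get(record.get("pincode") or "", {})
    return {
        **record,
        "districtAdministrative": row.get("district"),
        "stateAdministrative": row.get("state"),
        "cityAdministrative": row.get("city"),
    }


table = load_pincode_table(io.StringIO(SAMPLE_CSV))
record = {"district": None, "state": "XX", "city": None, "pincode": "999999"}
enriched = with_administrative_fields(record, table)
```

Because the original keys are copied through unchanged, per-field accuracy on district, state, and city cannot move, which matches the zero-impact result reported above.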

Files in this repo

model.safetensors, config.json, generation_config.json — the fine-tuned model
tokenizer.json, tokenizer_config.json — tokenizer (includes the added {/} tokens)
inference_t5.py / evaluate_t5.py — standalone CLI scripts (single address, stdin/file batch, or full test-set evaluation), only need transformers+torch
gazetteer_lookup.py — optional pincode → district/state/city lookup wrapper (needs a pincodes CSV, not included here — see the GitHub repo for the source used)

License

Apache 2.0, inherited from the base model (google/flan-t5-small).
