flan-t5-small: Indian Address Parser
A full fine-tune of google/flan-t5-small that parses raw, unstructured Indian address strings into 13 structured fields, output as JSON. This is a smaller, encoder-decoder alternative to gagan1985/qwen3-0.6b-indian-address-parser (~77M trainable parameters here vs. a 596M-parameter causal LM there), useful where a compact, CPU-friendly model matters more than the last few points of accuracy.
Input: "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
Output: {"houseNumber": "FLAT NO.32", "houseName": "UTTARA TOWERS", "poi": null,
"street": "MG ROAD", "subsubLocality": null, "subLocality": null, "locality": null,
"village": null, "subDistrict": null, "district": "Kamrup", "city": "GUWAHATI",
"state": "AS", "pincode": "781029"}
Usage
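import torch  # backend required by transformers for return_tensors="pt"
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# Load the fine-tuned checkpoint and its tokenizer from the Hub.
repo = "gagan1985/flan-t5-small-indian-address-parser"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSeq2SeqLM.from_pretrained(repo)
address = "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
# Prepend the task prefix the model was fine-tuned with.
inputs = tokenizer("parse indian address: " + address, return_tensors="pt")
# Greedy decoding (no beams, no sampling) keeps the output deterministic.
out = model.generate(**inputs, max_length=200, num_beams=1, do_sample=False)
# The decoded string is the JSON object shown above.
print(tokenizer.decode(out[0], skip_special_tokens=True))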
No chat template or system prompt is needed: this is a plain seq2seq model, prompted with a
short task prefix ("parse indian address: ") the way T5 models expect.
Or use the included inference_t5.py / evaluate_t5.py (download the repo, then run
python inference_t5.py --model . "<address>"). Both are standalone; only
transformers and torch are required.
Note on a load-time warning: you'll see
"...specifies to tie shared.weight to lm_head.weight, but both are present in the checkpoints with different values, so we will NOT tie them." on load. This is expected and harmless: the two weight matrices are genuinely different (the base checkpoint's config claims tied embeddings, but this fine-tune's config.json correctly overrides that to tie_word_embeddings: false), and transformers loads both tensors correctly despite the noisy warning.
Also available
- gagan1985/qwen3-0.6b-indian-address-parser: larger causal-LM alternative, LoRA fine-tuned, ~2 points higher mean field accuracy
- pip install indian-address-parser: the Qwen3 model packaged as a Python library; source and a benchmark against shiprocket-ai/open-tinybert-indian-address-ner are on GitHub at innerkorehq/indian-address-parser
Datasets
- gagan1985/indian-addresses-gold: the gold-labeled training data behind this model
- gagan1985/indian-addresses-raw: the 4.37M-record raw, unlabeled corpus this gold set was drawn from
Fields
houseNumber, houseName, poi, street, subsubLocality, subLocality,
locality, village, subDistrict, district, city, state, pincode
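A minimal sketch of checking a decoded output string against this field list. The helper name parse_model_output and its fallback behaviour (return None on invalid JSON, fill missing keys with None, drop unexpected keys) are assumptions for illustration, not part of the included scripts.

```python
import json

# The 13 field names listed on this card.
FIELDS = [
    "houseNumber", "houseName", "poi", "street", "subsubLocality",
    "subLocality", "locality", "village", "subDistrict", "district",
    "city", "state", "pincode",
]

def parse_model_output(text):
    """Parse decoded model text into a dict with exactly the 13 card fields.

    Hypothetical helper: returns None when the text is not a JSON object,
    drops unexpected keys, and fills missing keys with None.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return {field: data.get(field) for field in FIELDS}

decoded = '{"houseNumber": "FLAT NO.32", "street": "MG ROAD", "poi": null, "pincode": "781029"}'
parsed = parse_model_output(decoded)
print(parsed["street"], parsed["poi"], parsed["city"])  # MG ROAD None None
```

Normalizing to a fixed key set this way keeps downstream code from having to special-case outputs that omit a field.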
Training data
- 4,566 unique gold-labeled records (deduplicated), split 4,110 train / 228 val / 228 test
- Gold labels are always copied verbatim from the raw address text, never paraphrased or normalized. A field is null in gold whenever the source text simply doesn't mention it, not because the true value is unknown.
- Sourced from two distinct raw formats: Indian MCA (Ministry of Corporate Affairs) company-registration addresses, and bank/business-correspondent branch addresses
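The verbatim-label rule above can be checked mechanically. The sketch below assumes "verbatim" means an exact, case-sensitive substring of the raw address; the function name non_verbatim_fields is hypothetical, and the real labeling pipeline may tolerate whitespace differences this check would flag.

```python
def non_verbatim_fields(raw_address, labels):
    """Return names of non-null fields whose value is not a substring of the raw text."""
    return [
        name for name, value in labels.items()
        if value is not None and value not in raw_address
    ]

raw = "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
gold = {"houseNumber": "FLAT NO.32", "houseName": "UTTARA TOWERS", "poi": None,
        "street": "MG ROAD", "district": "Kamrup", "city": "GUWAHATI",
        "state": "AS", "pincode": "781029"}

# A normalized label ("Assam" instead of the verbatim "AS") violates the rule.
normalized = dict(gold, state="Assam")

print(non_verbatim_fields(raw, gold))        # []
print(non_verbatim_fields(raw, normalized))  # ['state']
```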
Training config
| Parameter | Value |
|---|---|
| Base model | google/flan-t5-small (~77M params) |
| Fine-tuning | Full fine-tune (not LoRA; at this scale full fine-tuning is cheap and simpler, and typically matches or beats LoRA) |
| Epochs | 8 |
| Effective batch size | 16 (per-device 4 × gradient accumulation 4) |
| Learning rate | 3e-4 |
| Gradient checkpointing | Enabled |
| Max input / target length | 160 / 200 tokens |
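As a rough worked example of what this configuration implies, the arithmetic below derives the effective batch size and an approximate optimizer-step count from the numbers on this card. It assumes a single device and that a final partial batch still counts as one step; the exact count reported by a given trainer may differ slightly.

```python
import math

train_records = 4110      # train split size from "Training data"
per_device_batch = 4
grad_accum_steps = 4
epochs = 8

effective_batch = per_device_batch * grad_accum_steps
# Assumption: single device; a trailing partial batch counts as a full step.
steps_per_epoch = math.ceil(train_records / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 16 257 2056
```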
Notable fixes required during training
- Vocabulary gap: flan-t5-small's SentencePiece tokenizer has no token at all for { or }; both map to <unk>, making valid JSON output architecturally impossible out of the box. Fixed by tokenizer.add_tokens(["{", "}"]) + model.resize_token_embeddings(...), sized off the model's actual embedding row count rather than len(tokenizer) (the two disagreed by 28 rows in the base checkpoint; resizing to len(tokenizer) would have silently truncated real rows).
- Tied-embeddings mismatch: the base checkpoint's config claims shared.weight and lm_head.weight are tied, but the actual tensors differ. Left uncorrected, safetensors serialization trusts the config flag, deduplicates the tensors on save, and randomizes them back apart on reload. Fixed by explicitly setting model.config.tie_word_embeddings = False before saving.
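The sizing pitfall in the vocabulary fix can be illustrated without loading a model. The card does not spell out the exact target size that was used, so the "never shrink" rule below is one possible reading, not the author's code. The row counts are assumed illustrative values chosen to match the 28-row gap stated above, and safe_embedding_rows is a hypothetical helper.

```python
def safe_embedding_rows(model_rows, tokenizer_len):
    """Pick a resize target that never drops existing embedding rows (hypothetical rule)."""
    return max(model_rows, tokenizer_len)

# Assumed illustrative sizes; the card only states that the two counts differ by 28.
model_rows = 32128                        # rows in the model's embedding matrix
tokenizer_len = 32100                     # len(tokenizer) before adding tokens
tokenizer_len_after = tokenizer_len + 2   # after adding "{" and "}"

print(model_rows - tokenizer_len)                            # 28
print(tokenizer_len_after < model_rows)                      # True: resizing to len(tokenizer) would shrink
print(safe_embedding_rows(model_rows, tokenizer_len_after))  # 32128
```

Under these assumed sizes, len(tokenizer) is still smaller than the embedding matrix even after the two tokens are added, which is why passing it to a resize call would cut rows rather than add them.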
Evaluation (228 held-out test samples)
- JSON parse rate: 100%
- Overall exact match (all present fields): 12.3%
- Mean per-field accuracy: 80.6%
| Field | Accuracy | Recall | Gold presence |
|---|---|---|---|
| pincode | 97.8% | 97.8% | 100.0% |
| state | 94.3% | 95.6% | 98.7% |
| district | 93.9% | 94.7% | 65.8% |
| city | 89.0% | 87.0% | 67.5% |
| subDistrict | 88.2% | 13.0% | 10.1% |
| houseNumber | 87.3% | 83.2% | 52.2% |
| houseName | 81.6% | 85.0% | 43.9% |
| poi | 80.7% | 0.0% | 18.9% |
| subsubLocality | 72.4% | 30.6% | 27.2% |
| subLocality | 71.1% | 14.0% | 21.9% |
| street | 68.9% | 50.4% | 53.9% |
| village | 66.2% | 35.3% | 22.4% |
| locality | 56.6% | 27.4% | 41.7% |
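The gap between the Accuracy and Recall columns is easiest to see on a toy example. The definitions below are an assumption consistent with the table (accuracy counts null-equals-null as correct over all records; recall only looks at records where gold is non-null), not a copy of evaluate_t5.py, and the data is made up.

```python
def field_metrics(gold_values, pred_values):
    """Accuracy over all records; recall over records where gold is non-null."""
    pairs = list(zip(gold_values, pred_values))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    present = [(g, p) for g, p in pairs if g is not None]
    recall = sum(g == p for g, p in present) / len(present) if present else None
    return accuracy, recall

# Toy data, not from the test set: gold has a POI in 1 of 5 records,
# and the predictions are null everywhere.
gold_poi = ["NEAR SBI ATM", None, None, None, None]
pred_poi = [None, None, None, None, None]

print(field_metrics(gold_poi, pred_poi))  # (0.8, 0.0)
```

A field that is rarely present in gold can therefore score high accuracy while never being extracted, which is the pattern the poi row shows.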
Known limitations
- Lower ceiling than the Qwen3-0.6B model (~80.6% vs. ~82.4% mean field accuracy), as expected given ~8x fewer parameters; the gap is concentrated in the low-recall fields (poi, subDistrict, subLocality), where the larger model's extra capacity helps most with rare/ambiguous patterns.
- The same locality/subLocality/subsubLocality/village conceptual overlap noted on the Qwen3 model card applies here too: these represent the same "named area, different granularity" concept, and the gold labels themselves are sometimes inconsistent about which bucket a place name belongs in.
- poi has 0% recall despite 80.7% accuracy: the model defaults to null for this field far more often than gold does, so it is "correct" mostly because gold is also null, not because it successfully extracts POIs when present.
Administrative fields via pincode gazetteer (optional, additive)
gazetteer_lookup.py (included) can look up a pincode against an India Post pincode CSV
and add three new, always-populated fields (districtAdministrative,
stateAdministrative, cityAdministrative) without touching the model's own verbatim
district/state/city output. This is deliberately additive rather than a correction:
overwriting or filling the model's fields with gazetteer values was tried and measured to
regress accuracy, because gold labels are verbatim-from-text (see "Training data"
above) and a pincode lookup always has an answer even when the source text (and,
correctly, the gold label) doesn't mention one.
from gazetteer_lookup import load_gazetteer, add_administrative_fields
gazetteer = load_gazetteer("pincodes.csv") # India Post format
result = add_administrative_fields(model_output, gazetteer)
On the 228-sample test set: 98.7% administrative-field coverage, with zero impact on the original field accuracy numbers above (verified by re-running evaluation with and without the wrapper).
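To make the "additive, not corrective" design concrete, here is a self-contained sketch of the idea. It is not the code in gazetteer_lookup.py: the function names, the CSV column names, and the sample row are all hypothetical, and the real India Post file may use a different layout.

```python
import csv
import io

# Hypothetical CSV layout and sample row; the real India Post columns may differ.
SAMPLE_CSV = """pincode,district,state,city
781029,Kamrup,Assam,Guwahati
"""

def load_rows(csv_text):
    """Index rows by pincode (hypothetical stand-in for a gazetteer loader)."""
    return {row["pincode"]: row for row in csv.DictReader(io.StringIO(csv_text))}

def with_administrative_fields(parsed, rows):
    """Return a copy of `parsed` with three extra keys; original keys are untouched."""
    hit = rows.get(parsed.get("pincode") or "", {})
    result = dict(parsed)
    result["districtAdministrative"] = hit.get("district")
    result["stateAdministrative"] = hit.get("state")
    result["cityAdministrative"] = hit.get("city")
    return result

parsed = {"district": "Kamrup", "city": "GUWAHATI", "state": "AS", "pincode": "781029"}
enriched = with_administrative_fields(parsed, load_rows(SAMPLE_CSV))

print(enriched["state"], enriched["stateAdministrative"])  # AS Assam
```

Because the verbatim state value ("AS") is left alone and the looked-up value lands in a separate key, per-field accuracy against verbatim gold labels cannot change.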
Files in this repo
- model.safetensors, config.json, generation_config.json: the fine-tuned model
- tokenizer.json, tokenizer_config.json: tokenizer (includes the added { / } tokens)
- inference_t5.py / evaluate_t5.py: standalone CLI scripts (single address, stdin/file batch, or full test-set evaluation); only need transformers + torch
- gazetteer_lookup.py: optional pincode → district/state/city lookup wrapper (needs a pincodes CSV, not included here; see the GitHub repo for the source used)
License
Apache 2.0, inherited from the base model (google/flan-t5-small).