flan-t5-small: Indian Address Parser

A full fine-tune of google/flan-t5-small that parses raw, unstructured Indian address strings into 13 structured fields, output as JSON. This is a smaller, encoder-decoder alternative to gagan1985/qwen3-0.6b-indian-address-parser (~77M trainable params here vs. a 596M-param causal LM there), useful where a compact, CPU-friendly model matters more than the last few points of accuracy.

Input:  "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
Output: {"houseNumber": "FLAT NO.32", "houseName": "UTTARA TOWERS", "poi": null,
         "street": "MG ROAD", "subsubLocality": null, "subLocality": null, "locality": null,
         "village": null, "subDistrict": null, "district": "Kamrup", "city": "GUWAHATI",
         "state": "AS", "pincode": "781029"}

Usage

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

repo = "gagan1985/flan-t5-small-indian-address-parser"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSeq2SeqLM.from_pretrained(repo)

address = "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
inputs = tokenizer("parse indian address: " + address, return_tensors="pt")
out = model.generate(**inputs, max_length=200, num_beams=1, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

No chat template or system prompt: this is a plain seq2seq model, prompted with a short task prefix ("parse indian address: ") the way T5 models expect.

Or use the included inference_t5.py / evaluate_t5.py (download the repo, then run python inference_t5.py --model . "<address>"). Both are standalone; only transformers and torch are required.

Note on a load-time warning: you'll see "...specifies to tie shared.weight to lm_head.weight, but both are present in the checkpoints with different values, so we will NOT tie them." on load. This is expected and harmless. The two weight matrices are genuinely different: the base checkpoint's config claims tied embeddings, but this fine-tune's config.json correctly overrides that with tie_word_embeddings: false, so transformers loads both tensors correctly despite the noisy warning.

Fields

houseNumber, houseName, poi, street, subsubLocality, subLocality,
locality, village, subDistrict, district, city, state, pincode
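
Because the model emits its answer as a JSON string, downstream code will usually want to parse it and confirm that all 13 keys are present. A minimal sketch, assuming the output follows the shape of the example at the top of this card; parse_model_output is a hypothetical helper name, not part of the repo scripts:

```python
import json

# The 13 field names listed above, in the order the card gives them.
FIELDS = [
    "houseNumber", "houseName", "poi", "street", "subsubLocality",
    "subLocality", "locality", "village", "subDistrict", "district",
    "city", "state", "pincode",
]


def parse_model_output(text):
    """Parse decoded model output into a dict with exactly the 13 fields.

    Hypothetical helper: returns None if the text is not a JSON object.
    Missing keys are filled with None; unexpected keys are dropped.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return {field: data.get(field) for field in FIELDS}
```

Filling missing keys with None mirrors the convention that a field the text does not mention is null.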

Training data

  • 4,566 unique gold-labeled records (deduplicated), split 4,110 train / 228 val / 228 test
  • Gold labels are always copied verbatim from the raw address text, never paraphrased or normalized. A field is null in gold whenever the source text simply doesn't mention it, not because the true value is unknown.
  • Sourced from two distinct raw formats: Indian MCA (Ministry of Corporate Affairs) company-registration addresses, and bank/business-correspondent branch addresses
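
The verbatim rule can be checked mechanically: every non-null gold value should occur as a substring of the raw address. A small sketch of such a check, assuming labels are stored as a dict mapping field name to a string or None (the function name non_verbatim_fields is illustrative, not from the repo):

```python
def non_verbatim_fields(raw_address, labels):
    """Return the names of fields whose value is not a substring of the raw text.

    Null (None) values are skipped, since a null label only means the
    source text does not mention that field.
    """
    return [
        field
        for field, value in labels.items()
        if value is not None and value not in raw_address
    ]


raw = "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
gold = {"houseNumber": "FLAT NO.32", "street": "MG ROAD", "poi": None,
        "district": "Kamrup", "city": "GUWAHATI", "state": "AS", "pincode": "781029"}
print(non_verbatim_fields(raw, gold))  # [] because every value appears verbatim
```

A normalized label such as "Assam" for the state would be flagged here, because the raw text only contains "AS".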

Training config

| Parameter | Value |
| --- | --- |
| Base model | google/flan-t5-small (~77M params) |
| Fine-tuning | Full fine-tune (not LoRA; at this scale full fine-tuning is cheap and simpler, and typically matches or beats LoRA) |
| Epochs | 8 |
| Effective batch size | 16 (per-device 4 × gradient accumulation 4) |
| Learning rate | 3e-4 |
| Gradient checkpointing | Enabled |
| Max input / target length | 160 / 200 tokens |
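
The effective batch size and a rough optimizer-step count follow directly from these values and the 4,110-record train split. The arithmetic below is a sketch that assumes a single device (the card does not state the device count); the exact step count depends on how the trainer handles the final partial batch.

```python
import math

per_device_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1        # assumption: the device count is not stated on this card
train_records = 4110   # size of the train split
epochs = 8

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(train_records / effective_batch_size)
total_steps = steps_per_epoch * epochs

print(effective_batch_size)  # 16
print(steps_per_epoch)       # 257
print(total_steps)           # 2056
```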

Notable fixes required during training

  • Vocabulary gap: flan-t5-small's SentencePiece tokenizer has no token at all for { or }. Both map to <unk>, making valid JSON output architecturally impossible out of the box. Fixed by tokenizer.add_tokens(["{", "}"]) followed by model.resize_token_embeddings(...), sized off the model's actual embedding row count rather than len(tokenizer) (the two disagreed by 28 rows in the base checkpoint, so resizing to len(tokenizer) would have silently truncated real rows).
  • Tied-embeddings mismatch: the base checkpoint's config claims shared.weight and lm_head.weight are tied, but the actual tensors differ. Left uncorrected, safetensors serialization trusts the config flag, deduplicates the tensors on save, and randomizes them back apart on reload. Fixed by explicitly setting model.config.tie_word_embeddings = False before saving.
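
The vocabulary fix can be sketched roughly as follows. add_tokens, get_input_embeddings, and resize_token_embeddings are standard transformers calls, but the target-size rule used here (never shrink below the existing row count) is only an assumption about what "sized off the model's actual embedding row count" means, since the training script is not reproduced on this card. The 32128 and 32100 figures are the commonly reported embedding and tokenizer sizes for T5 checkpoints, consistent with the 28-row gap described above; treat them as illustrative.

```python
def safe_resize_target(embedding_rows, tokenizer_len):
    """Pick a new embedding size that never drops existing rows (assumed rule)."""
    return max(embedding_rows, tokenizer_len)


def add_brace_tokens(tokenizer, model):
    """Add "{" and "}" to the tokenizer and resize embeddings without truncation.

    Expects a Hugging Face tokenizer and model; returns the new row count.
    """
    tokenizer.add_tokens(["{", "}"])
    current_rows = model.get_input_embeddings().weight.shape[0]
    target = safe_resize_target(current_rows, len(tokenizer))
    model.resize_token_embeddings(target)
    return target


# Illustrative sizes: 32128 embedding rows vs. a 32100-entry tokenizer plus 2 new tokens.
print(safe_resize_target(32128, 32100 + 2))  # 32128
```

With these sizes the two new token ids already fit inside the existing matrix, whereas resizing to len(tokenizer) would cut the matrix from 32128 rows down to 32102.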

Evaluation (228 held-out test samples)

  • JSON parse rate: 100%
  • Overall exact match (all present fields): 12.3%
  • Mean per-field accuracy: 80.6%

| Field | Accuracy | Recall | Gold presence |
| --- | --- | --- | --- |
| pincode | 97.8% | 97.8% | 100.0% |
| state | 94.3% | 95.6% | 98.7% |
| district | 93.9% | 94.7% | 65.8% |
| city | 89.0% | 87.0% | 67.5% |
| subDistrict | 88.2% | 13.0% | 10.1% |
| houseNumber | 87.3% | 83.2% | 52.2% |
| houseName | 81.6% | 85.0% | 43.9% |
| poi | 80.7% | 0.0% | 18.9% |
| subsubLocality | 72.4% | 30.6% | 27.2% |
| subLocality | 71.1% | 14.0% | 21.9% |
| village | 66.2% | 35.3% | 22.4% |
| street | 68.9% | 50.4% | 53.9% |
| locality | 56.6% | 27.4% | 41.7% |
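
As a quick consistency check, the headline mean per-field accuracy can be recomputed from the per-field numbers, assuming it is an unweighted mean over the 13 fields:

```python
# Per-field accuracy (%) copied from the table above.
accuracy = {
    "pincode": 97.8, "state": 94.3, "district": 93.9, "city": 89.0,
    "subDistrict": 88.2, "houseNumber": 87.3, "houseName": 81.6,
    "poi": 80.7, "subsubLocality": 72.4, "subLocality": 71.1,
    "village": 66.2, "street": 68.9, "locality": 56.6,
}

mean_accuracy = sum(accuracy.values()) / len(accuracy)
print(f"{mean_accuracy:.1f}%")  # 80.6%
```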

Known limitations

  • Lower ceiling than the Qwen3-0.6B model (~80.6% vs. ~82.4% mean field accuracy), as expected given ~8x fewer parameters; the gap is concentrated in the low-recall fields (poi, subDistrict, subLocality) where the larger model's extra capacity helps most with rare/ambiguous patterns.
  • Same locality/subLocality/subsubLocality/village conceptual overlap noted on the Qwen3 model card applies here too: these represent the same "named area, different granularity" concept, and the gold labels themselves are sometimes inconsistent about which bucket a place name belongs in.
  • poi has 0% recall despite 80.7% accuracy: the model defaults to null for this field far more often than gold does, so it's "correct" mostly by gold also being null, not by successfully extracting POIs when present.
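
The poi pattern is a general property of these two metrics. The sketch below assumes accuracy counts every sample where the prediction equals gold (nulls included) and recall counts only samples where gold is non-null; the evaluation script's exact definitions may differ. The toy data is made up purely for illustration.

```python
def field_accuracy(preds, golds):
    """Fraction of samples where the prediction equals gold, nulls included."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def field_recall(preds, golds):
    """Among samples with a non-null gold value, the fraction predicted exactly."""
    present = [(p, g) for p, g in zip(preds, golds) if g is not None]
    if not present:
        return 0.0
    return sum(p == g for p, g in present) / len(present)


# Toy data: gold has a POI in 2 of 10 samples; the model always predicts null.
golds = ["SBI ATM", None, None, None, "CITY MALL", None, None, None, None, None]
preds = [None] * 10

print(field_accuracy(preds, golds))  # 0.8
print(field_recall(preds, golds))    # 0.0
```

A model that always answers null scores an accuracy equal to the share of null gold labels, which is why accuracy should be read alongside recall and gold presence for sparse fields.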

Administrative fields via pincode gazetteer (optional, additive)

gazetteer_lookup.py (included) can look up a pincode against an India Post pincode CSV and add three new, always-populated fields (districtAdministrative, stateAdministrative, cityAdministrative) without touching the model's own verbatim district/state/city output. This is deliberately additive rather than a correction: overwriting or filling the model's fields with gazetteer values was tried and measured to regress accuracy, because gold labels are verbatim-from-text (see "Training data" above) and a pincode lookup always has an answer even when the source text, and correctly the gold label, doesn't mention one.

from gazetteer_lookup import load_gazetteer, add_administrative_fields

gazetteer = load_gazetteer("pincodes.csv")  # India Post format
result = add_administrative_fields(model_output, gazetteer)

On the 228-sample test set: 98.7% administrative-field coverage, with zero impact on the original field accuracy numbers above (verified by re-running evaluation with and without the wrapper).
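
To illustrate what "additive" means here, the following is a hypothetical re-implementation of the idea, not the code in gazetteer_lookup.py. The gazetteer is assumed to be a plain dict keyed by pincode string, the fallback to None for unknown pincodes is an assumption, and the sample entry uses placeholder values.

```python
def add_admin_fields_sketch(parsed, gazetteer):
    """Return a copy of `parsed` with three extra administrative fields.

    The model's own district/state/city values are left untouched. If the
    pincode is missing or unknown, the new fields are set to None (an
    assumption made for this sketch).
    """
    entry = gazetteer.get(parsed.get("pincode"), {})
    result = dict(parsed)  # shallow copy so the input is not mutated
    result["districtAdministrative"] = entry.get("district")
    result["stateAdministrative"] = entry.get("state")
    result["cityAdministrative"] = entry.get("city")
    return result


# Placeholder gazetteer entry, for illustration only (not real India Post data).
gazetteer = {"781029": {"district": "Example District",
                        "state": "Example State",
                        "city": "Example City"}}
parsed = {"district": "Kamrup", "state": "AS", "city": "GUWAHATI", "pincode": "781029"}

enriched = add_admin_fields_sketch(parsed, gazetteer)
print(enriched["state"], "|", enriched["stateAdministrative"])  # AS | Example State
```

Because the original keys are never written to, any metric computed on district, state, or city is unchanged by the wrapper.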

Files in this repo

  • model.safetensors, config.json, generation_config.json: the fine-tuned model
  • tokenizer.json, tokenizer_config.json: tokenizer (includes the added { and } tokens)
  • inference_t5.py / evaluate_t5.py: standalone CLI scripts (single address, stdin/file batch, or full test-set evaluation), only need transformers and torch
  • gazetteer_lookup.py: optional pincode → district/state/city lookup wrapper (needs a pincodes CSV, not included here; see the GitHub repo for the source used)

License

Apache 2.0, inherited from the base model (google/flan-t5-small).
