Qwen3-0.6B LoRA: Indian Address Parser

A LoRA adapter fine-tuned on Qwen/Qwen3-0.6B that parses raw, unstructured Indian address strings into 13 structured fields, output as JSON.

Input:  "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
Output: {"houseNumber": "FLAT NO.32", "houseName": "UTTARA TOWERS", "poi": null,
         "street": "MG ROAD", "subsubLocality": null, "subLocality": null, "locality": null,
         "village": null, "subDistrict": null, "district": "Kamrup", "city": "GUWAHATI",
         "state": "AS", "pincode": "781029"}

Also available as a Python package

pip install indian-address-parser

Source, CLI, and a benchmark comparison against shiprocket-ai/open-tinybert-indian-address-ner are on GitHub: innerkorehq/indian-address-parser.

Two loading paths

Training was done with MLX (mlx-lm's lora command) on Apple Silicon. The repo root holds a PEFT-format conversion of that adapter, so it loads on any platform (CUDA, MPS, CPU) via standard transformers+peft; the mlx/ subfolder holds the original MLX artifacts for Apple Silicon users. The two versions were compared on a held-out spot check: 13 of 15 outputs were identical, and the 2 differences landed on fields already noted as ambiguous under Known limitations. That is consistent with floating-point differences between backends on a near-tied decision, not with a conversion error.
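A spot check of this kind could be scripted roughly as follows. This is a minimal sketch, not the script behind the 13/15 figure: compare_outputs and the sample dictionaries are hypothetical, and it assumes each backend's output has already been parsed into a dict.

```python
from typing import Optional


def compare_outputs(
    peft_outputs: list[dict[str, Optional[str]]],
    mlx_outputs: list[dict[str, Optional[str]]],
) -> tuple[int, list[tuple[int, list[str]]]]:
    """Return (count of identical outputs, [(index, differing fields), ...])."""
    identical = 0
    diffs = []
    for i, (a, b) in enumerate(zip(peft_outputs, mlx_outputs)):
        # A field differs if its value is not the same in both outputs.
        changed = sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))
        if changed:
            diffs.append((i, changed))
        else:
            identical += 1
    return identical, diffs


# Hypothetical parsed outputs for two addresses (illustrative values only).
peft_out = [{"city": "GUWAHATI", "locality": None}, {"city": "PUNE", "locality": "KOTHRUD"}]
mlx_out = [{"city": "GUWAHATI", "locality": None}, {"city": "PUNE", "locality": None}]

identical, diffs = compare_outputs(peft_out, mlx_out)
print(identical, diffs)
```

Listing the differing field names per sample makes it easy to see whether disagreements cluster on the ambiguous area fields.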

Option A: PEFT (transformers, any platform)

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "gagan1985/qwen3-0.6b-indian-address-parser"
tokenizer = AutoTokenizer.from_pretrained(repo)
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, repo)

SYSTEM_PROMPT = (
    "You are an Indian address parser. Given a raw address string, extract address "
    "fields and return them as a JSON object. Use null for fields not present in the "
    "address. Output only the JSON object, no explanation.\n\n"
    "Fields: houseNumber, houseName, poi, street, subsubLocality, subLocality, "
    "locality, village, subDistrict, district, city, state, pincode"
)
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Parse this address:\n<your address here>"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Or use the included inference.py / evaluate.py (download the repo, then python inference.py --model Qwen/Qwen3-0.6B --adapter . "<address>").

Option B: MLX (Apple Silicon)

mlx_lm.load's adapter_path only accepts a local directory, not an HF repo ID, so fetch the mlx/ subfolder first, then point at it locally:

from huggingface_hub import snapshot_download
import mlx_lm

local_dir = snapshot_download("gagan1985/qwen3-0.6b-indian-address-parser", allow_patterns=["mlx/*"])
model, tokenizer = mlx_lm.load("Qwen/Qwen3-0.6B", adapter_path=f"{local_dir}/mlx")

# SYSTEM_PROMPT is the same string defined in Option A (also kept in config.py).
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Parse this address:\n<your address here>"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
print(mlx_lm.generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=False))

Or use mlx/inference_mlx.py / mlx/evaluate_mlx.py from a local checkout.

Fields

houseNumber, houseName, poi, street, subsubLocality, subLocality,
locality, village, subDistrict, district, city, state, pincode
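Because the model returns its answer as JSON text, downstream code usually wants a dict with exactly these 13 keys. The following is a minimal sketch under assumptions: normalize_output is a hypothetical helper (not part of this repo or the pip package), and filling missing keys with None while dropping unexpected ones is a design choice made here for illustration.

```python
import json

FIELDS = [
    "houseNumber", "houseName", "poi", "street", "subsubLocality", "subLocality",
    "locality", "village", "subDistrict", "district", "city", "state", "pincode",
]


def normalize_output(raw: str) -> dict:
    """Parse the model's JSON text and return a dict with exactly the 13 fields."""
    parsed = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    # Missing keys become None; keys outside FIELDS are dropped.
    return {field: parsed.get(field) for field in FIELDS}


raw = '{"street": "MG ROAD", "city": "GUWAHATI", "state": "AS", "pincode": "781029"}'
record = normalize_output(raw)
print(record["city"], record["locality"])
```

Keeping the key set fixed means callers can index any field without checking for its presence first.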

Training data

  • 5,008 gold-labeled records: 9 human-reviewed (Label Studio) + 4,999 LLM-reviewed (deepseek/deepseek-v4-pro via OpenRouter, span-verified against the raw address text: any field value not found verbatim in the source string was dropped, not guessed)
  • Sourced from two distinct raw formats: Indian MCA (Ministry of Corporate Affairs) company-registration addresses, and bank/business-correspondent branch addresses
  • Split: 4,266 train / 237 val / 237 test (deduplicated; 4,740 records in total)
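The span-verification rule described above (drop any value that does not occur verbatim in the source string) can be sketched as below. This is an illustration, not the pipeline's actual code: span_verify is a hypothetical name, the matching is assumed to be case-sensitive, and "dropped" is modeled as setting the field to None.

```python
from typing import Optional


def span_verify(
    raw_address: str, labels: dict[str, Optional[str]]
) -> dict[str, Optional[str]]:
    """Keep a field value only if it occurs verbatim in the raw address."""
    return {
        field: value if value is not None and value in raw_address else None
        for field, value in labels.items()
    }


raw = "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
# "Assam" is a hypothetical LLM label that expands "AS"; it is not in the raw text.
labels = {"street": "MG ROAD", "city": "GUWAHATI", "state": "Assam", "pincode": "781029"}

verified = span_verify(raw, labels)
print(verified)
```

A verbatim check like this keeps a reviewer model from normalizing or inventing values, at the cost of discarding expansions that are plausible but not present in the text.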

Training config

Parameter                     Value
Base model                    Qwen/Qwen3-0.6B (28 layers)
LoRA rank / alpha / dropout   16 / 32 / 0.05
Target modules                q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Fine-tuned layers             16 (of 28)
Iterations                    2000 (≈3.75 epochs at batch=8)
Learning rate                 2e-4
Trainable params              5.77M / 596M (0.968%)
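The trainable-parameter and epoch figures can be cross-checked with a little arithmetic. The sketch below assumes the layer dimensions from the published Qwen3-0.6B config (hidden size 1024, 16 query heads and 8 key/value heads of dimension 128, MLP width 3072); those dimensions are not stated in this card. It also assumes each LoRA pair adds rank × (in + out) parameters with no bias terms.

```python
# Assumed Qwen3-0.6B dimensions (from the base model's config, not from this card).
HIDDEN = 1024         # hidden_size
Q_OUT = 16 * 128      # num_attention_heads * head_dim
KV_OUT = 8 * 128      # num_key_value_heads * head_dim
INTERMEDIATE = 3072   # intermediate_size

RANK = 16             # LoRA rank from the table above
TUNED_LAYERS = 16     # fine-tuned layers from the table above

# (in_features, out_features) for each targeted projection.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA matrices A (rank x in) and B (out x rank) add rank * (in + out) parameters.
per_layer = sum(RANK * (n_in + n_out) for n_in, n_out in MODULE_SHAPES.values())
trainable = per_layer * TUNED_LAYERS

# iterations * batch size / training examples
epochs = 2000 * 8 / 4266

print(f"trainable params: {trainable:,}")  # 5,767,168
print(f"epochs: {epochs:.2f}")             # 3.75
```

Under these assumptions the count comes to 5,767,168, which rounds to the 5.77M reported in the table.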

Evaluation (237 held-out test samples)

  • JSON parse rate: 100%
  • Overall exact match (all present fields): 19.0%
  • Mean per-field accuracy: 82.4%

Field            Accuracy   Recall   Gold presence
pincode          100.0%     100.0%   100.0%
state            94.9%      96.2%    98.7%
district         94.9%      95.1%    68.8%
houseNumber      90.3%      84.5%    54.4%
city             90.3%      91.3%    62.9%
houseName        87.3%      88.5%    43.9%
subDistrict      86.1%      29.6%    11.4%
poi              84.0%      30.8%    16.5%
village          75.1%      43.8%    27.0%
subLocality      74.3%      21.6%    21.5%
subsubLocality   72.6%      31.2%    27.0%
street           64.6%      55.6%    53.2%
locality         56.5%      35.6%    43.9%
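The card does not define the three columns, so the following is one plausible reading, stated as an assumption. Accuracy counts a sample as correct when prediction and gold agree, including when both are null. Recall is the share of samples with a non-null gold value that the model reproduced exactly. Gold presence is the share of samples with a non-null gold value. The function name and toy records are hypothetical.

```python
from typing import Optional

Record = dict[str, Optional[str]]


def field_metrics(preds: list[Record], golds: list[Record], field: str) -> dict[str, float]:
    """Per-field accuracy, recall, and gold presence under the assumed definitions."""
    n = len(golds)
    pairs = list(zip(preds, golds))
    # Accuracy: exact agreement, where null == null also counts as correct.
    correct = sum(p.get(field) == g.get(field) for p, g in pairs)
    # Recall: exact agreement restricted to samples whose gold value is present.
    present = [(p, g) for p, g in pairs if g.get(field) is not None]
    recalled = sum(p.get(field) == g.get(field) for p, g in present)
    return {
        "accuracy": correct / n,
        "recall": recalled / len(present) if present else float("nan"),
        "gold_presence": len(present) / n,
    }


# Toy data (illustrative values only): the field is present in 2 of 4 gold records.
golds = [{"poi": "SBI ATM"}, {"poi": None}, {"poi": None}, {"poi": "CITY MALL"}]
preds = [{"poi": "SBI ATM"}, {"poi": None}, {"poi": None}, {"poi": None}]

m = field_metrics(preds, golds, "poi")
print(m)
```

Under this reading, a rarely present field such as subDistrict can show high accuracy alongside low recall, because every sample where both gold and prediction are null counts toward accuracy but not toward recall.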

Known limitations

  • locality/subLocality/subsubLocality/village overlap conceptually. These represent the same "named area, different granularity" concept, and disagreements are often the model extracting the correct substring into an adjacent bucket rather than a genuine extraction miss. The gold labels themselves are sometimes inconsistent here (the same place name occasionally appears in two of these fields simultaneously in gold).
  • street sometimes over-absorbs multi-part location clusters (e.g. "KARBI PATH, ZOO NARENGI ROAD, BAMUNIMAIDAN" tagged entirely as street instead of being split across street/subsubLocality/subLocality), likely because that split pattern is underrepresented in training data (subsubLocality/subLocality gold presence is only ~21-27%).
  • Source addresses occasionally contain data artifacts (e.g. duplicated substrings from the original MCA records, like "PEDANANDIPALLE AGRAHARAMPEDANANDIPALLE AGRAHARAM") that the model sometimes reproduces with minor copy errors.
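To separate adjacent-bucket disagreements from genuine misses during error analysis, one could flag cases where a gold area value appears in a different area field of the prediction. This is a minimal sketch; adjacent_bucket_hits is a hypothetical helper and the field assignment in the example is invented for illustration.

```python
from typing import Optional

AREA_FIELDS = ("subsubLocality", "subLocality", "locality", "village")


def adjacent_bucket_hits(
    pred: dict[str, Optional[str]], gold: dict[str, Optional[str]]
) -> list[tuple[str, str]]:
    """Return (gold_field, pred_field) pairs where a gold area value was
    extracted correctly but placed in a different area field."""
    hits = []
    for gold_field in AREA_FIELDS:
        value = gold.get(gold_field)
        # Skip absent gold values and values already in the right field.
        if value is None or pred.get(gold_field) == value:
            continue
        for pred_field in AREA_FIELDS:
            if pred_field != gold_field and pred.get(pred_field) == value:
                hits.append((gold_field, pred_field))
    return hits


# Hypothetical labels: the same place name lands in neighbouring buckets.
gold = {"subLocality": "BAMUNIMAIDAN", "locality": None}
pred = {"subLocality": None, "locality": "BAMUNIMAIDAN"}

hits = adjacent_bucket_hits(pred, gold)
print(hits)
```

Counting these hits separately gives a rough sense of how much of the gap on the area fields comes from bucket choice rather than from failed extraction.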

Files in this repo

Root (PEFT format):

  • adapter_model.safetensors + adapter_config.json: standard PEFT LoRA adapter
  • tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, special_tokens_map.json, added_tokens.json, chat_template.jinja: tokenizer files
  • inference.py / evaluate.py: CLI scripts (single address, stdin/file batch, or full test-set evaluation) using transformers+peft
  • config.py: shared constants (SYSTEM_PROMPT, field list, prompt template)

mlx/ (original MLX format, Apple Silicon):

  • adapters.safetensors + adapter_config.json: mlx-lm's native adapter format (adapter_config.json here is mlx-lm's training-run metadata, not a PEFT LoraConfig)
  • inference_mlx.py / evaluate_mlx.py: same CLI shape as the root scripts, via mlx-lm
  • config.py: same constants, duplicated here so this subfolder is usable standalone