viet-address-v2

A PhoBERT model fine-tuned for named entity recognition (NER) on Vietnamese addresses. It extracts structured fields (place name, house number, street, ward, district, city) from free-form address strings.

Source artefact: bert_vietmap_gmaps_p6ss_20260620_222446
Base model: outputs/pretrain/bert_compact_v2
Trained on: Vietnamese address corpus (mixed sources)
Saved at: 2026-06-20T16:39:26.627984+00:00

Metrics

Metric     Value
Precision  0.8657
Recall     0.9946
F1         0.9257
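
As a quick sanity check, F1 is the harmonic mean of precision and recall, so the reported F1 can be reproduced from the other two values. A minimal sketch follows; the function name `f1_score` is illustrative and not part of this repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the Metrics table.
f1 = f1_score(0.8657, 0.9946)
print(round(f1, 4))  # 0.9257
```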

Labels

  • <PAD>
  • B-DISTRICT
  • B-NEIGHBORHOOD
  • B-POI
  • B-PREMISE
  • B-PROVINCE
  • B-ROUTE
  • B-STREET_NUMBER
  • B-SUBPREMISE
  • B-WARD
  • I-DISTRICT
  • I-NEIGHBORHOOD
  • I-POI
  • I-PREMISE
  • I-PROVINCE
  • I-ROUTE
  • I-STREET_NUMBER
  • I-SUBPREMISE
  • I-WARD
  • O
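
The list above follows the BIO tagging convention: a `B-` label marks the first token of an entity, an `I-` label marks a continuation, and `O` marks tokens outside any entity. `<PAD>` appears to be a padding label. A small sketch that derives the entity types from the label list; the helper name `entity_types` is hypothetical.

```python
LABELS = [
    "<PAD>", "B-DISTRICT", "B-NEIGHBORHOOD", "B-POI", "B-PREMISE",
    "B-PROVINCE", "B-ROUTE", "B-STREET_NUMBER", "B-SUBPREMISE", "B-WARD",
    "I-DISTRICT", "I-NEIGHBORHOOD", "I-POI", "I-PREMISE", "I-PROVINCE",
    "I-ROUTE", "I-STREET_NUMBER", "I-SUBPREMISE", "I-WARD", "O",
]


def entity_types(labels):
    """Return the sorted entity types that appear with a B- or I- prefix."""
    return sorted({label[2:] for label in labels if label[:2] in ("B-", "I-")})


types = entity_types(LABELS)
print(len(LABELS), len(types))  # 20 9
print(types)
```

So the 20 labels cover 9 entity types, each with a `B-` and an `I-` variant, plus `O` and `<PAD>`.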

Quickstart

from transformers import pipeline

ner = pipeline("token-classification", model="open-thienhang-com/viet-address-v2", aggregation_strategy="simple")
ner("123 Nguyễn Huệ, Phường Bến Nghé, Quận 1, TP Hồ Chí Minh")
# Illustrative output; group names follow the Labels list above:
# → [
#     {'entity_group': 'STREET_NUMBER', 'word': '123', ...},
#     {'entity_group': 'ROUTE',         'word': 'Nguyễn Huệ', ...},
#     {'entity_group': 'WARD',          'word': 'Phường Bến Nghé', ...},
#     {'entity_group': 'DISTRICT',      'word': 'Quận 1', ...},
#     {'entity_group': 'PROVINCE',      'word': 'TP Hồ Chí Minh', ...},
#   ]
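
The pipeline returns a flat list of entity dicts. For downstream use it is often handier to group them by field. A minimal sketch, assuming only the `entity_group` and `word` keys that the aggregated pipeline output carries; the helper name `to_fields` and the sample data are hand-written for illustration, not real model output.

```python
from collections import defaultdict


def to_fields(entities):
    """Group aggregated pipeline entities into {entity_group: [words, ...]}."""
    fields = defaultdict(list)
    for ent in entities:
        fields[ent["entity_group"]].append(ent["word"].strip())
    return dict(fields)


# Hand-written sample in the pipeline's output shape (not real model output).
# Group names are taken from the Labels list.
sample = [
    {"entity_group": "STREET_NUMBER", "word": "123"},
    {"entity_group": "ROUTE", "word": "Nguyễn Huệ"},
    {"entity_group": "WARD", "word": "Phường Bến Nghé"},
]
fields = to_fields(sample)
print(fields["ROUTE"])  # ['Nguyễn Huệ']
```

Values are kept as lists because an address may contain more than one span of the same type.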

Or run the tokenizer and model manually:

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("open-thienhang-com/viet-address-v2")
model = AutoModelForTokenClassification.from_pretrained("open-thienhang-com/viet-address-v2").eval()

inputs = tokenizer("123 Nguyễn Huệ, Phường Bến Nghé, Quận 1, TP HCM",
                   return_tensors="pt", truncation=True, max_length=256)
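with torch.no_grad():
    logits = model(**inputs).logits
# Pick the highest-scoring label id for each token of the single input.
pred_ids = logits.argmax(-1)[0].tolist()
labels = [model.config.id2label[i] for i in pred_ids]
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
# Print only the tokens tagged as part of an entity.
for token, label in zip(tokens, labels):
    if label not in ("O", "<PAD>"):
        print(f"  {token:20s}{label}")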
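
The manual loop prints one label per token. To get whole entities, the per-token BIO labels can be merged into spans. The sketch below is one possible way to do that, not the pipeline's own aggregation logic. It assumes each token is a whole syllable that can be joined with spaces; subword pieces and special tokens would need extra handling. The function name `merge_bio` and the sample tokens are hypothetical.

```python
def merge_bio(tokens, labels, ignore=("O", "<PAD>")):
    """Merge per-token BIO labels into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        prefix, _, etype = label.partition("-")
        if label in ignore or prefix not in ("B", "I"):
            # Outside any entity: close the open span, if there is one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        elif prefix == "B" or etype != current_type:
            # A new entity starts (a stray I- tag is treated like B-).
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = etype, [token]
        else:
            # I- tag continuing the current entity.
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hand-written tokens and labels for illustration (not real model output).
tokens = ["<s>", "123", "Nguyễn", "Huệ", ",", "Quận", "1", "</s>"]
labels = ["O", "B-STREET_NUMBER", "B-ROUTE", "I-ROUTE", "O",
          "B-DISTRICT", "I-DISTRICT", "O"]
spans = merge_bio(tokens, labels)
print(spans)
```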

Limitations

  • Trained on Vietnamese addresses only; it will not generalise to free-form Vietnamese text or to addresses from other countries.
  • Uses syllable-level input, so no VnCoreNLP word segmentation is required.
  • Class imbalance: the place-name label (POI) is the rarest, so its precision is lower than that of the city (PROVINCE) and WARD labels.

Citation

If you use this model, please credit the base model and dataset sources:

@misc{phobert,
  title  = {PhoBERT: Pre-trained language models for Vietnamese},
  author = {Dat Quoc Nguyen and Anh Tuan Nguyen},
  year   = {2020}
}

Model size: 16.9M params (Safetensors, tensor type F32)