viet-address-v2

Fine-tuned PhoBERT for Vietnamese address Named Entity Recognition. Extracts structured fields from free-form Vietnamese addresses (place name, house number, street, ward, district, city).

Source artefact: bert_vietmap_gmaps_p6ss_20260620_222446
Base model: outputs/pretrain/bert_compact_v2
Trained on: Vietnamese address corpus (mixed sources)
Saved at: 2026-06-20T16:39:26.627984+00:00

Metrics (self-reported)

Metric	Value
Precision	0.8657
Recall	0.9946
F1	0.9257
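
The three figures are mutually consistent: F1 is the harmonic mean of precision and recall. The short check below recomputes it from the two reported values; the function name `f1_score` is only illustrative. Recall being much higher than precision suggests that the model's errors are mostly spurious or mislabelled entities rather than missed ones.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the metrics table above.
f1 = f1_score(0.8657, 0.9946)
print(round(f1, 4))  # 0.9257
```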

Labels

<PAD>
B-DISTRICT
B-NEIGHBORHOOD
B-POI
B-PREMISE
B-PROVINCE
B-ROUTE
B-STREET_NUMBER
B-SUBPREMISE
B-WARD
I-DISTRICT
I-NEIGHBORHOOD
I-POI
I-PREMISE
I-PROVINCE
I-ROUTE
I-STREET_NUMBER
I-SUBPREMISE
I-WARD
O
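
The label set follows the BIO convention: `B-X` opens an entity of type `X`, `I-X` continues it, and `O` marks tokens outside any entity. As a minimal sketch of how such tags can be grouped back into address fields, the helper below (`bio_to_spans`, a hypothetical name) walks a tagged sequence. The tokens and tags in the example are hand-written for illustration and are not actual model output.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    cur_type, cur_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "B" or (prefix == "I" and etype != cur_type):
            # Start a new span; a stray I- tag is treated leniently as a start.
            if cur_type is not None:
                spans.append((cur_type, " ".join(cur_tokens)))
            cur_type, cur_tokens = etype, [token]
        elif prefix == "I":
            # Continue the current span.
            cur_tokens.append(token)
        else:
            # "O" or "<PAD>": close any open span.
            if cur_type is not None:
                spans.append((cur_type, " ".join(cur_tokens)))
            cur_type, cur_tokens = None, []
    if cur_type is not None:
        spans.append((cur_type, " ".join(cur_tokens)))
    return spans


# Hand-written example, assuming one tag per syllable.
tokens = ["123", "Nguyễn", "Huệ", ",", "Quận", "1"]
tags = ["B-STREET_NUMBER", "B-ROUTE", "I-ROUTE", "O", "B-DISTRICT", "I-DISTRICT"]
spans = bio_to_spans(tokens, tags)
print(spans)
```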

Quickstart

from transformers import pipeline

ner = pipeline("token-classification", model="open-thienhang-com/viet-address-v2", aggregation_strategy="simple")
ner("123 Nguyễn Huệ, Phường Bến Nghé, Quận 1, TP Hồ Chí Minh")
# Illustrative output; entity groups are the label names listed above without the B-/I- prefix:
# → [
#     {'entity_group': 'STREET_NUMBER', 'word': '123', ...},
#     {'entity_group': 'ROUTE',         'word': 'Nguyễn Huệ', ...},
#     {'entity_group': 'WARD',          'word': 'Phường Bến Nghé', ...},
#     {'entity_group': 'DISTRICT',      'word': 'Quận 1', ...},
#     {'entity_group': 'PROVINCE',      'word': 'TP Hồ Chí Minh', ...},
#   ]
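
With `aggregation_strategy="simple"` the pipeline returns a list of dicts carrying `entity_group` and `word` keys. A small post-processing step can turn that list into one record per address. The sketch below assumes only those two keys; `to_address_fields` is a hypothetical helper, and the sample list is hand-written rather than real model output.

```python
from collections import defaultdict


def to_address_fields(entities):
    """Collapse pipeline entities into a {label: text} dict."""
    fields = defaultdict(list)
    for ent in entities:
        fields[ent["entity_group"]].append(ent["word"])
    # If a label occurs more than once, join the pieces with a comma.
    return {label: ", ".join(words) for label, words in fields.items()}


# Hand-written sample in the shape the pipeline returns.
sample = [
    {"entity_group": "WARD", "word": "Phường Bến Nghé"},
    {"entity_group": "DISTRICT", "word": "Quận 1"},
]
record = to_address_fields(sample)
print(record)
```

In practice the argument would be the return value of `ner(...)` from the quickstart above.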

Or run inference manually:

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("open-thienhang-com/viet-address-v2")
model = AutoModelForTokenClassification.from_pretrained("open-thienhang-com/viet-address-v2").eval()

inputs = tokenizer("123 Nguyễn Huệ, Phường Bến Nghé, Quận 1, TP HCM",
                   return_tensors="pt", truncation=True, max_length=256)
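with torch.no_grad():
    logits = model(**inputs).logits
# Most likely label id for each subword token of the single input sequence
pred_ids = logits.argmax(-1)[0].tolist()
labels = [model.config.id2label[i] for i in pred_ids]
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
for token, label in zip(tokens, labels):
    # Skip non-entity tags and the tokenizer's special tokens
    if label not in ("O", "<PAD>") and token not in tokenizer.all_special_tokens:
        print(f"  {token:20s} → {label}")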

Limitations

Trained on Vietnamese addresses only; it will not generalise to free-form Vietnamese text or to addresses from other countries.
Uses syllable-level input, so no VnCoreNLP word segmentation is required.
Class imbalance: POI (place name) is the rarest label, so its precision is lower than that of PROVINCE (city) or WARD.
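
To see how class imbalance affects individual labels on your own data, precision can be broken down per label on a held-out set. The sketch below assumes entities are represented as `(label, start, end)` tuples and counts a prediction as correct only on an exact match. `per_label_precision` is a hypothetical helper, and the spans are toy data, not results from this model.

```python
from collections import Counter


def per_label_precision(gold, pred):
    """Exact-match precision per label over (label, start, end) spans."""
    gold_set = set(gold)
    total = Counter(span[0] for span in pred)
    correct = Counter(span[0] for span in pred if span in gold_set)
    return {label: correct[label] / total[label] for label in total}


# Toy spans for illustration only.
gold = [("POI", 0, 2), ("WARD", 3, 5)]
pred = [("POI", 0, 2), ("POI", 6, 7), ("WARD", 3, 5)]
scores = per_label_precision(gold, pred)
print(scores)
```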

Citation

If you use this model, please credit the base model and dataset sources:

@misc{phobert,
  title  = {PhoBERT: Pre-trained language models for Vietnamese},
  author = {Dat Quoc Nguyen and Anh Tuan Nguyen},
  year   = {2020}
}


Model size: 16.9M params (Safetensors, tensor type F32)
