viet-address-v2
Fine-tuned PhoBERT for Vietnamese address Named Entity Recognition. Extracts structured fields from free-form Vietnamese addresses (place name, house number, street, ward, district, city).
Source artefact: bert_vietmap_gmaps_p6ss_20260620_222446
Base model: outputs/pretrain/bert_compact_v2
Trained on: Vietnamese address corpus (mixed sources)
Saved at: 2026-06-20T16:39:26.627984+00:00
Metrics
| Metric | Value |
|---|---|
| Precision | 0.8657 |
| Recall | 0.9946 |
| F1 | 0.9257 |
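The F1 value in the table is the harmonic mean of the reported precision and recall. A minimal check of that arithmetic, using only the two numbers from the table (the variable names are illustrative):

```python
# Precision and recall as reported in the Metrics table.
precision = 0.8657
recall = 0.9946

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.9257
```

The gap between recall and precision indicates that the model rarely misses an entity but produces a noticeable share of false positives.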
Labels
<PAD>, B-DISTRICT, B-NEIGHBORHOOD, B-POI, B-PREMISE, B-PROVINCE, B-ROUTE, B-STREET_NUMBER, B-SUBPREMISE, B-WARD, I-DISTRICT, I-NEIGHBORHOOD, I-POI, I-PREMISE, I-PROVINCE, I-ROUTE, I-STREET_NUMBER, I-SUBPREMISE, I-WARD, O
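The tag set follows the BIO convention: a `B-` tag opens an entity, `I-` tags continue it, and `O` marks tokens outside any entity. The sketch below shows one way to group such tags into spans. The helper `bio_to_spans` and the sample tokens and tags are hypothetical and hand-written for illustration; they are not model output.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the span that is currently open, if any.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", "<PAD>", or an I- tag that does not continue the open span.
            flush()
    flush()
    return spans


# Hand-written example tags (an assumption, not a real prediction).
tokens = ["123", "Nguyễn", "Huệ", ",", "Quận", "1"]
tags = ["B-STREET_NUMBER", "B-ROUTE", "I-ROUTE", "O", "B-DISTRICT", "I-DISTRICT"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

In this sketch a stray `I-` tag with no matching open span is dropped; starting a new span instead would be an equally valid design choice.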
Quickstart
from transformers import pipeline
ner = pipeline("token-classification", model="open-thienhang-com/viet-address-v2", aggregation_strategy="simple")
ner("123 Nguyễn Huệ, Phường Bến Nghé, Quận 1, TP Hồ Chí Minh")
# Illustrative output shape; entity_group values are the tags listed under
# Labels with the B-/I- prefix removed:
# → [
# {'entity_group': 'STREET_NUMBER', 'word': '123', ...},
# {'entity_group': 'ROUTE', 'word': 'Nguyễn Huệ', ...},
# {'entity_group': 'WARD', 'word': 'Phường Bến Nghé', ...},
# {'entity_group': 'DISTRICT', 'word': 'Quận 1', ...},
# {'entity_group': 'PROVINCE', 'word': 'TP Hồ Chí Minh', ...},
# ]
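With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries with `entity_group`, `score`, `word`, `start` and `end` keys. The sketch below collects such a list into a field dictionary. The helper `to_fields`, the `min_score` threshold, and the sample entities with their scores are hypothetical; the sample uses tag names from the Labels list and is hand-written rather than real model output.

```python
from collections import defaultdict
from typing import Dict, List


def to_fields(entities: List[dict], min_score: float = 0.0) -> Dict[str, List[str]]:
    """Collect aggregated pipeline entities into {entity_group: [words]}."""
    fields: Dict[str, List[str]] = defaultdict(list)
    for ent in entities:
        # Skip low-confidence predictions; entries without a score are kept.
        if ent.get("score", 1.0) >= min_score:
            fields[ent["entity_group"]].append(ent["word"].strip())
    return dict(fields)


# Hand-written sample in the pipeline's output shape (assumed values).
sample = [
    {"entity_group": "STREET_NUMBER", "score": 0.99, "word": "123", "start": 0, "end": 3},
    {"entity_group": "ROUTE", "score": 0.98, "word": "Nguyễn Huệ", "start": 4, "end": 14},
    {"entity_group": "DISTRICT", "score": 0.40, "word": "Quận 1", "start": 33, "end": 39},
]
fields = to_fields(sample, min_score=0.5)
print(fields)
```

Values are kept as lists because an address can contain more than one span of the same type. Given the reported precision, a score threshold may be worth tuning on your own data.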
Or manual:
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
tokenizer = AutoTokenizer.from_pretrained("open-thienhang-com/viet-address-v2")
model = AutoModelForTokenClassification.from_pretrained("open-thienhang-com/viet-address-v2").eval()
inputs = tokenizer("123 Nguyễn Huệ, Phường Bến Nghé, Quận 1, TP HCM",
                   return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    out = model(**inputs).logits
pred_ids = out.argmax(-1)[0].tolist()
labels = [model.config.id2label[i] for i in pred_ids]
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
for t, l in zip(tokens, labels):
    if l != "O" and l != "<PAD>":
        print(f" {t:20s} → {l}")
Limitations
- Trained on Vietnamese addresses only; it won't generalise to free-form Vietnamese text or to addresses from other countries.
- Uses syllable-level input (no vncorenlp word segmentation required).
- Class imbalance: PLACE_NAME is the rarest label, so its precision is lower than that of CITY / WARD.
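To see how class imbalance affects individual labels on your own data, precision can be broken down per entity type. The following is a minimal token-level sketch; the helper `per_label_precision` and the gold and predicted tag sequences are hypothetical, and this is not necessarily how the metrics reported above were computed.

```python
from collections import Counter
from typing import Dict, List


def per_label_precision(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Token-level precision per entity type, ignoring O and <PAD> predictions."""
    true_positive: Counter = Counter()
    predicted: Counter = Counter()
    for g, p in zip(gold, pred):
        if p in ("O", "<PAD>"):
            continue
        label = p.split("-", 1)[1]  # "B-POI" -> "POI"
        predicted[label] += 1
        if g == p:
            true_positive[label] += 1
    return {label: true_positive[label] / n for label, n in predicted.items()}


# Hand-written tags for illustration only.
gold = ["B-POI", "I-POI", "O", "B-WARD"]
pred = ["B-POI", "O", "B-POI", "B-WARD"]
scores = per_label_precision(gold, pred)
print(scores)
```

For span-level scores, which are the usual convention for NER, a library such as seqeval is a common choice.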
Citation
If you use this model, please credit the base model and dataset sources:
@misc{phobert,
title = {PhoBERT: Pre-trained language models for Vietnamese},
author = {Dat Quoc Nguyen and Anh Tuan Nguyen},
year = {2020}
}