g2pW-canto-20241206-bert-base

This is a Cantonese G2P (grapheme-to-phoneme) model trained on the Naozumi0512/g2p-Cantonese-aggregate-pos-retag dataset and evaluated on the yue-g2p-benchmark.

Model Overview

The model uses hon9kon9ize/bert-base-cantonese as its BERT backbone. For more details, see https://github.com/Naozumi520/g2pW-Cantonese.
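
The BERT backbone can be loaded on its own with the Hugging Face transformers library. The sketch below only exercises the encoder; the g2pW prediction head itself lives in the repository linked above:

```python
from transformers import AutoModel, AutoTokenizer

# Load the Cantonese BERT backbone this g2pW model builds on
tokenizer = AutoTokenizer.from_pretrained("hon9kon9ize/bert-base-cantonese")
model = AutoModel.from_pretrained("hon9kon9ize/bert-base-cantonese")

# Encode a short Cantonese phrase and inspect the hidden states
inputs = tokenizer("廣東話", return_tensors="pt")
hidden = model(**inputs).last_hidden_state
print(hidden.shape)  # (batch_size, sequence_length, hidden_size)
```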


Dataset

The model was trained on the Naozumi0512/g2p-Cantonese-aggregate-pos-retag dataset, which includes:

  • 68,500 Cantonese words/phrases with corresponding Jyutping transcriptions.
  • Data is formatted to align with the CPP (Chinese Polyphones with Pinyin) structure.
  • Sources include:
    • Rime Cantonese Input Schema (jyut6ping3.words.dict.yaml)
    • 粵典 Words.hk
    • CantoDict
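
The dataset can be pulled directly from the Hugging Face Hub with the datasets library. This is a minimal inspection sketch; it assumes the default configuration loads without extra arguments, so check the dataset card for the exact schema:

```python
from datasets import load_dataset

# Load the aggregated Cantonese G2P dataset from the Hugging Face Hub
ds = load_dataset("Naozumi0512/g2p-Cantonese-aggregate-pos-retag")

# Show the available splits and one example record
print(ds)
first_split = next(iter(ds))
print(ds[first_split][0])
```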

Evaluation

The model was evaluated on the yue-g2p-benchmark:

| Metric               | Score  |
|----------------------|--------|
| Accuracy             | 0.6873 |
| Levenshtein Distance | 0.1789 |
| Phoneme Error Rate   | 0.2083 |
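
For context, phoneme error rate is typically computed as the Levenshtein (edit) distance between the predicted and reference phoneme sequences, normalised by the reference length. The sketch below illustrates the metric in general; it is not the benchmark's own scoring code:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalised by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


# Example: one wrong tone out of three syllables gives a PER of ~0.33
print(phoneme_error_rate(["gwong2", "dung1", "waa2"],
                         ["gwong2", "dung1", "waa6"]))
```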

Inference

Inference code for this release is available at https://github.com/Naozumi520/g2pW-Cantonese/tree/20241206-bert-base.
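
As a rough illustration only: the upstream g2pw package exposes a G2PWConverter class, and if the Cantonese fork keeps the same interface (an assumption — check the repository above for the actual entry point and checkpoint paths), usage would look roughly like this:

```python
# Hypothetical usage, assuming the fork mirrors the upstream g2pw
# G2PWConverter interface; see the linked repository for the real API.
from g2pw import G2PWConverter

converter = G2PWConverter()           # loads a default checkpoint (assumption)
print(converter("廣東話我識聽唔識講"))  # expected to return Jyutping syllables
```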
