metadata
language:
- yue
pretty_name: Cantonese (yue) G2PW model - bert base
tags:
- g2p
license: cc-by-4.0
task_categories:
- text2text-generation
datasets:
- Naozumi0512/g2p-Cantonese-aggregate-pos-retag
g2pW-canto-20241206-bert-base
This is a G2P (Grapheme-to-Phoneme) model trained on the Naozumi0512/g2p-Cantonese-aggregate-pos-retag dataset and evaluated on the yue-g2p-benchmark.
Model Overview
The model uses hon9kon9ize/bert-base-cantonese as its base BERT model. For more details, see https://github.com/Naozumi520/g2pW-Cantonese.
Dataset
The model was trained on the Naozumi0512/g2p-Cantonese-aggregate-pos-retag dataset, which includes:
- 68,500 Cantonese words/phrases with corresponding phonetic transcriptions.
- Entries formatted to follow the CPP (Chinese Polyphones with Pinyin) structure.
- Sources include:
  - Rime Cantonese Input Schema (jyut6ping3.words.dict.yaml)
  - 粵典 Words.hk
  - CantoDict
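To make the CPP-style layout concrete, here is a minimal sketch of how a word and its Jyutping transcription could be expanded into one example per character. It assumes the upstream g2pW convention of wrapping the target character in the `▁` (U+2581) anchor and pairing it with a single label; the helper name `to_cpp_examples` is hypothetical, and the actual columns and preprocessing of this dataset may differ.

```python
# Assumption: g2pW-style CPP data marks the target character with "▁" (U+2581).
ANCHOR = "\u2581"


def to_cpp_examples(word, jyutping):
    """Yield (sentence, label) pairs, one per character of `word`.

    `jyutping` is a space-separated string with one syllable per character.
    """
    syllables = jyutping.split()
    if len(word) != len(syllables):
        raise ValueError("expected one Jyutping syllable per character")
    for i, syllable in enumerate(syllables):
        # Wrap only the i-th character in anchors; the rest is context.
        sentence = word[:i] + ANCHOR + word[i] + ANCHOR + word[i + 1:]
        yield sentence, syllable


examples = list(to_cpp_examples("廣東話", "gwong2 dung1 waa2"))
for sentence, label in examples:
    print(sentence, label)
```

Each printed line pairs the anchored sentence with the Jyutping label for the marked character.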
Evaluation
The model was evaluated on the yue-g2p-benchmark:
| Metric | Score |
|---|---|
| Accuracy | 0.6873 |
| Levenshtein Distance | 0.1789 |
| Phoneme Error Rate | 0.2083 |
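For readers unfamiliar with these metrics, the sketch below shows one plausible way to compute them over Jyutping strings. The exact definitions used by yue-g2p-benchmark are not stated here, so the choices in this example are assumptions: accuracy is exact string match, Levenshtein distance is character-level and normalized by the longer string, and the error rate treats each space-separated Jyutping syllable as one token. The function names are illustrative only.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def score(predictions, references):
    """Return accuracy, mean normalized edit distance, and token error rate."""
    exact, norm_dist, errors, ref_tokens = 0, 0.0, 0, 0
    for pred, ref in zip(predictions, references):
        exact += pred == ref
        # Assumption: normalize the character-level distance by the longer string.
        norm_dist += levenshtein(pred, ref) / max(len(pred), len(ref), 1)
        # Assumption: one Jyutping syllable counts as one token.
        pred_tokens, ref_toks = pred.split(), ref.split()
        errors += levenshtein(pred_tokens, ref_toks)
        ref_tokens += len(ref_toks)
    n = len(references)
    return {
        "accuracy": exact / n,
        "levenshtein": norm_dist / n,
        "per": errors / ref_tokens,
    }


result = score(
    predictions=["gwong2 dung1 waa2", "hoeng1 gong2"],
    references=["gwong2 dung1 waa6", "hoeng1 gong2"],
)
print(result)
```

In this toy input, one of two predictions matches exactly and one of five reference syllables is wrong, so the accuracy is 0.5 and the token error rate is 0.2.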
Inference
For inference, see https://github.com/Naozumi520/g2pW-Cantonese/tree/20241206-bert-base