---
language:
- yue
pretty_name: "Cantonese (yue) G2PW model - bert base"
tags:
- g2p
license: "cc-by-4.0"
task_categories:
- text2text-generation
datasets:
- Naozumi0512/g2p-Cantonese-aggregate-pos-retag
---

# g2pW-canto-20241206-bert-base

This is a **G2P (Grapheme-to-Phoneme)** model trained on the [Naozumi0512/g2p-Cantonese-aggregate-pos-retag](https://huggingface.co/datasets/Naozumi0512/g2p-Cantonese-aggregate-pos-retag) dataset and evaluated on the [yue-g2p-benchmark](https://github.com/hon9kon9ize/yue-g2p-benchmark).

## Model Overview

The model uses **[hon9kon9ize/bert-base-cantonese](https://huggingface.co/hon9kon9ize/bert-base-cantonese)** as its base model. For more details, see the [g2pW-Cantonese repository](https://github.com/Naozumi520/g2pW-Cantonese).

---

## Dataset

The model was trained on the [Naozumi0512/g2p-Cantonese-aggregate-pos-retag](https://huggingface.co/datasets/Naozumi0512/g2p-Cantonese-aggregate-pos-retag) dataset, which includes:

- **68,500 Cantonese words/phrases** with corresponding phonetic transcriptions.
- Data formatted to follow the **CPP (Chinese Polyphones with Pinyin)** structure.
- Sources include:
  - Rime Cantonese Input Schema (`jyut6ping3.words.dict.yaml`)
  - 粵典 Words.hk
  - CantoDict
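
To make the CPP-style structure mentioned above more concrete, here is a minimal sketch of reading one example. It assumes the layout used by the original g2pW CPP data, where the target character in a sentence is wrapped in `▁` (U+2581) markers and its pronunciation is stored as a separate label. The helper name `parse_cpp_example`, the sample sentence, and the returned field names are hypothetical and are not taken from this dataset; check the dataset card for the actual fields.

```python
# Marker assumed to surround the target character, as in g2pW's CPP data.
ANCHOR = "\u2581"  # "▁"


def parse_cpp_example(sentence: str, label: str) -> dict:
    """Split an anchored sentence into plain text, target position, and label.

    Hypothetical helper for illustration only.
    """
    first = sentence.find(ANCHOR)
    second = sentence.find(ANCHOR, first + 1)
    # Expect exactly one character between the two markers.
    if first == -1 or second == -1 or second - first != 2:
        raise ValueError("expected exactly one character wrapped in anchor markers")

    target_char = sentence[first + 1]
    plain_text = sentence.replace(ANCHOR, "")
    # After the markers are removed, the target sits at the index
    # where the first marker used to be.
    return {
        "text": plain_text,
        "position": first,
        "char": target_char,
        "label": label,
    }


# Illustrative example (not a row from the dataset): 行 read as "haang4".
example = parse_cpp_example("我哋\u2581行\u2581街", "haang4")
```

Keeping the position alongside the plain text lets a model look up the contextual embedding of exactly the character whose reading is being predicted.
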

---

## Evaluation

The model was evaluated on the [yue-g2p-benchmark](https://github.com/hon9kon9ize/yue-g2p-benchmark):

| Metric                   | Score  |
|--------------------------|--------|
| **Accuracy**             | 0.6873 |
| **Levenshtein Distance** | 0.1789 |
| **Phoneme Error Rate**   | 0.2083 |
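
For readers unfamiliar with these metrics, the sketch below shows one plausible way to compute them from reference and predicted Jyutping strings. It is not the benchmark's implementation: the normalisation of the Levenshtein distance by the longer string, and the treatment of each space-separated Jyutping syllable as one unit for the error rate, are assumptions made for illustration. The function names `edit_distance` and `evaluate` are hypothetical.

```python
from typing import Dict, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Classic Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def evaluate(references: List[str], predictions: List[str]) -> Dict[str, float]:
    """Compute exact-match accuracy, mean normalised distance, and error rate."""
    if not references or len(references) != len(predictions):
        raise ValueError("need two non-empty lists of equal length")

    exact = 0
    distance_sum = 0.0
    unit_errors = 0
    unit_total = 0
    for ref, hyp in zip(references, predictions):
        exact += int(ref == hyp)
        # Assumption: character-level distance, normalised by the longer string.
        distance_sum += edit_distance(ref, hyp) / max(len(ref), len(hyp), 1)
        # Assumption: one space-separated Jyutping syllable counts as one unit.
        ref_units, hyp_units = ref.split(), hyp.split()
        unit_errors += edit_distance(ref_units, hyp_units)
        unit_total += len(ref_units)

    n = len(references)
    return {
        "accuracy": exact / n,
        "levenshtein_distance": distance_sum / n,
        "phoneme_error_rate": unit_errors / max(unit_total, 1),
    }


# Toy data for illustration; these are not benchmark items.
scores = evaluate(
    references=["nei5 hou2", "haang4 gaai1"],
    predictions=["nei5 hou2", "hang4 gaai1"],
)
```

Under these definitions, accuracy rewards only fully correct transcriptions, while the two distance-based scores give partial credit, which is why a model can have a moderate accuracy and comparatively low error rates.
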

---

## Inference

Inference code is available at: https://github.com/Naozumi520/g2pW-Cantonese/tree/20241206-bert-base