---
language:
- yue
pretty_name: "Cantonese (yue) G2PW model - bert base"
tags:
- g2p
license: "cc-by-4.0"
task_categories:
- text2text-generation
datasets:
- Naozumi0512/g2p-Cantonese-aggregate-pos-retag
---

# g2pW-canto-20241206-bert-base

This is a **G2P (Grapheme-to-Phoneme)** model trained on the [Naozumi0512/g2p-Cantonese-aggregate-pos-retag](https://huggingface.co/datasets/Naozumi0512/g2p-Cantonese-aggregate-pos-retag) dataset and evaluated on the [yue-g2p-benchmark](https://github.com/hon9kon9ize/yue-g2p-benchmark).

## Model Overview

The model uses **[hon9kon9ize/bert-base-cantonese](https://huggingface.co/hon9kon9ize/bert-base-cantonese)**. For more details see https://github.com/Naozumi520/g2pW-Cantonese.

---

## Dataset

The model was trained on the [Naozumi0512/g2p-Cantonese-aggregate-pos-retag](https://huggingface.co/datasets/Naozumi0512/g2p-Cantonese-aggregate-pos-retag) dataset, which includes:

- **68,500 Cantonese words/phrases** with corresponding phonetic transcriptions.
- Data is formatted to align with the **CPP (Chinese Polyphones with Pinyin)** structure.
- Sources include:
  - Rime Cantonese Input Schema (`jyut6ping3.words.dict.yaml`)
  - 粵典 Words.hk
  - CantoDict

---

## Evaluation

The model was evaluated on the [yue-g2p-benchmark](https://github.com/hon9kon9ize/yue-g2p-benchmark):

| Metric                   | Score  |
|--------------------------|--------|
| **Accuracy**             | 0.6873 |
| **Levenshtein Distance** | 0.1789 |
| **Phoneme Error Rate**   | 0.2083 |

---

## Inference

https://github.com/Naozumi520/g2pW-Cantonese/tree/20241206-bert-base
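
As a rough illustration of the three metrics reported in the Evaluation table, the sketch below scores predicted Jyutping strings against references. It is not the benchmark's code: the `score` and `levenshtein` helpers are hypothetical, and the exact definitions are assumptions (accuracy as exact string match, Levenshtein distance normalized by the longer string's length, and phoneme error rate counted over space-separated syllables). The yue-g2p-benchmark repository holds the authoritative definitions.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def score(references, predictions):
    """Score predictions against references (assumed metric definitions, see lead-in)."""
    exact = 0
    char_dist = 0.0
    unit_errors = 0
    unit_total = 0
    for ref, hyp in zip(references, predictions):
        exact += ref == hyp
        # Character-level distance, normalized by the longer string (assumption).
        char_dist += levenshtein(ref, hyp) / max(len(ref), len(hyp), 1)
        # Error rate over space-separated Jyutping syllables (assumption).
        ref_units, hyp_units = ref.split(), hyp.split()
        unit_errors += levenshtein(ref_units, hyp_units)
        unit_total += len(ref_units)
    n = len(references)
    return {
        "accuracy": exact / n,
        "levenshtein_distance": char_dist / n,
        "phoneme_error_rate": unit_errors / unit_total,
    }


if __name__ == "__main__":
    refs = ["nei5 hou2", "hoeng1 gong2"]
    hyps = ["nei5 hou2", "hoeng1 gong1"]
    print(score(refs, hyps))
```

Under these assumptions, lower is better for the two distance-based metrics and higher is better for accuracy, which matches how the table above reads.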