metadata
language:
  - yue
pretty_name: Cantonese (yue) G2PW model - bert base
tags:
  - g2p
license: cc-by-4.0
task_categories:
  - text2text-generation
datasets:
  - Naozumi0512/g2p-Cantonese-aggregate-pos-retag

g2pW-canto-20241206-bert-base

This is a G2P (Grapheme-to-Phoneme) model trained on the Naozumi0512/g2p-Cantonese-aggregate-pos-retag dataset and evaluated on the yue-g2p-benchmark.

Model Overview

The model uses hon9kon9ize/bert-base-cantonese as its BERT backbone. For more details, see https://github.com/Naozumi520/g2pW-Cantonese.


Dataset

The model was trained on the Naozumi0512/g2p-Cantonese-aggregate-pos-retag dataset, which includes:

  • 68,500 Cantonese words/phrases with corresponding phonetic transcriptions.
  • Data is formatted to align with the CPP (Chinese Polyphones with Pinyin) structure.
  • Sources include:
    • Rime Cantonese Input Schema (jyut6ping3.words.dict.yaml)
    • 粵典 Words.hk
    • CantoDict
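
As an illustration of what a CPP-style record could look like, the sketch below assumes the convention used by the upstream g2pW project: each sentence marks the target character by wrapping it in U+2581 (▁), and the label is stored separately. This card does not spell out the layout, so the marker, the helper name `parse_cpp_line`, and the sample sentence are assumptions for illustration only; check the dataset files for the actual format.

```python
# Hypothetical sketch: parse one CPP-style record into plain text,
# the index of the target character, and its Jyutping label.
# ASSUMPTION: the target character is wrapped in U+2581 markers,
# as in the upstream g2pW CPP data. Not confirmed by this card.
ANCHOR = "\u2581"


def parse_cpp_line(sentence: str, label: str) -> tuple[str, int, str]:
    """Return (plain_text, target_index, label) for one anchored sentence."""
    parts = sentence.split(ANCHOR)
    # Exactly two markers around exactly one character are expected.
    if len(parts) != 3 or len(parts[1]) != 1:
        raise ValueError("expected exactly one anchored character")
    left, target, right = parts
    return left + target + right, len(left), label


# Illustrative sample (not taken from the dataset): the character 行
# is read "hong4" in 銀行 (bank).
text, index, label = parse_cpp_line("我喺銀\u2581行\u2581做嘢", "hong4")
print(text, index, text[index], label)
```

Keeping the target position as an index, rather than keeping the markers in the text, lets the unmarked sentence go straight to a tokenizer.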

Evaluation

The model was evaluated on the yue-g2p-benchmark:

Metric                 Score
Accuracy               0.6873
Levenshtein Distance   0.1789
Phoneme Error Rate     0.2083
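
This card does not define how the metrics are computed, so the benchmark's own scripts remain authoritative. As a rough sketch of how metrics of this kind are commonly computed, the example below treats accuracy as the exact-match rate over whole transcriptions and the error rate as total edit distance divided by total reference length. It splits Jyutping on spaces, so each syllable counts as one token. That is a simplification, and the function names and sample data are hypothetical.

```python
# Hypothetical sketch of exact-match accuracy and an edit-distance-based
# error rate. ASSUMPTION: tokens are space-separated Jyutping syllables;
# the benchmark may tokenize and normalize differently.
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance (insert, delete, substitute) between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def evaluate(refs: list[str], hyps: list[str]) -> dict[str, float]:
    """Compute exact-match accuracy and token error rate over paired lists."""
    exact = sum(r == h for r, h in zip(refs, hyps))
    edits = sum(levenshtein(r.split(), h.split()) for r, h in zip(refs, hyps))
    total = sum(len(r.split()) for r in refs)
    return {"accuracy": exact / len(refs), "error_rate": edits / total}


# Illustrative data only: the second prediction gets one syllable wrong.
refs = ["ngan4 hong4", "hang4 lou6"]
hyps = ["ngan4 hong4", "hong4 lou6"]
scores = evaluate(refs, hyps)
print(scores)  # {'accuracy': 0.5, 'error_rate': 0.25}
```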

Inference

For inference, see the 20241206-bert-base branch of the repository: https://github.com/Naozumi520/g2pW-Cantonese/tree/20241206-bert-base