
	---
	language:
	- yue
	pretty_name: "Cantonese (yue) G2PW model - bert base"
	tags:
	- g2p
	license: "cc-by-4.0"
	task_categories:
	- text2text-generation
	datasets:
	- Naozumi0512/g2p-Cantonese-aggregate-pos-retag
	---

	# g2pW-canto-20241206-bert-base

	This is a G2P (Grapheme-to-Phoneme) model trained on the [Naozumi0512/g2p-Cantonese-aggregate-pos-retag](https://huggingface.co/datasets/Naozumi0512/g2p-Cantonese-aggregate-pos-retag) dataset and evaluated on the [yue-g2p-benchmark](https://github.com/hon9kon9ize/yue-g2p-benchmark).

	## Model Overview

	The model is built on [hon9kon9ize/bert-base-cantonese](https://huggingface.co/hon9kon9ize/bert-base-cantonese). For more details, see https://github.com/Naozumi520/g2pW-Cantonese.

	---

	## Dataset

	The model was trained on the [Naozumi0512/g2p-Cantonese-aggregate-pos-retag](https://huggingface.co/datasets/Naozumi0512/g2p-Cantonese-aggregate-pos-retag) dataset, which includes:

	- 68,500 Cantonese words/phrases with corresponding phonetic transcriptions.
	- Data formatted to align with the CPP (Chinese Polyphones with Pinyin) structure.
	- Sources:
	  - Rime Cantonese Input Schema (`jyut6ping3.words.dict.yaml`)
	  - 粵典 Words.hk
	  - CantoDict
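
To illustrate what "CPP structure" can look like in practice, here is a minimal sketch of parsing one CPP-style example. It assumes the convention used by the original g2pW CPP data, where the target character in a sentence is wrapped in a pair of `▁` (U+2581) markers and its label sits on the parallel line of a separate label file. Whether this dataset follows that convention exactly is an assumption; the function name `parse_cpp_line` and the sample sentence are hypothetical.

```python
ANCHOR = "\u2581"  # '▁': assumed marker placed on both sides of the target character


def parse_cpp_line(sentence: str, label: str):
    """Split one CPP-style example into (text, position, target, label).

    `sentence` is assumed to contain exactly one character wrapped in ANCHOR
    markers; `label` is the pronunciation given on the parallel label line.
    """
    start = sentence.index(ANCHOR)           # first marker
    end = sentence.index(ANCHOR, start + 1)  # second marker
    if end - start != 2:
        raise ValueError("expected exactly one character between the markers")
    target = sentence[start + 1:end]
    text = sentence.replace(ANCHOR, "")      # sentence without the markers
    # After the markers are removed, the target character sits at index `start`.
    return text, start, target, label.strip()


# Hypothetical example: 長 read as zoeng2 ("to grow") rather than coeng4 ("long").
text, position, target, label = parse_cpp_line("佢▁長▁大咗", "zoeng2\n")
print(text, position, target, label)
```
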

	---

	## Evaluation

	The model was evaluated on the [yue-g2p-benchmark](https://github.com/hon9kon9ize/yue-g2p-benchmark):

	| Metric               | Score  |
	|----------------------|--------|
	| Accuracy             | 0.6873 |
	| Levenshtein Distance | 0.1789 |
	| Phoneme Error Rate   | 0.2083 |
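
For readers unfamiliar with these metrics, the sketch below shows one common way to compute exact-match accuracy and an edit-distance-based error rate over token sequences. It is not the benchmark's implementation: the tokenization, the normalization, and the helper names `levenshtein` and `evaluate` are assumptions made for illustration, and yue-g2p-benchmark may define its scores differently.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if tokens match
            ))
        prev = curr
    return prev[-1]


def evaluate(references, predictions):
    """Return exact-match accuracy and total edits divided by total reference length."""
    pairs = list(zip(references, predictions))
    exact = sum(ref == pred for ref, pred in pairs)
    edits = sum(levenshtein(ref, pred) for ref, pred in pairs)
    ref_len = sum(len(ref) for ref in references)
    return {"accuracy": exact / len(pairs), "error_rate": edits / ref_len}


# Hypothetical toy data: each item is a list of tokens (here, Jyutping syllables).
refs = [["nei5", "hou2"], ["coeng4"]]
preds = [["nei5", "hou2"], ["zoeng2"]]
scores = evaluate(refs, preds)
print(scores)
```

Dividing by the total reference length, rather than averaging per-item rates, keeps long items from being under-weighted; either choice is defensible, so check the benchmark code before comparing numbers.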

	---

	## Inference

	For inference, see the `20241206-bert-base` tree of the g2pW-Cantonese repository: https://github.com/Naozumi520/g2pW-Cantonese/tree/20241206-bert-base