---
language: 
- yue
pretty_name: "Cantonese (yue) G2PW model - bert base"
tags:
- g2p
license: "cc-by-4.0"
task_categories:
- text2text-generation
datasets:
- Naozumi0512/g2p-Cantonese-aggregate-pos-retag
---

# g2pW-canto-20241206-bert-base

This is a **G2P (Grapheme-to-Phoneme)** model trained on the [Naozumi0512/g2p-Cantonese-aggregate-pos-retag](https://huggingface.co/datasets/Naozumi0512/g2p-Cantonese-aggregate-pos-retag) dataset and evaluated on the [yue-g2p-benchmark](https://github.com/hon9kon9ize/yue-g2p-benchmark).

## Model Overview

The model uses **[hon9kon9ize/bert-base-cantonese](https://huggingface.co/hon9kon9ize/bert-base-cantonese)** as its base BERT model. For more details, see the [g2pW-Cantonese repository](https://github.com/Naozumi520/g2pW-Cantonese).

---

## Dataset

The model was trained on the [Naozumi0512/g2p-Cantonese-aggregate-pos-retag](https://huggingface.co/datasets/Naozumi0512/g2p-Cantonese-aggregate-pos-retag) dataset, which includes:

- **68,500 Cantonese words/phrases** with corresponding phonetic transcriptions.
- Data formatted to align with the **CPP (Chinese Polyphones with Pinyin)** structure.
- Entries drawn from the following sources:
  - Rime Cantonese Input Schema (`jyut6ping3.words.dict.yaml`)
  - 粵典 Words.hk
  - CantoDict

---

## Evaluation

The model was evaluated on the [yue-g2p-benchmark](https://github.com/hon9kon9ize/yue-g2p-benchmark):

| Metric                  | Score  |
|-------------------------|--------|
| **Accuracy**            | 0.9117 |
| **Phoneme Error Rate**  | 0.0274 |
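
To make the two metrics concrete, here is a minimal sketch of how such scores could be computed. It assumes accuracy is the share of exact-match transcriptions and that the error rate is the total edit distance divided by the total reference length, with each space-separated Jyutping syllable treated as one unit. The benchmark's own definitions and tokenisation may differ, and the function names `edit_distance` and `score` are hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def score(references: List[str], predictions: List[str]) -> Tuple[float, float]:
    """Return (accuracy, error_rate) for space-separated transcriptions.

    Assumption: one whitespace-separated token is one scoring unit.
    """
    exact = 0
    errors = 0
    total = 0
    for ref, hyp in zip(references, predictions):
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        exact += int(ref_tokens == hyp_tokens)
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    return exact / len(references), errors / total


# Toy data for illustration only, not taken from the benchmark.
references = ["nei5 hou2", "hoeng1 gong2"]
predictions = ["nei5 hou2", "hoeng1 gong1"]
accuracy, error_rate = score(references, predictions)
print(accuracy, error_rate)  # 0.5 0.25
```

In this toy example one of the two predictions matches exactly (accuracy 0.5), and one of the four reference units is wrong (error rate 0.25).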

---

## Inference

For inference, see the [`20241206-bert-base` tree of the g2pW-Cantonese repository](https://github.com/Naozumi520/g2pW-Cantonese/tree/20241206-bert-base).