---
language:
- id
- ms
license: apache-2.0
tags:
- g2p
- fill-mask
inference: false
---
# ID G2P BERT
ID G2P BERT is a phoneme de-masking model based on the [BERT](https://arxiv.org/abs/1810.04805) architecture. This model was trained from scratch on a modified [Malay/Indonesian lexicon](https://huggingface.co/datasets/bookbot/id_word2phoneme).
The model was trained with the [Keras](https://keras.io/) framework, and all training was done on Google Colaboratory. We adapted the [BERT Masked Language Modeling training script](https://keras.io/examples/nlp/masked_language_modeling) from the official Keras code examples.
## Model
| Model | #params | Arch. | Training/Validation data |
| ------------- | ------- | ----- | ------------------------ |
| `id-g2p-bert` | 200K | BERT | Malay/Indonesian Lexicon |
![](./model.png)
## Training Procedure
<details>
<summary>Model Config</summary>
vocab_size: 32
max_len: 32
embed_dim: 128
num_attention_head: 2
feed_forward_dim: 128
num_layers: 2
</details>
<details>
<summary>Training Setting</summary>
batch_size: 32
optimizer: "adam"
learning_rate: 0.001
epochs: 100
</details>
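
The model configuration and training settings listed above can be kept together in one object. The following is a minimal sketch, assuming a plain Python dataclass. The class name `G2PConfig` and the derived `head_dim` property are hypothetical and are not taken from the released training script; only the field names and values come from the lists above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class G2PConfig:  # hypothetical name, for illustration only
    # Model config (values from this README)
    vocab_size: int = 32
    max_len: int = 32
    embed_dim: int = 128
    num_attention_head: int = 2
    feed_forward_dim: int = 128
    num_layers: int = 2
    # Training setting (values from this README)
    batch_size: int = 32
    optimizer: str = "adam"
    learning_rate: float = 0.001
    epochs: int = 100

    @property
    def head_dim(self) -> int:
        # Assumption: the embedding is split evenly across attention heads.
        return self.embed_dim // self.num_attention_head


config = G2PConfig()
print(config.head_dim)
```

Freezing the dataclass keeps the hyperparameters from being changed by accident once training code starts reading them.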
## How to Use
<details>
<summary>Tokenizers</summary>
id2token = {
0: '',
1: '[UNK]',
2: 'a',
3: 'n',
4: 'ə',
5: 'i',
6: 'r',
7: 'k',
8: 'm',
9: 't',
10: 'u',
11: 'g',
12: 's',
13: 'b',
14: 'p',
15: 'l',
16: 'd',
17: 'o',
18: 'e',
19: 'h',
20: 'c',
21: 'y',
22: 'j',
23: 'w',
24: 'f',
25: 'v',
26: '-',
27: 'z',
28: "'",
29: 'q',
30: '[mask]'
}
token2id = {
'': 0,
"'": 28,
'-': 26,
'[UNK]': 1,
'[mask]': 30,
'a': 2,
'b': 13,
'c': 20,
'd': 16,
'e': 18,
'f': 24,
'g': 11,
'h': 19,
'i': 5,
'j': 22,
'k': 7,
'l': 15,
'm': 8,
'n': 3,
'o': 17,
'p': 14,
'q': 29,
'r': 6,
's': 12,
't': 9,
'u': 10,
'v': 25,
'w': 23,
'y': 21,
'z': 27,
'ə': 4
}
</details>
```py
import keras
import tensorflow as tf
import numpy as np

# `MaskedLanguageModel` is the custom model class from the training script.
# It must be defined or imported before loading, together with the
# `token2id` and `id2token` tables listed under "Tokenizers".
mlm_model = keras.models.load_model(
    "bert_mlm.h5", custom_objects={"MaskedLanguageModel": MaskedLanguageModel}
)

MAX_LEN = 32
mask_token_id = token2id["[mask]"]


def inference(sequence):
    # replace every "e" with the mask token; the model predicts those positions
    sequence = " ".join([c if c != "e" else "[mask]" for c in sequence])
    tokens = [token2id[c] for c in sequence.split()]

    # right-pad with the empty token (id 0) up to MAX_LEN
    pad = [token2id[""] for _ in range(MAX_LEN - len(tokens))]
    tokens = tokens + pad

    input_ids = np.array([tokens])
    prediction = mlm_model.predict(tf.convert_to_tensor(input_ids))

    # find the positions of the masked tokens (column indices of the batch row)
    masked_index = np.where(input_ids == mask_token_id)[1]

    # get predictions at the masked positions only
    mask_prediction = prediction[0][masked_index]
    predicted_ids = np.argmax(mask_prediction, axis=1)

    # replace each mask with its predicted token
    for i, idx in enumerate(masked_index):
        tokens[idx] = int(predicted_ids[i])

    # drop padding and join the tokens back into a string
    return "".join([id2token[t] for t in tokens if t != 0])


inference("mengembangkannya")
```
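
The masking and padding steps in `inference` can be checked without loading the model. The sketch below is a model-free illustration under stated assumptions: the helper names `encode` and `decode` are hypothetical, the vocabulary list repeats the token table above in ID order, unknown characters fall back to `[UNK]` (the original snippet raises a `KeyError` instead), and the "predictions" passed to `decode` are placeholders chosen by hand, not actual model output.

```python
MAX_LEN = 32

# Token table from this README, listed in ID order (index == token ID).
VOCAB = ["", "[UNK]", "a", "n", "ə", "i", "r", "k", "m", "t", "u", "g", "s",
         "b", "p", "l", "d", "o", "e", "h", "c", "y", "j", "w", "f", "v",
         "-", "z", "'", "q", "[mask]"]
token2id = {token: idx for idx, token in enumerate(VOCAB)}
id2token = dict(enumerate(VOCAB))

PAD_ID = token2id[""]
UNK_ID = token2id["[UNK]"]
MASK_ID = token2id["[mask]"]


def encode(word):
    """Mask every 'e', map characters to IDs, and right-pad to MAX_LEN."""
    ids = [MASK_ID if c == "e" else token2id.get(c, UNK_ID) for c in word]
    return ids + [PAD_ID] * (MAX_LEN - len(ids))


def decode(ids, predicted_ids):
    """Fill masked positions, in order, with predicted IDs and drop padding."""
    fill = iter(predicted_ids)
    filled = [next(fill) if t == MASK_ID else t for t in ids]
    return "".join(id2token[t] for t in filled if t != PAD_ID)


ids = encode("mengembangkannya")
masked_positions = [i for i, t in enumerate(ids) if t == MASK_ID]
print(masked_positions)

# Placeholder predictions (NOT model output): pretend both masks become "ə".
placeholder = [token2id["ə"]] * len(masked_positions)
print(decode(ids, placeholder))
```

In real use, `placeholder` would be replaced by the `np.argmax` result computed from the model's predictions at `masked_positions`.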
## Authors
ID G2P BERT was trained and evaluated by [Ananto Joyoadikusumo](https://anantoj.github.io/), [Steven Limcorn](https://stevenlimcorn.github.io/), and [Wilson Wongso](https://w11wo.github.io/). All computation and development were done on Google Colaboratory.
## Framework versions
- Keras 2.8.0
- TensorFlow 2.8.0