---
language:
- id
- ms
license: apache-2.0
tags:
- g2p
- fill-mask
inference: false
---

# ID G2P BERT

ID G2P BERT is a phoneme de-masking model based on the [BERT](https://arxiv.org/abs/1810.04805) architecture. It was trained from scratch on a modified [Malay/Indonesian lexicon](https://huggingface.co/datasets/bookbot/id_word2phoneme).

The model was trained with the [Keras](https://keras.io/) framework, and all training was done on Google Colaboratory. We adapted the [BERT Masked Language Modeling training script](https://keras.io/examples/nlp/masked_language_modeling) from the official Keras code examples.

## Model

| Model         | #params | Arch. | Training/Validation data |
| ------------- | ------- | ----- | ------------------------ |
| `id-g2p-bert` | 200K    | BERT  | Malay/Indonesian Lexicon |

![](./model.png)

## Training Procedure

<details>
<summary>Model Config</summary>

- vocab_size: 32
- max_len: 32
- embed_dim: 128
- num_attention_head: 2
- feed_forward_dim: 128
- num_layers: 2

</details>

<details>
<summary>Training Setting</summary>

- batch_size: 32
- optimizer: "adam"
- learning_rate: 0.001
- epochs: 100

</details>

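
For readers reproducing the setup, the values listed in this section can be gathered into a single configuration object. The sketch below is illustrative only: the class name `G2PConfig` and the derived `head_dim` property are assumptions, not the authors' code, and the even split of `embed_dim` across attention heads is a common convention that the source does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class G2PConfig:
    """Hypothetical container for the hyperparameters listed in this section."""

    # Model config
    vocab_size: int = 32
    max_len: int = 32
    embed_dim: int = 128
    num_attention_head: int = 2
    feed_forward_dim: int = 128
    num_layers: int = 2
    # Training setting
    batch_size: int = 32
    optimizer: str = "adam"
    learning_rate: float = 0.001
    epochs: int = 100

    @property
    def head_dim(self) -> int:
        """Per-head width if embed_dim is split evenly across heads (assumed)."""
        return self.embed_dim // self.num_attention_head


config = G2PConfig()
print(config)
```

Freezing the dataclass keeps the recorded hyperparameters from being changed by accident during an experiment; a different run would create a new instance with overridden fields.
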

## How to Use

<details>
<summary>Tokenizers</summary>

id2token = {
    0: '',
    1: '[UNK]',
    2: 'a',
    3: 'n',
    4: 'ə',
    5: 'i',
    6: 'r',
    7: 'k',
    8: 'm',
    9: 't',
    10: 'u',
    11: 'g',
    12: 's',
    13: 'b',
    14: 'p',
    15: 'l',
    16: 'd',
    17: 'o',
    18: 'e',
    19: 'h',
    20: 'c',
    21: 'y',
    22: 'j',
    23: 'w',
    24: 'f',
    25: 'v',
    26: '-',
    27: 'z',
    28: "'",
    29: 'q',
    30: '[mask]'
}

token2id = {
    '': 0,
    "'": 28,
    '-': 26,
    '[UNK]': 1,
    '[mask]': 30,
    'a': 2,
    'b': 13,
    'c': 20,
    'd': 16,
    'e': 18,
    'f': 24,
    'g': 11,
    'h': 19,
    'i': 5,
    'j': 22,
    'k': 7,
    'l': 15,
    'm': 8,
    'n': 3,
    'o': 17,
    'p': 14,
    'q': 29,
    'r': 6,
    's': 12,
    't': 9,
    'u': 10,
    'v': 25,
    'w': 23,
    'y': 21,
    'z': 27,
    'ə': 4
}

</details>


```py
import keras
import tensorflow as tf
import numpy as np

# `MaskedLanguageModel` is the custom class from the training script; it must be
# defined or imported before the model is loaded. `token2id` and `id2token` are
# the dictionaries listed under "Tokenizers" above.
mlm_model = keras.models.load_model(
    "bert_mlm.h5", custom_objects={"MaskedLanguageModel": MaskedLanguageModel}
)

MAX_LEN = 32
mask_token_id = token2id["[mask]"]


def inference(sequence):
    # mask every letter "e" and map each character to its token id
    sequence = " ".join([c if c != "e" else "[mask]" for c in sequence])
    tokens = [token2id[c] for c in sequence.split()]
    pad = [token2id[""] for _ in range(MAX_LEN - len(tokens))]

    tokens = tokens + pad
    token_array = np.array([tokens])
    input_ids = tf.convert_to_tensor(token_array)
    prediction = mlm_model.predict(input_ids)

    # find the positions of the masked tokens
    masked_index = np.where(token_array == mask_token_id)[1]

    # keep the predictions at the masked positions only
    mask_prediction = prediction[0][masked_index]
    predicted_ids = np.argmax(mask_prediction, axis=1)

    # replace each mask with its predicted token
    for i, idx in enumerate(masked_index):
        tokens[idx] = int(predicted_ids[i])

    # drop padding (id 0) and join the remaining tokens
    return "".join([id2token[t] for t in tokens if t != 0])


inference("mengembangkannya")
```

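
The masking and de-masking steps in `inference` can be checked without the model file. The sketch below is a pure-Python dry run under stated assumptions: `SYMBOLS` repeats the `id2token` table in id order, `encode`, `decode`, and `fake_scores` are hypothetical helpers, and the hand-written scores stand in for `mlm_model.predict`, so the result says nothing about the real model's accuracy. Unlike `inference`, this sketch maps unknown characters to `[UNK]` instead of raising a `KeyError`.

```python
# Symbols in id order, copied from the id2token table in the Tokenizers section.
SYMBOLS = [
    "", "[UNK]", "a", "n", "ə", "i", "r", "k", "m", "t", "u", "g", "s", "b", "p", "l",
    "d", "o", "e", "h", "c", "y", "j", "w", "f", "v", "-", "z", "'", "q", "[mask]",
]
id2token = dict(enumerate(SYMBOLS))
token2id = {token: idx for idx, token in id2token.items()}

MAX_LEN = 32
PAD_ID = token2id[""]
MASK_ID = token2id["[mask]"]
UNK_ID = token2id["[UNK]"]


def encode(word, max_len=MAX_LEN):
    """Mask every letter "e", map characters to ids, and right-pad to max_len."""
    if len(word) > max_len:
        raise ValueError(f"word is longer than {max_len} characters")
    ids = [MASK_ID if ch == "e" else token2id.get(ch, UNK_ID) for ch in word]
    return ids + [PAD_ID] * (max_len - len(ids))


def decode(ids, scores):
    """Fill each masked position with its highest-scoring id, then drop padding."""
    filled = list(ids)
    for pos, token_id in enumerate(ids):
        if token_id == MASK_ID:
            row = scores[pos]
            filled[pos] = max(range(len(row)), key=row.__getitem__)
    return "".join(id2token[t] for t in filled if t != PAD_ID)


def fake_scores(ids, winner):
    """Stand-in for the model output: one score row per position, favoring `winner`."""
    return [[1.0 if v == winner else 0.0 for v in range(len(SYMBOLS))] for _ in ids]


ids = encode("beli")
print(ids[:4])                                       # [13, 30, 15, 5]
print(decode(ids, fake_scores(ids, token2id["ə"])))  # bəli
```

With the real model, the rows of `scores` would come from `prediction[0]`, which holds one score vector per input position.
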
## Authors

ID G2P BERT was trained and evaluated by [Ananto Joyoadikusumo](https://anantoj.github.io/), [Steven Limcorn](https://stevenlimcorn.github.io/), and [Wilson Wongso](https://w11wo.github.io/). All computation and development were done on Google Colaboratory.

## Framework versions

- Keras 2.8.0
- TensorFlow 2.8.0