	---
	language:
	- id
	- ms
	license: apache-2.0
	tags:
	- g2p
	- fill-mask
	inference: false
	---

	# ID G2P BERT

	ID G2P BERT is a phoneme de-masking model based on the [BERT](https://arxiv.org/abs/1810.04805) architecture. This model was trained from scratch on a modified [Malay/Indonesian lexicon](https://huggingface.co/datasets/bookbot/id_word2phoneme).

	This model was trained with the [Keras](https://keras.io/) framework, and all training was done on Google Colaboratory. The training script is adapted from the [BERT Masked Language Modeling example](https://keras.io/examples/nlp/masked_language_modeling) in the official Keras code examples.

	## Model

	| Model         | #params | Arch. | Training/Validation data |
	| ------------- | ------- | ----- | ------------------------ |
	| `id-g2p-bert` | 200K    | BERT  | Malay/Indonesian Lexicon |

	![](./model.png)

	## Training Procedure

	<details>
	<summary>Model Config</summary>

	vocab_size: 32
	max_len: 32
	embed_dim: 128
	num_attention_head: 2
	feed_forward_dim: 128
	num_layers: 2

	</details>
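
As a rough sanity check on the reported parameter count, the sketch below estimates the number of weights implied by the config above. It assumes the layer layout of the Keras masked language modeling example (token and position embeddings, then per layer a multi-head attention block with four `embed_dim x embed_dim` projections, two layer normalizations, and a two-layer feed-forward network, followed by a dense output head over the vocabulary). The function `estimate_params` is a hypothetical helper, not part of this repository, and the actual count depends on the real implementation.

```python
def estimate_params(vocab_size=32, max_len=32, embed_dim=128,
                    feed_forward_dim=128, num_layers=2):
    """Estimate weights for a small BERT-style encoder (assumed layout)."""

    def dense(n_in, n_out):
        # Kernel plus bias of a fully connected layer.
        return n_in * n_out + n_out

    # Token embeddings plus position embeddings.
    embeddings = vocab_size * embed_dim + max_len * embed_dim
    # Query, key, value and output projections (assumed embed_dim wide in total).
    attention = 4 * dense(embed_dim, embed_dim)
    # Two layer normalizations per layer, each with a scale and an offset vector.
    layer_norms = 2 * 2 * embed_dim
    feed_forward = dense(embed_dim, feed_forward_dim) + dense(feed_forward_dim, embed_dim)
    # Output head projecting back to the vocabulary.
    head = dense(embed_dim, vocab_size)
    return embeddings + num_layers * (attention + layer_norms + feed_forward) + head


print(estimate_params())  # 211488
```

Under these assumptions the estimate is about 211K weights, which is in line with the roughly 200K reported in the table above.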

	<details>
	<summary>Training Setting</summary>

	batch_size: 32
	optimizer: "adam"
	learning_rate: 0.001
	epochs: 100

	</details>

	## How to Use

	<details>
	<summary>Tokenizers</summary>

	id2token = {
	0: '',
	1: '[UNK]',
	2: 'a',
	3: 'n',
	4: 'ə',
	5: 'i',
	6: 'r',
	7: 'k',
	8: 'm',
	9: 't',
	10: 'u',
	11: 'g',
	12: 's',
	13: 'b',
	14: 'p',
	15: 'l',
	16: 'd',
	17: 'o',
	18: 'e',
	19: 'h',
	20: 'c',
	21: 'y',
	22: 'j',
	23: 'w',
	24: 'f',
	25: 'v',
	26: '-',
	27: 'z',
	28: "'",
	29: 'q',
	30: '[mask]'
	}

	token2id = {
	'': 0,
	"'": 28,
	'-': 26,
	'[UNK]': 1,
	'[mask]': 30,
	'a': 2,
	'b': 13,
	'c': 20,
	'd': 16,
	'e': 18,
	'f': 24,
	'g': 11,
	'h': 19,
	'i': 5,
	'j': 22,
	'k': 7,
	'l': 15,
	'm': 8,
	'n': 3,
	'o': 17,
	'p': 14,
	'q': 29,
	'r': 6,
	's': 12,
	't': 9,
	'u': 10,
	'v': 25,
	'w': 23,
	'y': 21,
	'z': 27,
	'ə': 4
	}

	</details>
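
The two tables above are inverses of each other, so they can be rebuilt from a single vocabulary list. The sketch below does that and shows one way to turn a word into a padded sequence of ids, masking every `e` as the inference snippet in this section does. The names `VOCAB` and `encode` are hypothetical helpers, and mapping out-of-vocabulary characters to `[UNK]` is an assumption made here for illustration.

```python
# Vocabulary in id order, copied from the id2token table above.
VOCAB = ["", "[UNK]", "a", "n", "ə", "i", "r", "k", "m", "t", "u", "g", "s",
         "b", "p", "l", "d", "o", "e", "h", "c", "y", "j", "w", "f", "v",
         "-", "z", "'", "q", "[mask]"]

id2token = dict(enumerate(VOCAB))
token2id = {token: idx for idx, token in id2token.items()}

MAX_LEN = 32


def encode(word, max_len=MAX_LEN):
    """Mask every 'e', map characters to ids, and pad with id 0 to max_len."""
    if len(word) > max_len:
        raise ValueError(f"word is longer than {max_len} characters")
    pieces = ["[mask]" if c == "e" else c for c in word]
    # Assumption: characters outside the vocabulary fall back to [UNK].
    ids = [token2id.get(p, token2id["[UNK]"]) for p in pieces]
    return ids + [token2id[""]] * (max_len - len(ids))


print(encode("beli")[:6])  # [13, 30, 15, 5, 0, 0]
```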

	```py
	import keras
	import tensorflow as tf
	import numpy as np
	from huggingface_hub import from_pretrained_keras

	# token2id and id2token are the dictionaries listed under "Tokenizers" above.
	model = from_pretrained_keras("bookbot/id-g2p-bert")

	MAX_LEN = 32
	MASK_TOKEN_ID = 30

	def inference(sequence):
	    # mask every "e" so the model predicts the phoneme at that position
	    sequence = " ".join([c if c != "e" else "[mask]" for c in sequence])
	    tokens = [token2id[c] for c in sequence.split()]
	    pad = [token2id[""] for _ in range(MAX_LEN - len(tokens))]

	    tokens = tokens + pad
	    input_ids = tf.convert_to_tensor(np.array([tokens]))
	    prediction = model.predict(input_ids)

	    # find the positions of the masked tokens
	    masked_index = np.where(input_ids == MASK_TOKEN_ID)
	    masked_index = masked_index[1]

	    # get the predictions at the masked positions only
	    mask_prediction = prediction[0][masked_index]
	    predicted_ids = np.argmax(mask_prediction, axis=1)

	    # replace each mask with its predicted token
	    for i, idx in enumerate(masked_index):
	        tokens[idx] = int(predicted_ids[i])

	    # drop padding (id 0) and join the tokens back into a string
	    return "".join([id2token[t] for t in tokens if t != 0])

	inference("mengembangkannya")
	```
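
The mask-filling step can also be tried in isolation, without downloading the model. The sketch below is a plain-Python version of that step under stated assumptions: `fill_masks` is a hypothetical helper, and the `scores` matrix is made up for illustration rather than produced by the model.

```python
# Vocabulary in id order, copied from the id2token table above.
VOCAB = ["", "[UNK]", "a", "n", "ə", "i", "r", "k", "m", "t", "u", "g", "s",
         "b", "p", "l", "d", "o", "e", "h", "c", "y", "j", "w", "f", "v",
         "-", "z", "'", "q", "[mask]"]
id2token = dict(enumerate(VOCAB))

PAD_TOKEN_ID = 0
MASK_TOKEN_ID = 30


def fill_masks(token_ids, scores):
    """Replace each [mask] id with the highest-scoring id at that position."""
    filled = list(token_ids)
    for pos, token_id in enumerate(token_ids):
        if token_id == MASK_TOKEN_ID:
            row = scores[pos]
            filled[pos] = max(range(len(row)), key=row.__getitem__)
    # Drop padding and join the remaining tokens into a string.
    return "".join(id2token[t] for t in filled if t != PAD_TOKEN_ID)


# "b [mask] l i" followed by padding, as built by the inference snippet.
token_ids = [13, 30, 15, 5] + [PAD_TOKEN_ID] * 28
# Made-up scores: one row per position, one column per vocabulary entry.
scores = [[0.0] * len(VOCAB) for _ in token_ids]
scores[1][4] = 1.0  # pretend the model prefers "ə" at the masked position

result = fill_masks(token_ids, scores)
print(result)  # bəli
```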

	## Authors

	ID G2P BERT was trained and evaluated by [Ananto Joyoadikusumo](https://anantoj.github.io/), [Steven Limcorn](https://stevenlimcorn.github.io/), and [Wilson Wongso](https://w11wo.github.io/). All computation and development were done on Google Colaboratory.

	## Framework versions

	- Keras 2.8.0
	- TensorFlow 2.8.0