---
language: id
license: mit
datasets:
- oscar
- wikipedia
- id_newspapers_2018
widget:
- text: Saya [MASK] makan nasi goreng.
- text: Kucing itu sedang bermain dengan [MASK].
---
# Indonesian small BigBird model
**Disclaimer**: This is a work in progress. The current checkpoint has been trained for ~1.0 epoch (6,450 steps), reaching a train loss of 2.565 and an eval loss of 2.466. Newer checkpoints and additional information will be added in the future.
## Model Description
This model was pretrained with only the masked language modeling (MLM) objective. The architecture of this model is shown in the configuration snippet below. The tokenizer was trained on the whole cased dataset with a vocabulary size of only 30K.
```python
from transformers import BigBirdConfig

config = BigBirdConfig(
    vocab_size=30_000,
    hidden_size=512,
    num_hidden_layers=4,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=4096,
    is_encoder_decoder=False,
    attention_type="block_sparse",
)
```
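A randomly initialized model for MLM pretraining can then be created from this configuration, for example:

```python
from transformers import BigBirdForMaskedLM

# Instantiate a fresh (untrained) BigBird model from the configuration above.
model = BigBirdForMaskedLM(config)
```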
## How to use
TBD
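In the meantime, a minimal sketch using the `transformers` fill-mask pipeline might look like the following. The model ID below is a hypothetical placeholder; replace it with this repository's actual path.

```python
from transformers import pipeline

# NOTE: "<username>/bigbird-small-indonesian" is a placeholder model ID;
# point it at this repository once the checkpoint is published.
fill_mask = pipeline("fill-mask", model="<username>/bigbird-small-indonesian")

# [MASK] is the mask token, as in the widget examples above.
print(fill_mask("Saya [MASK] makan nasi goreng."))
```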
## Limitations and bias
TBD
## Training and evaluation data
This model was pretrained on Indonesian Wikipedia (dump file from 2022-10-20), the `unshuffled_deduplicated_id` subset of OSCAR, and Indonesian Newspapers 2018. Preprocessing uses the function from the Hugging Face task guide on language modeling, with a block size of 4096 tokens; a sketch is shown below. Each dataset is split using `train_test_split`, with 5% allocated as evaluation data.
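For reference, a sketch of the block-grouping step, adapted from the Hugging Face language-modeling task guide (`tokenized_dataset` is an assumed placeholder for the already-tokenized corpus):

```python
block_size = 4096

def group_texts(examples):
    # Concatenate every tokenized field, then chop the result into
    # fixed blocks of `block_size` tokens, dropping the remainder.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# `tokenized_dataset` is assumed to hold the tokenized corpus;
# 5% of each dataset is held out for evaluation.
lm_dataset = tokenized_dataset.map(group_texts, batched=True)
split = lm_dataset.train_test_split(test_size=0.05)
```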
## Training Procedure
TBD
## Evaluation
TBD