---
language: id
license: mit
datasets:
  - oscar
  - wikipedia
  - id_newspapers_2018
widget:
  - text: Saya [MASK] makan nasi goreng.
  - text: Kucing itu sedang bermain dengan [MASK].
---

# Indonesian small BigBird model

**Disclaimer**: This is a work in progress. The current checkpoint has been trained for ~6.0 epochs (38,700 steps) and reaches an eval loss of 2.087. A newer checkpoint and additional information will be added in the future.

## Model Description

This model was pretrained with the masked LM objective only. Its architecture is shown in the configuration snippet below. The tokenizer was trained on the whole cased dataset with a vocabulary size of only 30K.

```python
from transformers import BigBirdConfig

# Small BigBird: 4 layers, 512 hidden size, block sparse attention,
# with a maximum sequence length of 4096 tokens.
config = BigBirdConfig(
    vocab_size=30_000,
    hidden_size=512,
    num_hidden_layers=4,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=4096,
    is_encoder_decoder=False,
    attention_type="block_sparse",
)
```
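Since the card states the model was pretrained with the masked LM objective, it would be instantiated from this configuration with `BigBirdForMaskedLM`. A minimal sketch (an assumption for illustration, not this card's own training code):

```python
from transformers import BigBirdForMaskedLM

# Randomly initialized small BigBird with a masked LM head,
# built from the configuration above.
model = BigBirdForMaskedLM(config)
print(f"Parameters: {model.num_parameters():,}")
```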

## How to use

TBD
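In the meantime, a typical fill-mask usage would look like the sketch below. The repository id is an assumption based on this card's title and is not confirmed by the card:

```python
from transformers import pipeline

# Hypothetical repository id; replace with the actual model id.
fill_mask = pipeline("fill-mask", model="ilos-vigil/bigbird-small-indonesian")

# Example prompt taken from the widget metadata above.
print(fill_mask("Saya [MASK] makan nasi goreng."))
```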

## Limitations and bias

TBD

## Training and evaluation data

This model was pretrained on Indonesian Wikipedia (dump file from 2022-10-20), the unshuffled_deduplicated_id subset of OSCAR, and Indonesian Newspapers 2018. Preprocessing was done using the function from the Hugging Face language modeling task guide with a block size of 4096. Each dataset was split using train_test_split, with 5% allocated as evaluation data.
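A sketch of that preprocessing, shown for the OSCAR subset only (the other two corpora follow the same pattern); the local tokenizer path is an assumption:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

block_size = 4096
tokenizer = AutoTokenizer.from_pretrained(".")  # assumes a locally saved tokenizer

raw = load_dataset("oscar", "unshuffled_deduplicated_id", split="train")

def tokenize(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all texts, then slice into fixed-size blocks,
    # dropping the tail that does not fill a full block.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
lm_dataset = tokenized.map(group_texts, batched=True)
# 5% of each dataset is held out as evaluation data.
splits = lm_dataset.train_test_split(test_size=0.05)
```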

## Training Procedure

TBD
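While the exact procedure is still to be documented, masked LM pretraining with transformers typically combines `DataCollatorForLanguageModeling` with the `Trainer`. A minimal sketch under that assumption, reusing `model`, `tokenizer`, and `splits` from the snippets above; all hyperparameters here are illustrative, not the ones actually used:

```python
from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Standard MLM collator: randomly masks 15% of tokens in each batch.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bigbird-small-indonesian",  # illustrative values only
    per_device_train_batch_size=4,
    num_train_epochs=6,
    evaluation_strategy="steps",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    data_collator=collator,
)
trainer.train()
```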

## Evaluation

TBD
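The disclaimer above reports an eval loss of 2.087 at the ~6-epoch checkpoint. Assuming the `Trainer` setup sketched earlier, that number would be obtained as follows:

```python
metrics = trainer.evaluate()
print(metrics["eval_loss"])  # ~2.087 at the ~6-epoch checkpoint, per the disclaimer
```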