---
language: id
license: mit
datasets:
- oscar
- wikipedia
- id_newspapers_2018
widget:
- text: Saya [MASK] makan nasi goreng.
- text: Kucing itu sedang bermain dengan [MASK].
pipeline_tag: fill-mask
---

# Indonesian small BigBird model

## Source Code

Source code to create this model is available at [https://github.com/ilos-vigil/bigbird-small-indonesian](https://github.com/ilos-vigil/bigbird-small-indonesian).

## Downstream Task

* NLI/ZSC: [ilos-vigil/bigbird-small-indonesian-nli](https://huggingface.co/ilos-vigil/bigbird-small-indonesian-nli)

## Model Description

This **cased** model was pretrained with a Masked LM objective. It has ~30M parameters and was pretrained for 8 epochs (51474 steps), reaching an evaluation loss of 2.078 (perplexity 7.988). The architecture of the model is shown in the configuration snippet below. The tokenizer was trained on the whole dataset with a vocabulary size of 30K.

```py
from transformers import BigBirdConfig

config = BigBirdConfig(
    vocab_size=30_000,
    hidden_size=512,
    num_hidden_layers=4,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=4096,
    is_encoder_decoder=False,
    attention_type='block_sparse'
)
```
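As a quick sanity check on the parameter count, an untrained model can be instantiated from the configuration above and its parameters counted. This is a minimal sketch continuing from the `config` object in the snippet; use `from_pretrained` to load the actual released weights.

```py
from transformers import BigBirdForMaskedLM

# Build a randomly initialised model from the `config` object defined above
model = BigBirdForMaskedLM(config)

# Total number of parameters; roughly 30M for this configuration
print(sum(p.numel() for p in model.parameters()))
```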
## How to use

> Inference with Transformers pipeline (one MASK token)

```py
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='ilos-vigil/bigbird-small-indonesian')
>>> pipe('Saya sedang bermain [MASK] teman saya.')
[{'score': 0.7199566960334778, 'token': 14, 'token_str': 'dengan', 'sequence': 'Saya sedang bermain dengan teman saya.'},
 {'score': 0.12370546162128448, 'token': 17, 'token_str': 'untuk', 'sequence': 'Saya sedang bermain untuk teman saya.'},
 {'score': 0.0385284349322319, 'token': 331, 'token_str': 'bersama', 'sequence': 'Saya sedang bermain bersama teman saya.'},
 {'score': 0.012146958149969578, 'token': 28, 'token_str': 'oleh', 'sequence': 'Saya sedang bermain oleh teman saya.'},
 {'score': 0.009499032981693745, 'token': 25, 'token_str': 'sebagai', 'sequence': 'Saya sedang bermain sebagai teman saya.'}]
```

> Inference with PyTorch (one or multiple MASK tokens)

```py
import torch
from transformers import BigBirdTokenizerFast, BigBirdForMaskedLM
from pprint import pprint

tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian')
model = BigBirdForMaskedLM.from_pretrained('ilos-vigil/bigbird-small-indonesian')

topk = 5
text = 'Saya [MASK] bermain [MASK] teman saya.'

tokenized_text = tokenizer(text, return_tensors='pt')
raw_output = model(**tokenized_text)
tokenized_output = torch.topk(raw_output.logits, topk, dim=2).indices
score_output = torch.softmax(raw_output.logits, dim=2)

result = []
for position_idx in range(tokenized_text['input_ids'][0].shape[0]):
    # Collect the top-k predictions for every [MASK] position
    if tokenized_text['input_ids'][0][position_idx] == tokenizer.mask_token_id:
        outputs = []
        for token_idx in tokenized_output[0, position_idx]:
            output = {}
            output['score'] = score_output[0, position_idx, token_idx].item()
            output['token'] = token_idx.item()
            output['token_str'] = tokenizer.decode(output['token'])
            outputs.append(output)
        result.append(outputs)
pprint(result)
```

```py
[[{'score': 0.22353802621364594, 'token': 36, 'token_str': 'dapat'},
  {'score': 0.13962049782276154, 'token': 24, 'token_str': 'tidak'},
  {'score': 0.13610956072807312, 'token': 32, 'token_str': 'juga'},
  {'score': 0.0725034773349762, 'token': 584, 'token_str': 'bermain'},
  {'score': 0.033740025013685226, 'token': 38, 'token_str': 'akan'}],
 [{'score': 0.7111291885375977, 'token': 14, 'token_str': 'dengan'},
  {'score': 0.10754624754190445, 'token': 17, 'token_str': 'untuk'},
  {'score': 0.022657711058855057, 'token': 331, 'token_str': 'bersama'},
  {'score': 0.020862115547060966, 'token': 25, 'token_str': 'sebagai'},
  {'score': 0.013086902908980846, 'token': 11, 'token_str': 'di'}]]
```

## Limitations and bias

Due to its low parameter count and case-sensitive tokenizer/model, this model is expected to have lower performance on certain fine-tuned tasks. Like any language model, it also reflects biases from its training data, which comes from various sources. Here's an example of how the model can produce biased predictions:

```py
>>> pipe('Memasak dirumah adalah kewajiban seorang [MASK].')
[{'score': 0.16381049156188965, 'sequence': 'Memasak dirumah adalah kewajiban seorang budak.', 'token': 4910, 'token_str': 'budak'},
 {'score': 0.1334381103515625, 'sequence': 'Memasak dirumah adalah kewajiban seorang wanita.', 'token': 649, 'token_str': 'wanita'},
 {'score': 0.11588197946548462, 'sequence': 'Memasak dirumah adalah kewajiban seorang lelaki.', 'token': 6368, 'token_str': 'lelaki'},
 {'score': 0.061377108097076416, 'sequence': 'Memasak dirumah adalah kewajiban seorang diri.', 'token': 258, 'token_str': 'diri'},
 {'score': 0.04679233580827713, 'sequence': 'Memasak dirumah adalah kewajiban seorang gadis.', 'token': 6845, 'token_str': 'gadis'}]
```

## Training and evaluation data

This model was pretrained on [Indonesian Wikipedia](https://huggingface.co/datasets/wikipedia) (dump file from 2022-10-20), the `unshuffled_deduplicated_id` subset of [OSCAR](https://huggingface.co/datasets/oscar) and [Indonesian Newspaper 2018](https://huggingface.co/datasets/id_newspapers_2018). Preprocessing was done using the function from the [language modeling task guide](https://huggingface.co/docs/transformers/tasks/language_modeling#preprocess) with a block size of 4096. Each dataset was split using [`train_test_split`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split), allocating 5% as evaluation data.

## Training Procedure

The model was pretrained on a single RTX 3060 for 8 epochs (51474 steps) with an accumulated batch size of 128. Sequences were limited to 4096 tokens. The optimizer was AdamW with LR 1e-4, weight decay 0.01, learning rate warmup for the first 6% of steps (~3090 steps) and linear decay of the learning rate afterwards. Due to an early configuration mistake, the first 2 epochs used LR 1e-3 instead. Additional information can be found in the TensorBoard training logs.
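For reference, the sketch below illustrates the preprocessing and training setup described in the two sections above using the Hugging Face `Trainer` API. It is not the original training script (see the source code repository for that): the corpus is reduced to a single dataset for brevity, and the masking probability, per-device batch size and evaluation strategy are assumptions rather than documented values.

```py
from datasets import load_dataset
from transformers import (
    BigBirdConfig,
    BigBirdForMaskedLM,
    BigBirdTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

block_size = 4096

# The published tokenizer; the original run trained its own tokenizer on the full corpus
tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian')

# Single corpus for brevity; the actual run also used Indonesian Wikipedia and
# Indonesian Newspaper 2018 as described above
raw_dataset = load_dataset('oscar', 'unshuffled_deduplicated_id', split='train')

def tokenize(examples):
    return tokenizer(examples['text'])

def group_texts(examples):
    # Concatenate all tokenized texts, then cut them into chunks of `block_size` tokens
    # (same idea as the preprocessing function in the language modeling task guide)
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated['input_ids']) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

tokenized = raw_dataset.map(tokenize, batched=True, remove_columns=raw_dataset.column_names)
lm_dataset = tokenized.map(group_texts, batched=True)
split = lm_dataset.train_test_split(test_size=0.05)  # 5% held out as evaluation data

# Randomly initialised model with the published architecture
config = BigBirdConfig.from_pretrained('ilos-vigil/bigbird-small-indonesian')
model = BigBirdForMaskedLM(config)

# Standard MLM collator; 15% masking is the library default, not a documented value
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir='bigbird-small-indonesian',
    num_train_epochs=8,
    per_device_train_batch_size=1,     # assumption; only the accumulated batch size is documented
    gradient_accumulation_steps=128,   # accumulated batch size of 128
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_ratio=0.06,                 # LR warmup over the first ~6% of steps
    lr_scheduler_type='linear',
    evaluation_strategy='epoch',       # assumption; matches the per-epoch results below
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split['train'],
    eval_dataset=split['test'],
    data_collator=data_collator,
)
trainer.train()
```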
## Evaluation

The model achieved the following results during training evaluation.

| Epoch | Steps | Eval. loss | Eval. perplexity |
| ----- | ----- | ---------- | ---------------- |
| 1     | 6249  | 2.466      | 11.775           |
| 2     | 12858 | 2.265      | 9.631            |
| 3     | 19329 | 2.127      | 8.390            |
| 4     | 25758 | 2.116      | 8.298            |
| 5     | 32187 | 2.097      | 8.141            |
| 6     | 38616 | 2.087      | 8.061            |
| 7     | 45045 | 2.081      | 8.012            |
| 8     | 51474 | 2.078      | 7.988            |
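The evaluation perplexity reported here is simply the exponential of the evaluation loss, e.g. for the final checkpoint:

```py
import math

eval_loss = 2.078                      # evaluation loss after epoch 8
print(round(math.exp(eval_loss), 3))   # 7.988, the reported evaluation perplexity
```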