---
language: id
license: mit
datasets:
- oscar
- wikipedia
- id_newspapers_2018
widget:
- text: Saya [MASK] makan nasi goreng.
- text: Kucing itu sedang bermain dengan [MASK].
pipeline_tag: fill-mask
---

# Indonesian small BigBird model

## Source Code

Source code to create this model is available at [https://github.com/ilos-vigil/bigbird-small-indonesian](https://github.com/ilos-vigil/bigbird-small-indonesian).

## Downstream Task

* NLI/ZSC: [ilos-vigil/bigbird-small-indonesian-nli](https://huggingface.co/ilos-vigil/bigbird-small-indonesian-nli)

## Model Description

This **cased** model was pretrained with a Masked LM objective. It has ~30M parameters and was pretrained for 8 epochs (51474 steps), reaching an evaluation loss of 2.078 (perplexity 7.988). The architecture of the model is shown in the configuration snippet below. The tokenizer was trained on the whole dataset with a vocabulary size of 30K.

```py
from transformers import BigBirdConfig

config = BigBirdConfig(
    vocab_size=30_000,
    hidden_size=512,
    num_hidden_layers=4,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=4096,
    is_encoder_decoder=False,
    attention_type='block_sparse'
)
```
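As a quick sanity check on the parameter count, an untrained model can be instantiated from the configuration above and its parameters counted. This is a minimal sketch continuing from the `config` object in the snippet; use `from_pretrained` to load the actual released weights.

```py
from transformers import BigBirdForMaskedLM

# Build a randomly initialised model from the `config` object defined above
model = BigBirdForMaskedLM(config)

# Total number of parameters; roughly 30M for this configuration
print(sum(p.numel() for p in model.parameters()))
```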
## How to use

> Inference with Transformers pipeline (one MASK token)

```py
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='ilos-vigil/bigbird-small-indonesian')
>>> pipe('Saya sedang bermain [MASK] teman saya.')
[{'score': 0.7199566960334778, 'token': 14, 'token_str': 'dengan', 'sequence': 'Saya sedang bermain dengan teman saya.'},
 {'score': 0.12370546162128448, 'token': 17, 'token_str': 'untuk', 'sequence': 'Saya sedang bermain untuk teman saya.'},
 {'score': 0.0385284349322319, 'token': 331, 'token_str': 'bersama', 'sequence': 'Saya sedang bermain bersama teman saya.'},
 {'score': 0.012146958149969578, 'token': 28, 'token_str': 'oleh', 'sequence': 'Saya sedang bermain oleh teman saya.'},
 {'score': 0.009499032981693745, 'token': 25, 'token_str': 'sebagai', 'sequence': 'Saya sedang bermain sebagai teman saya.'}]
```

> Inference with PyTorch (one or multiple MASK tokens)

```py
import torch
from transformers import BigBirdTokenizerFast, BigBirdForMaskedLM
from pprint import pprint

tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian')
model = BigBirdForMaskedLM.from_pretrained('ilos-vigil/bigbird-small-indonesian')

topk = 5
text = 'Saya [MASK] bermain [MASK] teman saya.'

tokenized_text = tokenizer(text, return_tensors='pt')
raw_output = model(**tokenized_text)
tokenized_output = torch.topk(raw_output.logits, topk, dim=2).indices
score_output = torch.softmax(raw_output.logits, dim=2)

result = []
for position_idx in range(tokenized_text['input_ids'][0].shape[0]):
    # Collect the top-k predictions for every [MASK] position
    if tokenized_text['input_ids'][0][position_idx] == tokenizer.mask_token_id:
        outputs = []
        for token_idx in tokenized_output[0, position_idx]:
            output = {}
            output['score'] = score_output[0, position_idx, token_idx].item()
            output['token'] = token_idx.item()
            output['token_str'] = tokenizer.decode(output['token'])
            outputs.append(output)
        result.append(outputs)
pprint(result)
```

```py
[[{'score': 0.22353802621364594, 'token': 36, 'token_str': 'dapat'},
  {'score': 0.13962049782276154, 'token': 24, 'token_str': 'tidak'},
  {'score': 0.13610956072807312, 'token': 32, 'token_str': 'juga'},
  {'score': 0.0725034773349762, 'token': 584, 'token_str': 'bermain'},
  {'score': 0.033740025013685226, 'token': 38, 'token_str': 'akan'}],
 [{'score': 0.7111291885375977, 'token': 14, 'token_str': 'dengan'},
  {'score': 0.10754624754190445, 'token': 17, 'token_str': 'untuk'},
  {'score': 0.022657711058855057, 'token': 331, 'token_str': 'bersama'},
  {'score': 0.020862115547060966, 'token': 25, 'token_str': 'sebagai'},
  {'score': 0.013086902908980846, 'token': 11, 'token_str': 'di'}]]
```

## Limitations and bias

Due to its low parameter count and case-sensitive tokenizer/model, this model is expected to have lower performance on certain fine-tuned tasks. Like any language model, it also reflects biases from its training data, which comes from various sources. Here's an example of how the model can produce biased predictions:

```py
>>> pipe('Memasak dirumah adalah kewajiban seorang [MASK].')
[{'score': 0.16381049156188965, 'sequence': 'Memasak dirumah adalah kewajiban seorang budak.', 'token': 4910, 'token_str': 'budak'},
 {'score': 0.1334381103515625, 'sequence': 'Memasak dirumah adalah kewajiban seorang wanita.', 'token': 649, 'token_str': 'wanita'},
 {'score': 0.11588197946548462, 'sequence': 'Memasak dirumah adalah kewajiban seorang lelaki.', 'token': 6368, 'token_str': 'lelaki'},
 {'score': 0.061377108097076416, 'sequence': 'Memasak dirumah adalah kewajiban seorang diri.', 'token': 258, 'token_str': 'diri'},
 {'score': 0.04679233580827713, 'sequence': 'Memasak dirumah adalah kewajiban seorang gadis.', 'token': 6845, 'token_str': 'gadis'}]
```

## Training and evaluation data

This model was pretrained on [Indonesian Wikipedia](https://huggingface.co/datasets/wikipedia) (dump file from 2022-10-20), the `unshuffled_deduplicated_id` subset of [OSCAR](https://huggingface.co/datasets/oscar) and [Indonesian Newspaper 2018](https://huggingface.co/datasets/id_newspapers_2018). Preprocessing was done using the function from the [language modeling task guide](https://huggingface.co/docs/transformers/tasks/language_modeling#preprocess) with a block size of 4096. Each dataset was split using [`train_test_split`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split), allocating 5% as evaluation data.

## Training Procedure

The model was pretrained on a single RTX 3060 for 8 epochs (51474 steps) with an accumulated batch size of 128. Sequences were limited to 4096 tokens. The optimizer was AdamW with LR 1e-4, weight decay 0.01, learning rate warmup for the first 6% of steps (~3090 steps) and linear decay of the learning rate afterwards. Due to an early configuration mistake, the first 2 epochs used LR 1e-3 instead. Additional information can be found in the TensorBoard training logs.
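For reference, the sketch below illustrates the preprocessing and training setup described in the two sections above using the Hugging Face `Trainer` API. It is not the original training script (see the source code repository for that): the corpus is reduced to a single dataset for brevity, and the masking probability, per-device batch size and evaluation strategy are assumptions rather than documented values.

```py
from datasets import load_dataset
from transformers import (
    BigBirdConfig,
    BigBirdForMaskedLM,
    BigBirdTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

block_size = 4096

# The published tokenizer; the original run trained its own tokenizer on the full corpus
tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian')

# Single corpus for brevity; the actual run also used Indonesian Wikipedia and
# Indonesian Newspaper 2018 as described above
raw_dataset = load_dataset('oscar', 'unshuffled_deduplicated_id', split='train')

def tokenize(examples):
    return tokenizer(examples['text'])

def group_texts(examples):
    # Concatenate all tokenized texts, then cut them into chunks of `block_size` tokens
    # (same idea as the preprocessing function in the language modeling task guide)
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated['input_ids']) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

tokenized = raw_dataset.map(tokenize, batched=True, remove_columns=raw_dataset.column_names)
lm_dataset = tokenized.map(group_texts, batched=True)
split = lm_dataset.train_test_split(test_size=0.05)  # 5% held out as evaluation data

# Randomly initialised model with the published architecture
config = BigBirdConfig.from_pretrained('ilos-vigil/bigbird-small-indonesian')
model = BigBirdForMaskedLM(config)

# Standard MLM collator; 15% masking is the library default, not a documented value
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir='bigbird-small-indonesian',
    num_train_epochs=8,
    per_device_train_batch_size=1,     # assumption; only the accumulated batch size is documented
    gradient_accumulation_steps=128,   # accumulated batch size of 128
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_ratio=0.06,                 # LR warmup over the first ~6% of steps
    lr_scheduler_type='linear',
    evaluation_strategy='epoch',       # assumption; matches the per-epoch results below
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split['train'],
    eval_dataset=split['test'],
    data_collator=data_collator,
)
trainer.train()
```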
## Evaluation

The model achieved the following results during training evaluation.

| Epoch | Steps | Eval. loss | Eval. perplexity |
| ----- | ----- | ---------- | ---------------- |
| 1     | 6249  | 2.466      | 11.775           |
| 2     | 12858 | 2.265      | 9.631            |
| 3     | 19329 | 2.127      | 8.390            |
| 4     | 25758 | 2.116      | 8.298            |
| 5     | 32187 | 2.097      | 8.141            |
| 6     | 38616 | 2.087      | 8.061            |
| 7     | 45045 | 2.081      | 8.012            |
| 8     | 51474 | 2.078      | 7.988            |
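The evaluation perplexity reported here is simply the exponential of the evaluation loss, e.g. for the final checkpoint:

```py
import math

eval_loss = 2.078                      # evaluation loss after epoch 8
print(round(math.exp(eval_loss), 3))   # 7.988, the reported evaluation perplexity
```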