---
language: id
license: mit
datasets:
- oscar
- wikipedia
- id_newspapers_2018
widget:
- text: Saya [MASK] makan nasi goreng.
- text: Kucing itu sedang bermain dengan [MASK].
pipeline_tag: fill-mask
---
# Indonesian small BigBird model
## Source Code
Source code to create this model is available at [https://github.com/ilos-vigil/bigbird-small-indonesian](https://github.com/ilos-vigil/bigbird-small-indonesian).
## Downstream Task
* NLI/ZSC: [ilos-vigil/bigbird-small-indonesian-nli](https://huggingface.co/ilos-vigil/bigbird-small-indonesian-nli)
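The NLI checkpoint above can also be plugged into the `zero-shot-classification` pipeline. The snippet below is only a minimal sketch; the example sentence, candidate labels and Indonesian hypothesis template are illustrative assumptions, not values documented for that model.
```py
from transformers import pipeline

# Illustrative only: labels and hypothesis template are assumptions.
classifier = pipeline(
    task='zero-shot-classification',
    model='ilos-vigil/bigbird-small-indonesian-nli'
)
classifier(
    'Saya suka menonton pertandingan sepak bola.',
    candidate_labels=['olahraga', 'politik', 'kuliner'],
    hypothesis_template='Kalimat ini tentang {}.'
)
```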
## Model Description
This **cased** model has been pretrained with the Masked LM objective. It has ~30M parameters and was pretrained for 8 epochs/51474 steps, reaching an evaluation loss of 2.078 (7.988 perplexity). The architecture of this model is shown in the configuration snippet below. The tokenizer was trained on the whole dataset with a 30K vocabulary size.
```py
from transformers import BigBirdConfig

config = BigBirdConfig(
    vocab_size=30_000,
    hidden_size=512,
    num_hidden_layers=4,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=4096,
    is_encoder_decoder=False,
    attention_type='block_sparse'
)
```
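The actual tokenizer training script is in the linked repository. As a rough sketch, a 30K-vocabulary SentencePiece model (the format BigBird tokenizers expect) could be trained like this, where `corpus.txt` is a placeholder for a plain-text dump of the combined datasets:
```py
import sentencepiece as spm
from transformers import BigBirdTokenizer

# 'corpus.txt' is a placeholder for the concatenated plain-text training data.
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='spiece',
    vocab_size=30_000,
    model_type='unigram'  # assumption; not stated in this card
)
tokenizer = BigBirdTokenizer(vocab_file='spiece.model')
```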
## How to use
> Inference with Transformers pipeline (one MASK token)
```py
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='ilos-vigil/bigbird-small-indonesian')
>>> pipe('Saya sedang bermain [MASK] teman saya.')
[{'score': 0.7199566960334778,
  'token': 14,
  'token_str': 'dengan',
  'sequence': 'Saya sedang bermain dengan teman saya.'},
 {'score': 0.12370546162128448,
  'token': 17,
  'token_str': 'untuk',
  'sequence': 'Saya sedang bermain untuk teman saya.'},
 {'score': 0.0385284349322319,
  'token': 331,
  'token_str': 'bersama',
  'sequence': 'Saya sedang bermain bersama teman saya.'},
 {'score': 0.012146958149969578,
  'token': 28,
  'token_str': 'oleh',
  'sequence': 'Saya sedang bermain oleh teman saya.'},
 {'score': 0.009499032981693745,
  'token': 25,
  'token_str': 'sebagai',
  'sequence': 'Saya sedang bermain sebagai teman saya.'}]
```
> Inference with PyTorch (one or multiple MASK tokens)
```py
import torch
from transformers import BigBirdTokenizerFast, BigBirdForMaskedLM
from pprint import pprint

tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian')
model = BigBirdForMaskedLM.from_pretrained('ilos-vigil/bigbird-small-indonesian')

topk = 5
text = 'Saya [MASK] bermain [MASK] teman saya.'

tokenized_text = tokenizer(text, return_tensors='pt')
raw_output = model(**tokenized_text)
# Top-k token ids and softmax scores for every position in the sequence
tokenized_output = torch.topk(raw_output.logits, topk, dim=2).indices
score_output = torch.softmax(raw_output.logits, dim=2)

result = []
for position_idx in range(tokenized_text['input_ids'][0].shape[0]):
    # Only collect predictions for [MASK] positions
    if tokenized_text['input_ids'][0][position_idx] == tokenizer.mask_token_id:
        outputs = []
        for token_idx in tokenized_output[0, position_idx]:
            output = {}
            output['score'] = score_output[0, position_idx, token_idx].item()
            output['token'] = token_idx.item()
            output['token_str'] = tokenizer.decode(output['token'])
            outputs.append(output)
        result.append(outputs)
pprint(result)
```
```py
[[{'score': 0.22353802621364594, 'token': 36, 'token_str': 'dapat'},
{'score': 0.13962049782276154, 'token': 24, 'token_str': 'tidak'},
{'score': 0.13610956072807312, 'token': 32, 'token_str': 'juga'},
{'score': 0.0725034773349762, 'token': 584, 'token_str': 'bermain'},
{'score': 0.033740025013685226, 'token': 38, 'token_str': 'akan'}],
[{'score': 0.7111291885375977, 'token': 14, 'token_str': 'dengan'},
{'score': 0.10754624754190445, 'token': 17, 'token_str': 'untuk'},
{'score': 0.022657711058855057, 'token': 331, 'token_str': 'bersama'},
{'score': 0.020862115547060966, 'token': 25, 'token_str': 'sebagai'},
{'score': 0.013086902908980846, 'token': 11, 'token_str': 'di'}]]
```
## Limitations and bias
Due to the low parameter count and the case-sensitive tokenizer/model, this model is expected to have low performance on certain fine-tuned tasks. Like any language model, it reflects biases from its training data, which comes from various sources. Here's an example of how the model can produce biased predictions:
```py
>>> pipe('Memasak dirumah adalah kewajiban seorang [MASK].')
[{'score': 0.16381049156188965,
'sequence': 'Memasak dirumah adalah kewajiban seorang budak.',
'token': 4910,
'token_str': 'budak'},
{'score': 0.1334381103515625,
'sequence': 'Memasak dirumah adalah kewajiban seorang wanita.',
'token': 649,
'token_str': 'wanita'},
{'score': 0.11588197946548462,
'sequence': 'Memasak dirumah adalah kewajiban seorang lelaki.',
'token': 6368,
'token_str': 'lelaki'},
{'score': 0.061377108097076416,
'sequence': 'Memasak dirumah adalah kewajiban seorang diri.',
'token': 258,
'token_str': 'diri'},
{'score': 0.04679233580827713,
'sequence': 'Memasak dirumah adalah kewajiban seorang gadis.',
'token': 6845,
'token_str': 'gadis'}]
```
## Training and evaluation data
This model was pretrained with [Indonesian Wikipedia](https://huggingface.co/datasets/wikipedia) using the dump file from 2022-10-20, the `unshuffled_deduplicated_id` subset of [OSCAR](https://huggingface.co/datasets/oscar) and [Indonesian Newspapers 2018](https://huggingface.co/datasets/id_newspapers_2018). Preprocessing was done using the function from [task guides - language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling#preprocess) with a 4096-token block size. Each dataset was split using [`train_test_split`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) with 5% allocated as evaluation data.
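For reference, the preprocessing roughly follows the pattern below. This is a sketch based on the linked task guide, not the original training script; it is shown for one of the three datasets, and the `content` column name is an assumption.
```py
from datasets import load_dataset
from transformers import BigBirdTokenizerFast

tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian')
block_size = 4096

# Shown for one dataset only; the same steps apply to the Wikipedia and OSCAR subsets.
dataset = load_dataset('id_newspapers_2018', split='train')

def tokenize(examples):
    # 'content' is assumed to be the text column of this dataset.
    return tokenizer(examples['content'])

def group_texts(examples):
    # Concatenate all tokenized texts, then slice them into fixed 4096-token blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated['input_ids']) // block_size) * block_size
    return {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
lm_dataset = tokenized.map(group_texts, batched=True)
splits = lm_dataset.train_test_split(test_size=0.05)  # 5% held out for evaluation
```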
## Training Procedure
The model was pretrained on a single RTX 3060 for 8 epochs/51474 steps with an accumulated batch size of 128. The sequence length was limited to 4096 tokens. The optimizer used was AdamW with LR 1e-4, weight decay 0.01, learning rate warmup for the first 6% of steps (~3090 steps) and linear decay of the learning rate afterwards. However, due to an early configuration mistake, the first 2 epochs used LR 1e-3 instead. Additional information can be found in the Tensorboard training logs.
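A hedged reconstruction of this setup with the Hugging Face `Trainer` could look roughly like the sketch below. Only the hyperparameters named above come from this card; the per-device batch size/gradient-accumulation split and the 15% masking probability are assumptions.
```py
from transformers import (
    BigBirdForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# 'config', 'tokenizer' and 'splits' come from the earlier snippets.
model = BigBirdForMaskedLM(config)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)  # 0.15 is an assumption

args = TrainingArguments(
    output_dir='bigbird-small-indonesian',
    num_train_epochs=8,
    per_device_train_batch_size=2,    # assumption; 2 x 64 gives the accumulated batch size of 128
    gradient_accumulation_steps=64,
    learning_rate=1e-4,               # note the LR 1e-3 mistake in the first 2 epochs mentioned above
    weight_decay=0.01,
    warmup_ratio=0.06,                # ~6% warmup (~3090 steps), linear decay afterwards
    lr_scheduler_type='linear',
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits['train'],
    eval_dataset=splits['test'],
    data_collator=collator,
)
trainer.train()
```
The default optimizer of `Trainer` is AdamW, which matches the optimizer stated above.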
## Evaluation
The model achieved the following results during training evaluation.
| Epoch | Steps | Eval. loss | Eval. perplexity |
| ----- | ----- | ---------- | ---------------- |
| 1 | 6249 | 2.466 | 11.775 |
| 2 | 12858 | 2.265 | 9.631 |
| 3 | 19329 | 2.127 | 8.390 |
| 4 | 25758 | 2.116 | 8.298 |
| 5 | 32187 | 2.097 | 8.141 |
| 6 | 38616 | 2.087 | 8.061 |
| 7 | 45045 | 2.081 | 8.012 |
| 8 | 51474 | 2.078 | 7.988 |
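
Perplexity here is simply the exponential of the evaluation loss, e.g. for the final checkpoint:
```py
import math

math.exp(2.078)  # ≈ 7.988, the reported perplexity after epoch 8
```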