---
language: id
license: mit
datasets:
- oscar
- wikipedia
- id_newspapers_2018
widget:
- text: Saya [MASK] makan nasi goreng.
- text: Kucing itu sedang bermain dengan [MASK].
pipeline_tag: fill-mask
---

# Indonesian small BigBird model

## Source Code

Source code to create this model is available at [https://github.com/ilos-vigil/bigbird-small-indonesian](https://github.com/ilos-vigil/bigbird-small-indonesian).

## Downstream Task

* NLI/ZSC: [ilos-vigil/bigbird-small-indonesian-nli](https://huggingface.co/ilos-vigil/bigbird-small-indonesian-nli)

## Model Description

This **cased** model has been pretrained with a masked language modeling (MLM) objective. It has ~30M parameters and was pretrained for 8 epochs (51,474 steps), reaching a final evaluation loss of 2.078 (perplexity 7.988). The architecture of this model is shown in the configuration snippet below. The tokenizer was trained on the whole dataset with a 30K vocabulary size.

```py
from transformers import BigBirdConfig

config = BigBirdConfig(
    vocab_size = 30_000,
    hidden_size = 512,
    num_hidden_layers = 4,
    num_attention_heads = 8,
    intermediate_size = 2048,
    max_position_embeddings = 4096,
    is_encoder_decoder=False,
    attention_type='block_sparse'
)
```
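
As a quick sanity check on the ~30M figure, the parameter count can be reproduced by instantiating an untrained model from this configuration. This is a minimal sketch; the exact number depends on the Transformers version and on tied embeddings being counted once (the default).

```py
from transformers import BigBirdForMaskedLM

# instantiate an untrained model from the `config` defined above
model = BigBirdForMaskedLM(config)

# count unique trainable parameters; this should land around ~30M
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'{n_params / 1e6:.1f}M parameters')
```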

## How to use

> Inference with the Transformers pipeline (single `[MASK]` token)

```py
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='ilos-vigil/bigbird-small-indonesian')
>>> pipe('Saya sedang bermain [MASK] teman saya.')
[{'score': 0.7199566960334778,
  'token': 14,
  'token_str': 'dengan',
  'sequence': 'Saya sedang bermain dengan teman saya.'},
 {'score': 0.12370546162128448,
  'token': 17,
  'token_str': 'untuk',
  'sequence': 'Saya sedang bermain untuk teman saya.'},
 {'score': 0.0385284349322319,
  'token': 331,
  'token_str': 'bersama',
  'sequence': 'Saya sedang bermain bersama teman saya.'},
 {'score': 0.012146958149969578,
  'token': 28,
  'token_str': 'oleh',
  'sequence': 'Saya sedang bermain oleh teman saya.'},
 {'score': 0.009499032981693745,
  'token': 25,
  'token_str': 'sebagai',
  'sequence': 'Saya sedang bermain sebagai teman saya.'}]
```

> Inference with PyTorch (one or more `[MASK]` tokens)

```py
import torch
from transformers import BigBirdTokenizerFast, BigBirdForMaskedLM
from pprint import pprint

tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian')
model = BigBirdForMaskedLM.from_pretrained('ilos-vigil/bigbird-small-indonesian')
topk = 5
text = 'Saya [MASK] bermain [MASK] teman saya.'

# tokenize the input and run a single forward pass
tokenized_text = tokenizer(text, return_tensors='pt')
raw_output = model(**tokenized_text)
# top-k token ids and softmax probabilities over the vocabulary at every position
tokenized_output = torch.topk(raw_output.logits, topk, dim=2).indices
score_output = torch.softmax(raw_output.logits, dim=2)

# collect the top-k predictions at each [MASK] position
result = []
for position_idx in range(tokenized_text['input_ids'][0].shape[0]):
    if tokenized_text['input_ids'][0][position_idx] == tokenizer.mask_token_id:
        outputs = []
        for token_idx in tokenized_output[0, position_idx]:
            output = {}
            output['score'] = score_output[0, position_idx, token_idx].item()
            output['token'] = token_idx.item()
            output['token_str'] = tokenizer.decode(output['token'])
            outputs.append(output)
        result.append(outputs)

pprint(result)
```

```py
[[{'score': 0.22353802621364594, 'token': 36, 'token_str': 'dapat'},
  {'score': 0.13962049782276154, 'token': 24, 'token_str': 'tidak'},
  {'score': 0.13610956072807312, 'token': 32, 'token_str': 'juga'},
  {'score': 0.0725034773349762, 'token': 584, 'token_str': 'bermain'},
  {'score': 0.033740025013685226, 'token': 38, 'token_str': 'akan'}],
 [{'score': 0.7111291885375977, 'token': 14, 'token_str': 'dengan'},
  {'score': 0.10754624754190445, 'token': 17, 'token_str': 'untuk'},
  {'score': 0.022657711058855057, 'token': 331, 'token_str': 'bersama'},
  {'score': 0.020862115547060966, 'token': 25, 'token_str': 'sebagai'},
  {'score': 0.013086902908980846, 'token': 11, 'token_str': 'di'}]]
```

## Limitations and bias

Due to its low parameter count and case-sensitive tokenizer/model, this model is expected to perform poorly on certain fine-tuned tasks. Like any language model, it also reflects biases present in the training data, which comes from various sources. Here's an example of the model producing biased predictions:

```py
>>> pipe('Memasak dirumah adalah kewajiban seorang [MASK].')
[{'score': 0.16381049156188965,
  'sequence': 'Memasak dirumah adalah kewajiban seorang budak.',
  'token': 4910,
  'token_str': 'budak'},
 {'score': 0.1334381103515625,
  'sequence': 'Memasak dirumah adalah kewajiban seorang wanita.',
  'token': 649,
  'token_str': 'wanita'},
 {'score': 0.11588197946548462,
  'sequence': 'Memasak dirumah adalah kewajiban seorang lelaki.',
  'token': 6368,
  'token_str': 'lelaki'},
 {'score': 0.061377108097076416,
  'sequence': 'Memasak dirumah adalah kewajiban seorang diri.',
  'token': 258,
  'token_str': 'diri'},
 {'score': 0.04679233580827713,
  'sequence': 'Memasak dirumah adalah kewajiban seorang gadis.',
  'token': 6845,
  'token_str': 'gadis'}]
```

## Training and evaluation data

This model was pretrained on [Indonesian Wikipedia](https://huggingface.co/datasets/wikipedia) (dump file from 2022-10-20), the `unshuffled_deduplicated_id` subset of [OSCAR](https://huggingface.co/datasets/oscar) and [Indonesian Newspaper 2018](https://huggingface.co/datasets/id_newspapers_2018). Preprocessing is done using the function from [task guides - language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling#preprocess) with a block size of 4096. Each dataset is split using [`train_test_split`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) with 5% allocated as evaluation data.
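
The preprocessing and split can be reproduced roughly as follows. This is a minimal sketch, not the author's exact script: only the OSCAR subset is shown (Wikipedia and the newspaper corpus are handled the same way), and the tokenizer is loaded from the published checkpoint for convenience.

```py
from datasets import load_dataset
from transformers import BigBirdTokenizerFast

block_size = 4096
tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian')

def group_texts(examples):
    # concatenate tokenized texts, then cut them into fixed 4096-token blocks
    # (same idea as the grouping function in the linked task guide)
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated['input_ids']) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# one of the three pretraining corpora; the other two are processed the same way
oscar = load_dataset('oscar', 'unshuffled_deduplicated_id', split='train')
tokenized = oscar.map(
    lambda batch: tokenizer(batch['text']),
    batched=True,
    remove_columns=oscar.column_names,
)
lm_dataset = tokenized.map(group_texts, batched=True)

# hold out 5% of each dataset as evaluation data
lm_dataset = lm_dataset.train_test_split(test_size=0.05)
```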

## Training Procedure

The model was pretrained on a single RTX 3060 for 8 epochs (51,474 steps) with an accumulated batch size of 128. Sequences were limited to 4096 tokens. The optimizer is AdamW with LR 1e-4, weight decay 0.01, learning rate warmup for the first 6% of steps (~3090 steps) and linear decay of the learning rate afterwards. Due to an early configuration mistake, the first 2 epochs used LR 1e-3 instead. Additional information can be found in the TensorBoard training logs.
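
In Hugging Face `Trainer` terms, this setup corresponds roughly to the sketch below. The per-device batch size / gradient accumulation split of the effective batch size of 128, the 15% masking probability and the fp16 flag are assumptions, not taken from the source repository; `config` and `lm_dataset` refer to the earlier snippets in this card.

```py
from transformers import (
    BigBirdForMaskedLM,
    BigBirdTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian')
model = BigBirdForMaskedLM(config)  # `config` from the snippet above

training_args = TrainingArguments(
    output_dir='bigbird-small-indonesian',
    num_train_epochs=8,
    per_device_train_batch_size=8,      # 8 x 16 = effective batch size 128 (split is an assumption)
    gradient_accumulation_steps=16,
    learning_rate=1e-4,                 # note: the first 2 epochs were actually run with 1e-3
    weight_decay=0.01,
    warmup_ratio=0.06,                  # ~3090 warmup steps out of 51,474
    lr_scheduler_type='linear',
    evaluation_strategy='epoch',
    fp16=True,                          # assumption; plausible on an RTX 3060
)

# labels are created on the fly by masking 15% of tokens (assumed default)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset['train'],
    eval_dataset=lm_dataset['test'],
    data_collator=data_collator,
)
trainer.train()
```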

## Evaluation

The model achieved the following results during training evaluation.

| Epoch | Steps | Eval. loss | Eval. perplexity |
| ----- | ----- | ---------- | ---------------- |
| 1     | 6249  | 2.466      | 11.775           |
| 2     | 12858 | 2.265      | 9.631            |
| 3     | 19329 | 2.127      | 8.390            |
| 4     | 25758 | 2.116      | 8.298            |
| 5     | 32187 | 2.097      | 8.141            |
| 6     | 38616 | 2.087      | 8.061            |
| 7     | 45045 | 2.081      | 8.012            |
| 8     | 51474 | 2.078      | 7.988            |
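
Perplexity here is simply the exponential of the evaluation loss, e.g. for the final checkpoint:

```py
import math

eval_loss = 2.078
print(round(math.exp(eval_loss), 3))  # 7.988, matching the last row of the table
```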