ilos-vigil committed on
Commit 3f0d9f8
1 Parent(s): a397d92

Upload checkpoint 8 model and tensorboard training logs

README.md CHANGED
@@ -2,23 +2,27 @@
  language: id
  license: mit
  datasets:
- - oscar
- - wikipedia
- - id_newspapers_2018
  widget:
- - text: "Saya [MASK] makan nasi goreng."
- - text: "Kucing itu sedang bermain dengan [MASK]."
  ---

  # Indonesian small BigBird model

- **Disclaimer:** This is work in progress. Current checkpoint is trained with ~7.0 epoch/45150 steps with 2.081 eval loss. Newer checkpoint and additional information will be added in the future.

  ## Model Description

- This model was pretrained **only** with Masked LM objective. Architecture of this model is shown in the configuration snippet below. The tokenizer was trained with whole **cased** dataset with **only** 30K vocabulary size.

  ```py
  config = BigBirdConfig(
      vocab_size = 30_000,
      hidden_size = 512,
@@ -33,11 +37,106 @@ config = BigBirdConfig(

  ## How to use

- > TBD

- ## Limitations and bias

- > TBD

  ## Training and evaluation data

@@ -45,9 +144,19 @@ This model was pretrained with [Indonesian Wikipedia](https://huggingface.co/dat

  ## Training Procedure

- > TBD
-
  ## Evaluation

- > TBD
-
  language: id
  license: mit
  datasets:
+ - oscar
+ - wikipedia
+ - id_newspapers_2018
  widget:
+ - text: "Saya [MASK] makan nasi goreng."
+ - text: "Kucing itu sedang bermain dengan [MASK]."
  ---

  # Indonesian small BigBird model

+ ## Source Code
+
+ Source code to create this model is available at [https://github.com/ilos-vigil/bigbird-small-indonesian](https://github.com/ilos-vigil/bigbird-small-indonesian).

  ## Model Description

+ This **cased** model was pretrained with the Masked LM objective. It has ~30M parameters and was trained for 8 epochs/51474 steps, reaching an eval loss of 2.078 (7.988 perplexity). The architecture of this model is shown in the configuration snippet below. The tokenizer was trained on the whole dataset with a 30K vocabulary size.

  ```py
+ from transformers import BigBirdConfig
+
  config = BigBirdConfig(
      vocab_size = 30_000,
      hidden_size = 512,

  ## How to use

+ > Inference with Transformers pipeline (single [MASK] token)
+
+ ```py
+ >>> from transformers import pipeline
+ >>> pipe = pipeline(task='fill-mask', model='ilos-vigil/bigbird-small-indonesian')
+ >>> pipe('Saya sedang bermain [MASK] teman saya.')
+ [{'score': 0.7199566960334778,
+   'token': 14,
+   'token_str': 'dengan',
+   'sequence': 'Saya sedang bermain dengan teman saya.'},
+  {'score': 0.12370546162128448,
+   'token': 17,
+   'token_str': 'untuk',
+   'sequence': 'Saya sedang bermain untuk teman saya.'},
+  {'score': 0.0385284349322319,
+   'token': 331,
+   'token_str': 'bersama',
+   'sequence': 'Saya sedang bermain bersama teman saya.'},
+  {'score': 0.012146958149969578,
+   'token': 28,
+   'token_str': 'oleh',
+   'sequence': 'Saya sedang bermain oleh teman saya.'},
+  {'score': 0.009499032981693745,
+   'token': 25,
+   'token_str': 'sebagai',
+   'sequence': 'Saya sedang bermain sebagai teman saya.'}]
+ ```

+ > Inference with PyTorch (one or more [MASK] tokens)

+ ```py
+ import torch
+ from transformers import BigBirdTokenizerFast, BigBirdForMaskedLM
+ from pprint import pprint
+
+ tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian')
+ model = BigBirdForMaskedLM.from_pretrained('ilos-vigil/bigbird-small-indonesian')
+ topk = 5
+ text = 'Saya [MASK] bermain [MASK] teman saya.'
+
+ # Tokenize the text, run one forward pass and turn the logits into top-k indices and softmax scores
+ tokenized_text = tokenizer(text, return_tensors='pt')
+ raw_output = model(**tokenized_text)
+ tokenized_output = torch.topk(raw_output.logits, topk, dim=2).indices
+ score_output = torch.softmax(raw_output.logits, dim=2)
+
+ # Collect the top-k candidate tokens for every [MASK] position in the input
+ result = []
+ for position_idx in range(tokenized_text['input_ids'][0].shape[0]):
+     if tokenized_text['input_ids'][0][position_idx] == tokenizer.mask_token_id:
+         outputs = []
+         for token_idx in tokenized_output[0, position_idx]:
+             output = {}
+             output['score'] = score_output[0, position_idx, token_idx].item()
+             output['token'] = token_idx.item()
+             output['token_str'] = tokenizer.decode(output['token'])
+             outputs.append(output)
+         result.append(outputs)
+
+ pprint(result)
+ ```
+
+ ```py
+ [[{'score': 0.22353802621364594, 'token': 36, 'token_str': 'dapat'},
+   {'score': 0.13962049782276154, 'token': 24, 'token_str': 'tidak'},
+   {'score': 0.13610956072807312, 'token': 32, 'token_str': 'juga'},
+   {'score': 0.0725034773349762, 'token': 584, 'token_str': 'bermain'},
+   {'score': 0.033740025013685226, 'token': 38, 'token_str': 'akan'}],
+  [{'score': 0.7111291885375977, 'token': 14, 'token_str': 'dengan'},
+   {'score': 0.10754624754190445, 'token': 17, 'token_str': 'untuk'},
+   {'score': 0.022657711058855057, 'token': 331, 'token_str': 'bersama'},
+   {'score': 0.020862115547060966, 'token': 25, 'token_str': 'sebagai'},
+   {'score': 0.013086902908980846, 'token': 11, 'token_str': 'di'}]]
+ ```
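+
+ A small optional addition, not part of the original snippet: when the model is used for inference only, the forward pass can run under `torch.no_grad()` to skip gradient tracking.
+
+ ```py
+ with torch.no_grad():  # optional: disables autograd bookkeeping during inference
+     raw_output = model(**tokenized_text)
+ ```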
+
+ ## Limitations and bias
+
+ Due to its low parameter count and case-sensitive tokenizer/model, this model is expected to have low performance on certain fine-tuned tasks. Like any language model, it reflects biases from its training dataset, which comes from various sources. Here's an example of how the model can produce biased predictions:
+
+ ```py
+ >>> pipe('Memasak dirumah adalah kewajiban seorang [MASK].')
+ [{'score': 0.16381049156188965,
+   'sequence': 'Memasak dirumah adalah kewajiban seorang budak.',
+   'token': 4910,
+   'token_str': 'budak'},
+  {'score': 0.1334381103515625,
+   'sequence': 'Memasak dirumah adalah kewajiban seorang wanita.',
+   'token': 649,
+   'token_str': 'wanita'},
+  {'score': 0.11588197946548462,
+   'sequence': 'Memasak dirumah adalah kewajiban seorang lelaki.',
+   'token': 6368,
+   'token_str': 'lelaki'},
+  {'score': 0.061377108097076416,
+   'sequence': 'Memasak dirumah adalah kewajiban seorang diri.',
+   'token': 258,
+   'token_str': 'diri'},
+  {'score': 0.04679233580827713,
+   'sequence': 'Memasak dirumah adalah kewajiban seorang gadis.',
+   'token': 6845,
+   'token_str': 'gadis'}]
+ ```

  ## Training and evaluation data

  ## Training Procedure

+ The model was pretrained on a single RTX 3060 for 8 epochs/51474 steps with an accumulated batch size of 128. Sequences were limited to 4096 tokens. The optimizer is AdamW with LR 1e-4, weight decay 0.01, learning rate warmup over the first 6% of steps (~3090 steps), and linear decay of the learning rate afterwards. However, due to an early configuration mistake, the first 2 epochs used LR 1e-3 instead. Additional information can be seen in the Tensorboard training logs.
+
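+ As a rough, hypothetical sketch of that setup with `transformers.TrainingArguments` (the exact script lives in the GitHub repository linked above; the output path and the split of the accumulated batch size into per-device batch size × accumulation steps are illustrative assumptions):
+
+ ```py
+ from transformers import TrainingArguments
+
+ # Hypothetical sketch only; not the exact training configuration.
+ training_args = TrainingArguments(
+     output_dir='./bigbird-small-indonesian',   # illustrative path
+     num_train_epochs=8,
+     per_device_train_batch_size=4,             # assumption: 4 x 32 = accumulated batch size 128
+     gradient_accumulation_steps=32,
+     learning_rate=1e-4,                        # first 2 epochs accidentally used 1e-3
+     weight_decay=0.01,
+     warmup_ratio=0.06,                         # warmup over the first ~6% of steps
+     lr_scheduler_type='linear',                # linear decay after warmup
+     evaluation_strategy='epoch',
+ )
+ ```
+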
  ## Evaluation

+ The model achieved the following results during training evaluation.
+
+ | Epoch | Steps | Eval. loss | Eval. perplexity |
+ | ----- | ----- | ---------- | ---------------- |
+ | 1     | 6249  | 2.466      | 11.775           |
+ | 2     | 12858 | 2.265      | 9.631            |
+ | 3     | 19329 | 2.127      | 8.390            |
+ | 4     | 25758 | 2.116      | 8.298            |
+ | 5     | 32187 | 2.097      | 8.141            |
+ | 6     | 38616 | 2.087      | 8.061            |
+ | 7     | 45045 | 2.081      | 8.012            |
+ | 8     | 51474 | 2.078      | 7.988            |
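+
+ For reference, the perplexity column is the exponential of the eval loss, which is consistent with all rows above; a minimal check in Python:
+
+ ```py
+ import math
+
+ # e.g. the final checkpoint: exp(2.078) ≈ 7.988
+ print(math.exp(2.078))
+ ```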
config.json CHANGED
@@ -1,4 +1,5 @@
  {
    "architectures": [
      "BigBirdForMaskedLM"
    ],
  {
+   "_name_or_path": "/mnt/encrypted_database/sum_nlp/checkpoint-model-bigbird-small-indonesian/checkpoint-12900-only-model",
    "architectures": [
      "BigBirdForMaskedLM"
    ],
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:41566374480eadd6f8e5968f7373e409df1616d0cd7b77847f4d208d32df8ed7
  size 122558078
  version https://git-lfs.github.com/spec/v1
+ oid sha256:7bc9c9edd2ba57a1c7daf77bdd003806a0857b1515a023f137b483e9fcfc0837
  size 122558078
runs/joined_logs/events.out.tfevents.1671528643.pop-os.46984.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:201771a39450395ab8c900f433586bd0a487438d9c0dfdd8db0c28a45c3b2c07
+ size 316301