ilos-vigil committed on
Commit 3f0d9f8
1 Parent(s): a397d92

Upload checkpoint 8 model and tensorboard training logs

README.md CHANGED
@@ -2,23 +2,27 @@
  language: id
  license: mit
  datasets:
- - oscar
- - wikipedia
- - id_newspapers_2018
  widget:
- - text: "Saya [MASK] makan nasi goreng."
- - text: "Kucing itu sedang bermain dengan [MASK]."
  ---

  # Indonesian small BigBird model

- **Disclaimer:** This is work in progress. Current checkpoint is trained with ~7.0 epoch/45150 steps with 2.081 eval loss. Newer checkpoint and additional information will be added in the future.

  ## Model Description

- This model was pretrained **only** with Masked LM objective. Architecture of this model is shown in the configuration snippet below. The tokenizer was trained with whole **cased** dataset with **only** 30K vocabulary size.

  ```py
  config = BigBirdConfig(
      vocab_size = 30_000,
      hidden_size = 512,
@@ -33,11 +37,106 @@ config = BigBirdConfig(

  ## How to use

- > TBD

- ## Limitations and bias

- > TBD

  ## Training and evaluation data

@@ -45,9 +144,19 @@ This model was pretrained with [Indonesian Wikipedia](https://huggingface.co/dat

  ## Training Procedure

- > TBD
-
  ## Evaluation

- > TBD
-
  language: id
  license: mit
  datasets:
+ - oscar
+ - wikipedia
+ - id_newspapers_2018
  widget:
+ - text: "Saya [MASK] makan nasi goreng."
+ - text: "Kucing itu sedang bermain dengan [MASK]."
  ---

  # Indonesian small BigBird model

+ ## Source Code
+
+ Source code to create this model is available at [https://github.com/ilos-vigil/bigbird-small-indonesian](https://github.com/ilos-vigil/bigbird-small-indonesian).

  ## Model Description

+ This **cased** model was pretrained with the Masked LM objective. It has ~30M parameters and was trained for 8 epochs/51474 steps, reaching an eval loss of 2.078 (7.988 perplexity). The architecture of this model is shown in the configuration snippet below. The tokenizer was trained on the whole dataset with a 30K vocabulary size.

  ```py
+ from transformers import BigBirdConfig
+
  config = BigBirdConfig(
      vocab_size = 30_000,
      hidden_size = 512,

  ## How to use

+ > Inference with Transformers pipeline (single [MASK] token)
+
+ ```py
+ >>> from transformers import pipeline
+ >>> pipe = pipeline(task='fill-mask', model='ilos-vigil/bigbird-small-indonesian')
+ >>> pipe('Saya sedang bermain [MASK] teman saya.')
+ [{'score': 0.7199566960334778,
+   'token': 14,
+   'token_str': 'dengan',
+   'sequence': 'Saya sedang bermain dengan teman saya.'},
+  {'score': 0.12370546162128448,
+   'token': 17,
+   'token_str': 'untuk',
+   'sequence': 'Saya sedang bermain untuk teman saya.'},
+  {'score': 0.0385284349322319,
+   'token': 331,
+   'token_str': 'bersama',
+   'sequence': 'Saya sedang bermain bersama teman saya.'},
+  {'score': 0.012146958149969578,
+   'token': 28,
+   'token_str': 'oleh',
+   'sequence': 'Saya sedang bermain oleh teman saya.'},
+  {'score': 0.009499032981693745,
+   'token': 25,
+   'token_str': 'sebagai',
+   'sequence': 'Saya sedang bermain sebagai teman saya.'}]
+ ```

+ > Inference with PyTorch (one or more [MASK] tokens)

+ ```py
+ import torch
+ from transformers import BigBirdTokenizerFast, BigBirdForMaskedLM
+ from pprint import pprint
+
+ tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian')
+ model = BigBirdForMaskedLM.from_pretrained('ilos-vigil/bigbird-small-indonesian')
+ topk = 5
+ text = 'Saya [MASK] bermain [MASK] teman saya.'
+
+ # Tokenize the text, run one forward pass and turn the logits into top-k indices and softmax scores
+ tokenized_text = tokenizer(text, return_tensors='pt')
+ raw_output = model(**tokenized_text)
+ tokenized_output = torch.topk(raw_output.logits, topk, dim=2).indices
+ score_output = torch.softmax(raw_output.logits, dim=2)
+
+ # Collect the top-k candidate tokens for every [MASK] position in the input
+ result = []
+ for position_idx in range(tokenized_text['input_ids'][0].shape[0]):
+     if tokenized_text['input_ids'][0][position_idx] == tokenizer.mask_token_id:
+         outputs = []
+         for token_idx in tokenized_output[0, position_idx]:
+             output = {}
+             output['score'] = score_output[0, position_idx, token_idx].item()
+             output['token'] = token_idx.item()
+             output['token_str'] = tokenizer.decode(output['token'])
+             outputs.append(output)
+         result.append(outputs)
+
+ pprint(result)
+ ```
+
+ ```py
+ [[{'score': 0.22353802621364594, 'token': 36, 'token_str': 'dapat'},
+   {'score': 0.13962049782276154, 'token': 24, 'token_str': 'tidak'},
+   {'score': 0.13610956072807312, 'token': 32, 'token_str': 'juga'},
+   {'score': 0.0725034773349762, 'token': 584, 'token_str': 'bermain'},
+   {'score': 0.033740025013685226, 'token': 38, 'token_str': 'akan'}],
+  [{'score': 0.7111291885375977, 'token': 14, 'token_str': 'dengan'},
+   {'score': 0.10754624754190445, 'token': 17, 'token_str': 'untuk'},
+   {'score': 0.022657711058855057, 'token': 331, 'token_str': 'bersama'},
+   {'score': 0.020862115547060966, 'token': 25, 'token_str': 'sebagai'},
+   {'score': 0.013086902908980846, 'token': 11, 'token_str': 'di'}]]
+ ```
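+
+ A small optional addition, not part of the original snippet: when the model is used for inference only, the forward pass can run under `torch.no_grad()` to skip gradient tracking.
+
+ ```py
+ with torch.no_grad():  # optional: disables autograd bookkeeping during inference
+     raw_output = model(**tokenized_text)
+ ```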
+
+ ## Limitations and bias
+
+ Due to its low parameter count and case-sensitive tokenizer/model, this model is expected to have low performance on certain fine-tuned tasks. Like any language model, it reflects biases from its training dataset, which comes from various sources. Here's an example of how the model can produce biased predictions:
+
+ ```py
+ >>> pipe('Memasak dirumah adalah kewajiban seorang [MASK].')
+ [{'score': 0.16381049156188965,
+   'sequence': 'Memasak dirumah adalah kewajiban seorang budak.',
+   'token': 4910,
+   'token_str': 'budak'},
+  {'score': 0.1334381103515625,
+   'sequence': 'Memasak dirumah adalah kewajiban seorang wanita.',
+   'token': 649,
+   'token_str': 'wanita'},
+  {'score': 0.11588197946548462,
+   'sequence': 'Memasak dirumah adalah kewajiban seorang lelaki.',
+   'token': 6368,
+   'token_str': 'lelaki'},
+  {'score': 0.061377108097076416,
+   'sequence': 'Memasak dirumah adalah kewajiban seorang diri.',
+   'token': 258,
+   'token_str': 'diri'},
+  {'score': 0.04679233580827713,
+   'sequence': 'Memasak dirumah adalah kewajiban seorang gadis.',
+   'token': 6845,
+   'token_str': 'gadis'}]
+ ```

  ## Training and evaluation data

  ## Training Procedure

+ The model was pretrained on a single RTX 3060 for 8 epochs/51474 steps with an accumulated batch size of 128. Sequences were limited to 4096 tokens. The optimizer is AdamW with LR 1e-4, weight decay 0.01, learning rate warmup over the first 6% of steps (~3090 steps), and linear decay of the learning rate afterwards. However, due to an early configuration mistake, the first 2 epochs used LR 1e-3 instead. Additional information can be seen in the Tensorboard training logs.
+
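+ As a rough, hypothetical sketch of that setup with `transformers.TrainingArguments` (the exact script lives in the GitHub repository linked above; the output path and the split of the accumulated batch size into per-device batch size × accumulation steps are illustrative assumptions):
+
+ ```py
+ from transformers import TrainingArguments
+
+ # Hypothetical sketch only; not the exact training configuration.
+ training_args = TrainingArguments(
+     output_dir='./bigbird-small-indonesian',   # illustrative path
+     num_train_epochs=8,
+     per_device_train_batch_size=4,             # assumption: 4 x 32 = accumulated batch size 128
+     gradient_accumulation_steps=32,
+     learning_rate=1e-4,                        # first 2 epochs accidentally used 1e-3
+     weight_decay=0.01,
+     warmup_ratio=0.06,                         # warmup over the first ~6% of steps
+     lr_scheduler_type='linear',                # linear decay after warmup
+     evaluation_strategy='epoch',
+ )
+ ```
+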
  ## Evaluation

+ The model achieved the following results during training evaluation.
+
+ | Epoch | Steps | Eval. loss | Eval. perplexity |
+ | ----- | ----- | ---------- | ---------------- |
+ | 1     | 6249  | 2.466      | 11.775           |
+ | 2     | 12858 | 2.265      | 9.631            |
+ | 3     | 19329 | 2.127      | 8.390            |
+ | 4     | 25758 | 2.116      | 8.298            |
+ | 5     | 32187 | 2.097      | 8.141            |
+ | 6     | 38616 | 2.087      | 8.061            |
+ | 7     | 45045 | 2.081      | 8.012            |
+ | 8     | 51474 | 2.078      | 7.988            |
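+
+ For reference, the perplexity column is the exponential of the eval loss, which is consistent with all rows above; a minimal check in Python:
+
+ ```py
+ import math
+
+ # e.g. the final checkpoint: exp(2.078) ≈ 7.988
+ print(math.exp(2.078))
+ ```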
config.json CHANGED
@@ -1,4 +1,5 @@
  {
    "architectures": [
      "BigBirdForMaskedLM"
    ],
  {
+   "_name_or_path": "/mnt/encrypted_database/sum_nlp/checkpoint-model-bigbird-small-indonesian/checkpoint-12900-only-model",
    "architectures": [
      "BigBirdForMaskedLM"
    ],
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:41566374480eadd6f8e5968f7373e409df1616d0cd7b77847f4d208d32df8ed7
  size 122558078
  version https://git-lfs.github.com/spec/v1
+ oid sha256:7bc9c9edd2ba57a1c7daf77bdd003806a0857b1515a023f137b483e9fcfc0837
  size 122558078
runs/joined_logs/events.out.tfevents.1671528643.pop-os.46984.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:201771a39450395ab8c900f433586bd0a487438d9c0dfdd8db0c28a45c3b2c07
+ size 316301