---
language: id
license: mit
datasets:
- oscar
- wikipedia
- id_newspapers_2018
widget:
- text: "Saya [MASK] makan nasi goreng."
- text: "Kucing itu sedang bermain dengan [MASK]."
---
# Indonesian small BigBird model
**Disclaimer:** This is work in progress. The current checkpoint has been trained for ~6.0 epochs (38,700 steps), reaching an eval loss of 2.087. A newer checkpoint and additional information will be added in the future.
## Model Description
This model was pretrained with the masked language modeling (MLM) objective **only**. Its architecture is shown in the configuration snippet below. The tokenizer was trained on the whole **cased** dataset with a vocabulary size of **only** 30K.
```py
from transformers import BigBirdConfig

# Small BigBird: 4 layers, 512 hidden size, block-sparse attention, 4096 max positions
config = BigBirdConfig(
    vocab_size=30_000,
    hidden_size=512,
    num_hidden_layers=4,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=4096,
    is_encoder_decoder=False,
    attention_type="block_sparse",
)
```
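As a rough sketch (not the exact pretraining script), this configuration can be used to instantiate a randomly initialized model for MLM pretraining:
```py
from transformers import BigBirdForMaskedLM

# Randomly initialized small BigBird built from the configuration above
model = BigBirdForMaskedLM(config)
print(f"Parameters: {model.num_parameters():,}")
```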
## How to use
> TBD
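
In the meantime, here is a minimal sketch of querying the checkpoint with the `fill-mask` pipeline; the model id below is a placeholder, not this repository's confirmed name:
```py
from transformers import pipeline

# Placeholder model id - substitute this repository's actual id on the Hub
fill_mask = pipeline("fill-mask", model="path/to/this-model")
print(fill_mask("Saya [MASK] makan nasi goreng."))
```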
## Limitations and bias
> TBD
## Training and evaluation data
This model was pretrained on [Indonesian Wikipedia](https://huggingface.co/datasets/wikipedia) (dump from 2022-10-20), the `unshuffled_deduplicated_id` subset of [OSCAR](https://huggingface.co/datasets/oscar), and [Indonesian Newspaper 2018](https://huggingface.co/datasets/id_newspapers_2018). Preprocessing follows the function from the [language modeling task guide](https://huggingface.co/docs/transformers/tasks/language_modeling#preprocess) with a block size of 4096. Each dataset is split using [`train_test_split`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split), holding out 5% as evaluation data, as sketched below.
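A rough sketch of the split and block-grouping described above, assuming preprocessing follows the linked task guide (the `group_texts` helper here is illustrative, not the exact training code):
```py
from datasets import load_dataset

# Illustrative: load one of the pretraining corpora and hold out 5% for evaluation
oscar = load_dataset("oscar", "unshuffled_deduplicated_id", split="train")
splits = oscar.train_test_split(test_size=0.05)
train_data, eval_data = splits["train"], splits["test"]

block_size = 4096  # each training example is a block of 4096 tokens

def group_texts(examples):
    # Concatenate already-tokenized texts and chunk them into fixed-size blocks,
    # following the language modeling task guide linked above
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
```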
## Training Procedure
> TBD
## Evaluation
> TBD