---
language: id
license: mit
datasets:
- oscar
- wikipedia
- id_newspapers_2018
widget:
- text: "Saya [MASK] makan nasi goreng."
- text: "Kucing itu sedang bermain dengan [MASK]."
---

# Indonesian small BigBird model

**Disclaimer:** This is a work in progress. The current checkpoint has been trained for ~6.0 epochs (38,700 steps) and reaches an evaluation loss of 2.087. Newer checkpoints and additional information will be added in the future.

## Model Description

This model was pretrained **only** with the masked language modeling objective. The architecture of this model is shown in the configuration snippet below. The tokenizer was trained on the whole **cased** dataset with a vocabulary size of **only** 30K (see the sketch in the appendix at the end of this card).

```py
config = BigBirdConfig(
    vocab_size = 30_000,
    hidden_size = 512,
    num_hidden_layers = 4,
    num_attention_heads = 8,
    intermediate_size = 2048,
    max_position_embeddings = 4096,
    is_encoder_decoder = False,
    attention_type = 'block_sparse'
)
```

## How to use

> TBD

## Limitations and bias

> TBD

## Training and evaluation data

This model was pretrained on [Indonesian Wikipedia](https://huggingface.co/datasets/wikipedia) (dump from 2022-10-20), the `unshuffled_deduplicated_id` subset of [OSCAR](https://huggingface.co/datasets/oscar), and [Indonesian Newspaper 2018](https://huggingface.co/datasets/id_newspapers_2018). Preprocessing follows the functions from the [task guides - language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling#preprocess) page with a block size of 4096. Each dataset is split using [`train_test_split`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split), allocating 5% as evaluation data (a preprocessing sketch is included in the appendix at the end of this card).

## Training Procedure

> TBD

## Evaluation

> TBD
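
## Appendix: illustrative sketches

The card does not state which library was used to train the tokenizer. Since BigBird's stock tokenizer is SentencePiece-based, the snippet below is only a rough sketch of training a cased 30K-vocabulary SentencePiece model; the corpus file name and model prefix are hypothetical placeholders, not the actual artifacts behind this checkpoint.

```py
import sentencepiece as spm

# Hypothetical plain-text file containing the concatenated, cased pretraining corpus.
spm.SentencePieceTrainer.train(
    input="indonesian_corpus.txt",
    model_prefix="indonesian_bigbird_spm",  # hypothetical output prefix
    vocab_size=30_000,                      # matches the 30K vocabulary described above
    model_type="unigram",                   # BigBird's default tokenizer is a SentencePiece/Unigram model
)
```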
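
The preprocessing code itself is not published in this card; the following is a minimal sketch of the steps described in *Training and evaluation data* under the stated settings (block size 4096, 5% evaluation split), shown for the OSCAR subset only. The tokenizer path is a hypothetical placeholder, and the same steps would be repeated for the other two corpora.

```py
from datasets import load_dataset
from transformers import AutoTokenizer

BLOCK_SIZE = 4096

# Hypothetical local path to the tokenizer described above.
tokenizer = AutoTokenizer.from_pretrained("path/to/indonesian-bigbird-tokenizer")

# One of the three corpora; Wikipedia and id_newspapers_2018 would be handled the same way.
raw = load_dataset("oscar", "unshuffled_deduplicated_id", split="train")

def tokenize(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all tokenized texts, then cut them into fixed blocks of
    # BLOCK_SIZE tokens, dropping the remainder (as in the linked task guide).
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // BLOCK_SIZE) * BLOCK_SIZE
    return {
        k: [t[i : i + BLOCK_SIZE] for i in range(0, total_length, BLOCK_SIZE)]
        for k, t in concatenated.items()
    }

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
lm_dataset = tokenized.map(group_texts, batched=True)

# Hold out 5% of the blocks as evaluation data.
splits = lm_dataset.train_test_split(test_size=0.05)
train_data, eval_data = splits["train"], splits["test"]
```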