---
language: id
license: mit
datasets:
- oscar
- wikipedia
- id_newspapers_2018
widget:
- text: "Saya [MASK] makan nasi goreng."
- text: "Kucing itu sedang bermain dengan [MASK]."
---

# Indonesian small BigBird model

**Disclaimer:** This is a work in progress. The current checkpoint has been trained for ~6.0 epochs (38,700 steps) and reaches an evaluation loss of 2.087. Newer checkpoints and additional information will be added in the future.

## Model Description

This model was pretrained **only** with the masked language modeling objective. The architecture of this model is shown in the configuration snippet below. The tokenizer was trained on the whole **cased** dataset with a vocabulary size of **only** 30K (see the sketch in the appendix at the end of this card).

```py
config = BigBirdConfig(
    vocab_size = 30_000,
    hidden_size = 512,
    num_hidden_layers = 4,
    num_attention_heads = 8,
    intermediate_size = 2048,
    max_position_embeddings = 4096,
    is_encoder_decoder = False,
    attention_type = 'block_sparse'
)
```

## How to use

> TBD

## Limitations and bias

> TBD

## Training and evaluation data

This model was pretrained on [Indonesian Wikipedia](https://huggingface.co/datasets/wikipedia) (dump from 2022-10-20), the `unshuffled_deduplicated_id` subset of [OSCAR](https://huggingface.co/datasets/oscar), and [Indonesian Newspaper 2018](https://huggingface.co/datasets/id_newspapers_2018). Preprocessing follows the functions from the [task guides - language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling#preprocess) page with a block size of 4096. Each dataset is split using [`train_test_split`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split), allocating 5% as evaluation data (a preprocessing sketch is included in the appendix at the end of this card).

## Training Procedure

> TBD

## Evaluation

> TBD
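
## Appendix: illustrative sketches

The card does not state which library was used to train the tokenizer. Since BigBird's stock tokenizer is SentencePiece-based, the snippet below is only a rough sketch of training a cased 30K-vocabulary SentencePiece model; the corpus file name and model prefix are hypothetical placeholders, not the actual artifacts behind this checkpoint.

```py
import sentencepiece as spm

# Hypothetical plain-text file containing the concatenated, cased pretraining corpus.
spm.SentencePieceTrainer.train(
    input="indonesian_corpus.txt",
    model_prefix="indonesian_bigbird_spm",  # hypothetical output prefix
    vocab_size=30_000,                      # matches the 30K vocabulary described above
    model_type="unigram",                   # BigBird's default tokenizer is a SentencePiece/Unigram model
)
```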
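
The preprocessing code itself is not published in this card; the following is a minimal sketch of the steps described in *Training and evaluation data* under the stated settings (block size 4096, 5% evaluation split), shown for the OSCAR subset only. The tokenizer path is a hypothetical placeholder, and the same steps would be repeated for the other two corpora.

```py
from datasets import load_dataset
from transformers import AutoTokenizer

BLOCK_SIZE = 4096

# Hypothetical local path to the tokenizer described above.
tokenizer = AutoTokenizer.from_pretrained("path/to/indonesian-bigbird-tokenizer")

# One of the three corpora; Wikipedia and id_newspapers_2018 would be handled the same way.
raw = load_dataset("oscar", "unshuffled_deduplicated_id", split="train")

def tokenize(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # Concatenate all tokenized texts, then cut them into fixed blocks of
    # BLOCK_SIZE tokens, dropping the remainder (as in the linked task guide).
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // BLOCK_SIZE) * BLOCK_SIZE
    return {
        k: [t[i : i + BLOCK_SIZE] for i in range(0, total_length, BLOCK_SIZE)]
        for k, t in concatenated.items()
    }

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
lm_dataset = tokenized.map(group_texts, batched=True)

# Hold out 5% of the blocks as evaluation data.
splits = lm_dataset.train_test_split(test_size=0.05)
train_data, eval_data = splits["train"], splits["test"]
```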