---
license: apache-2.0
language: 
  - ind
  - ace
  - ban
  - bjn
  - bug
  - gor
  - jav
  - min
  - msa
  - nia
  - sun
  - tet
language_bcp47:
  - jv-x-bms
datasets:
  - sabilmakbar/indo_wiki
  - acul3/KoPI-NLLB
  - uonlp/CulturaX
tags:
  - bert
---

# NusaBERT Base

[NusaBERT](https://arxiv.org/abs/2403.01817) Base is a multilingual, encoder-only language model built on the [BERT](https://arxiv.org/abs/1810.04805) architecture. We continued pre-training it on the open-source corpora [sabilmakbar/indo_wiki](https://huggingface.co/datasets/sabilmakbar/indo_wiki), [acul3/KoPI-NLLB](https://huggingface.co/datasets/acul3/KoPI-NLLB), and [uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX). On a held-out subset of the corpus, our model achieved:

- `eval_accuracy`: 0.6866
- `eval_loss`: 1.4876
- `perplexity`: 4.4266
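
The reported perplexity is the exponential of the held-out cross-entropy loss, which is easy to verify from the numbers above:

```python
import math

# perplexity = exp(cross-entropy loss)
print(math.exp(1.4876))  # ≈ 4.4266
```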

This model was trained using the [🤗Transformers](https://github.com/huggingface/transformers) PyTorch framework. All training was done on an NVIDIA H100 GPU. [LazarusNLP/NusaBERT-base](https://huggingface.co/LazarusNLP/NusaBERT-base) is released under the Apache 2.0 license.

## Model Details

- **Developed by**: [LazarusNLP](https://lazarusnlp.github.io/)
- **Finetuned from**: [IndoBERT base p1](https://huggingface.co/indobenchmark/indobert-base-p1)
- **Model type**: Encoder-based BERT language model
- **Language(s)**: Indonesian, Acehnese, Balinese, Banjarese, Buginese, Gorontalo, Javanese, Banyumasan, Minangkabau, Malay, Nias, Sundanese, Tetum
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html)
- **Contact**: [LazarusNLP](https://lazarusnlp.github.io/)

## Use in 🤗Transformers

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_checkpoint = "LazarusNLP/NusaBERT-base"

# load the tokenizer and masked-LM model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
```
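
For quick masked-token predictions, the same checkpoint also works with the `fill-mask` pipeline. A minimal sketch; the Indonesian example sentence is ours, and we assume the tokenizer's standard BERT-style `[MASK]` token:

```python
from transformers import pipeline

# fill-mask pipeline using the same checkpoint
fill_mask = pipeline("fill-mask", model="LazarusNLP/NusaBERT-base")

# "The capital of Indonesia is [MASK]." -- illustrative sentence only
for prediction in fill_mask("Ibu kota Indonesia adalah [MASK]."):
    print(prediction["token_str"], prediction["score"])
```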

## Training Datasets

Around 16B tokens from the following corpora were used during continued pre-training.

- [Indonesian Wikipedia Data Repository](https://huggingface.co/datasets/sabilmakbar/indo_wiki)
- [KoPI-NLLB (Korpus Perayapan Indonesia)](https://huggingface.co/datasets/acul3/KoPI-NLLB)
- [Cleaned, Enormous, and Public: The Multilingual Fuel to Democratize Large Language Models for 167 Languages](https://huggingface.co/datasets/uonlp/CulturaX)
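
All three corpora are hosted on the Hugging Face Hub and can be loaded with 🤗 Datasets. A minimal sketch; the config and split names below are assumptions (check each dataset card), not the exact preprocessing used for NusaBERT:

```python
from datasets import load_dataset

# stream each corpus so nothing is downloaded up front
indo_wiki = load_dataset("sabilmakbar/indo_wiki", split="train", streaming=True)
kopi_nllb = load_dataset("acul3/KoPI-NLLB", split="train", streaming=True)
# CulturaX is organized per language; "id" selects the Indonesian subset
culturax = load_dataset("uonlp/CulturaX", "id", split="train", streaming=True)

print(next(iter(culturax)))
```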

## Training Hyperparameters

The following hyperparameters were used during training:

- `learning_rate`: 0.0003
- `train_batch_size`: 256
- `eval_batch_size`: 256
- `seed`: 42
- `optimizer`: Adam with `betas=(0.9,0.999)` and `epsilon=1e-08`
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_steps`: 24000
- `training_steps`: 500000
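
These settings correspond closely to 🤗 Transformers' `TrainingArguments`. A minimal sketch assuming the standard `Trainer` was used; `output_dir` is a placeholder, and the per-device batch sizes assume the single-GPU setup described above:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="nusabert-base",       # hypothetical output path
    learning_rate=3e-4,
    per_device_train_batch_size=256,  # matches train_batch_size on one GPU
    per_device_eval_batch_size=256,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=24_000,
    max_steps=500_000,
)
```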

### Framework versions

- Transformers 4.37.2
- PyTorch 2.2.0+cu118
- Datasets 2.17.1
- Tokenizers 0.15.1

## Credits

NusaBERT Base is developed with love by:

<div style="display: flex;">
<a href="https://github.com/anantoj">
    <img src="https://github.com/anantoj.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>

<a href="https://github.com/DavidSamuell">
    <img src="https://github.com/DavidSamuell.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>

<a href="https://github.com/stevenlimcorn">
    <img src="https://github.com/stevenlimcorn.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>

<a href="https://github.com/w11wo">
    <img src="https://github.com/w11wo.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>
</div>

## Citation

```bib
@misc{wongso2024nusabert,
  title={NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural}, 
  author={Wilson Wongso and David Samuel Setiawan and Steven Limcorn and Ananto Joyoadikusumo},
  year={2024},
  eprint={2403.01817},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```