NusaBERT Large

NusaBERT Large is a multilingual encoder-based language model built on the BERT architecture. We conducted continued pre-training on the open-source corpora sabilmakbar/indo_wiki, acul3/KoPI-NLLB, and uonlp/CulturaX. On a held-out subset of the corpus, our model achieved:

  • eval_accuracy: 0.7117
  • eval_loss: 1.3268
  • perplexity: 3.7690
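
The reported perplexity is consistent with the evaluation loss, since perplexity is the exponential of the cross-entropy loss; a quick check in Python:

import math

# Perplexity is the exponential of the evaluation (cross-entropy) loss
eval_loss = 1.3268
print(round(math.exp(eval_loss), 4))  # -> 3.769, matching the reported perplexity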

This model was trained with the 🤗 Transformers framework on PyTorch. All training was done on an NVIDIA H100 GPU. LazarusNLP/NusaBERT-large is released under the Apache 2.0 license.

Model Details

  • Developed by: LazarusNLP
  • Finetuned from: IndoBERT Large p1
  • Model type: Encoder-based BERT language model (337M parameters, F32)
  • Language(s): Indonesian, Acehnese, Balinese, Banjarese, Buginese, Gorontalo, Javanese, Banyumasan, Minangkabau, Malay, Nias, Sundanese, Tetum
  • License: Apache 2.0
  • Contact: LazarusNLP

Use in 🤗Transformers

from transformers import AutoTokenizer, AutoModelForMaskedLM

model_checkpoint = "LazarusNLP/NusaBERT-large"

# Download the tokenizer and masked-language-model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
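
Because the model is trained with a masked-language-modeling objective, it can also be queried directly through the fill-mask pipeline. The Indonesian prompt below is purely illustrative:

from transformers import pipeline

# Query the model through the fill-mask task it was trained for
fill_mask = pipeline("fill-mask", model="LazarusNLP/NusaBERT-large")

# Illustrative Indonesian prompt ("The capital of Indonesia is [MASK].");
# [MASK] is the mask token expected by BERT-style tokenizers
for prediction in fill_mask("Ibu kota Indonesia adalah [MASK]."):
    print(prediction["token_str"], prediction["score"])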

Training Datasets

Around 16B tokens from the following corpora were used during pre-training: sabilmakbar/indo_wiki, acul3/KoPI-NLLB, and uonlp/CulturaX.
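
As a point of reference, here is a minimal sketch of streaming one of these corpora with 🤗 Datasets. The "id" config name for CulturaX is an assumption based on that dataset's per-language subsets, so check each dataset card for the exact configuration:

from datasets import load_dataset

# Stream the Indonesian subset of CulturaX instead of downloading it in full;
# the "id" config name is an assumption, verify it on the dataset card
culturax_id = load_dataset("uonlp/CulturaX", "id", split="train", streaming=True)

# Peek at the first document's text field
print(next(iter(culturax_id))["text"][:200])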

Training Hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 256
  • eval_batch_size: 256
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 24000
  • training_steps: 500000
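
For readers reproducing this setup, here is a minimal sketch of how these values would map onto 🤗 TrainingArguments under the standard Trainer API; the actual training script may differ, and the output directory is a placeholder:

from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters onto TrainingArguments;
# whether batch size 256 is per device or global is not stated in the card
training_args = TrainingArguments(
    output_dir="nusabert-large-pretraining",  # placeholder path
    learning_rate=3e-5,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=24_000,
    max_steps=500_000,
)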

Framework versions

  • Transformers 4.38.1
  • Pytorch 2.2.0+cu118
  • Datasets 2.17.1
  • Tokenizers 0.15.2

Credits

NusaBERT Large is developed with love by Wilson Wongso, David Samuel Setiawan, Steven Limcorn, and Ananto Joyoadikusumo.

Citation

@misc{wongso2024nusabert,
  title={NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural}, 
  author={Wilson Wongso and David Samuel Setiawan and Steven Limcorn and Ananto Joyoadikusumo},
  year={2024},
  eprint={2403.01817},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}