Update README.md

---
license: apache-2.0
language:
- ind
- ace
- ban
- bjn
- bug
- gor
- jav
- min
- msa
- nia
- sun
- tet
language_bcp47:
- jv-x-bms
datasets:
- sabilmakbar/indo_wiki
- acul3/KoPI-NLLB
- uonlp/CulturaX
tags:
- bert
---

# NusaBERT Large

NusaBERT Large is a multilingual encoder-based language model based on the [BERT](https://arxiv.org/abs/1810.04805) architecture. We conducted continued pre-training on the open-source corpora [sabilmakbar/indo_wiki](https://huggingface.co/datasets/sabilmakbar/indo_wiki), [acul3/KoPI-NLLB](https://huggingface.co/datasets/acul3/KoPI-NLLB), and [uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX). On a held-out subset of the corpus, our model achieved:

- `eval_accuracy`: 0.7117
- `eval_loss`: 1.3268
- `perplexity`: 3.7690
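
The reported perplexity is simply the exponential of the evaluation loss, so the numbers above are internally consistent; a quick sanity check:

```python
import math

# perplexity = exp(cross-entropy evaluation loss)
print(math.exp(1.3268))  # ≈ 3.769, matching the reported perplexity
```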

This model was trained using the [🤗Transformers](https://github.com/huggingface/transformers) PyTorch framework. All training was done on an NVIDIA H100 GPU. [LazarusNLP/NusaBERT-large](https://huggingface.co/LazarusNLP/NusaBERT-large) is released under the Apache 2.0 license.

## Model Details

- **Developed by**: [LazarusNLP](https://lazarusnlp.github.io/)
- **Finetuned from**: [IndoBERT Large p1](https://huggingface.co/indobenchmark/indobert-large-p1)
- **Model type**: Encoder-based BERT language model
- **Language(s)**: Indonesian, Acehnese, Balinese, Banjarese, Buginese, Gorontalo, Javanese, Banyumasan, Minangkabau, Malay, Nias, Sundanese, Tetum
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html)
- **Contact**: [LazarusNLP](https://lazarusnlp.github.io/)

## Use in 🤗Transformers

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_checkpoint = "LazarusNLP/NusaBERT-large"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
```
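
For masked-token prediction, a minimal usage sketch (the checkpoint name is from the card above; the example sentence is illustrative only):

```python
from transformers import pipeline

# fill-mask pipeline with the NusaBERT Large checkpoint
fill_mask = pipeline("fill-mask", model="LazarusNLP/NusaBERT-large")

# BERT-style [MASK] token; the Indonesian sentence is a made-up example
print(fill_mask("Ibu kota Indonesia adalah [MASK]."))
```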

## Training Datasets

Around 16B tokens from the following corpora were used during pre-training (see the loading sketch after the list):

- [Indonesian Wikipedia Data Repository](https://huggingface.co/datasets/sabilmakbar/indo_wiki)
- [KoPI-NLLB (Korpus Perayapan Indonesia)](https://huggingface.co/datasets/acul3/KoPI-NLLB)
- [Cleaned, Enormous, and Public: The Multilingual Fuel to Democratize Large Language Models for 167 Languages](https://huggingface.co/datasets/uonlp/CulturaX)
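
Each corpus is available on the Hugging Face Hub; a hedged loading sketch, where the configuration and split names are assumptions to be checked against each dataset card:

```python
from datasets import load_dataset

# stream one of the pre-training corpora; the default config and "train" split
# are assumptions and may need adjusting per the dataset card
indo_wiki = load_dataset("sabilmakbar/indo_wiki", split="train", streaming=True)
print(next(iter(indo_wiki)))
```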

## Training Hyperparameters

The following hyperparameters were used during training (see the `TrainingArguments` sketch after the list):

- `learning_rate`: 3e-05
- `train_batch_size`: 256
- `eval_batch_size`: 256
- `seed`: 42
- `optimizer`: Adam with `betas=(0.9,0.999)` and `epsilon=1e-08`
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_steps`: 24000
- `training_steps`: 500000
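
These values map directly onto 🤗 Transformers `TrainingArguments`; a minimal sketch, assuming the standard `Trainer` API was used (the `output_dir` is a placeholder, not the authors' actual script):

```python
from transformers import TrainingArguments

# hedged reconstruction of the listed hyperparameters
training_args = TrainingArguments(
    output_dir="nusabert-large",        # placeholder path
    learning_rate=3e-5,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=24_000,
    max_steps=500_000,
)
```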

### Framework versions

- PyTorch 2.2.0+cu118
- Datasets 2.17.1
- Tokenizers 0.15.2

## Credits

NusaBERT Large is developed with love by:

<div style="display: flex;">
<a href="https://github.com/anantoj">
    <img src="https://github.com/anantoj.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>

<a href="https://github.com/DavidSamuell">
    <img src="https://github.com/DavidSamuell.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>

<a href="https://github.com/stevenlimcorn">
    <img src="https://github.com/stevenlimcorn.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>

<a href="https://github.com/w11wo">
    <img src="https://github.com/w11wo.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;">
</a>
</div>