w11wo's picture
Update README.md
14f10ed
|
raw
history blame
3.78 kB
---
language: su
tags:
- sundanese-roberta-base
license: mit
datasets:
- mc4
- cc100
- oscar
- wikipedia
widget:
- text: "Budi nuju <mask> di sakola."
---
## Sundanese RoBERTa Base
Sundanese RoBERTa Base is a masked language model based on the [RoBERTa](https://arxiv.org/abs/1907.11692) model. It was trained on four datasets: [OSCAR](https://hf.co/datasets/oscar)'s `unshuffled_deduplicated_su` subset, the Sundanese [mC4](https://hf.co/datasets/mc4) subset, the Sundanese [CC100](https://hf.co/datasets/cc100) subset, and Sundanese [Wikipedia](https://su.wikipedia.org/).
10% of the dataset is kept for evaluation purposes. The model was trained from scratch and achieved an evaluation loss of 1.952 and an evaluation accuracy of 63.98%.
This model was trained using HuggingFace's Flax framework. All necessary scripts used for training could be found in the [Files and versions](https://hf.co/w11wo/sundanese-roberta-base/tree/main) tab, as well as the [Training metrics](https://hf.co/w11wo/sundanese-roberta-base/tensorboard) logged via Tensorboard.
## Model
| Model | #params | Arch. | Training/Validation data (text) |
| ------------------------ | ------- | ------- | ------------------------------------- |
| `sundanese-roberta-base` | 124M | RoBERTa | OSCAR, mC4, CC100, Wikipedia (758 MB) |
## Evaluation Results
The model was trained for 50 epochs and the following is the final result once the training ended.
| train loss | valid loss | valid accuracy | total time |
| ---------- | ---------- | -------------- | ---------- |
| 1.965 | 1.952 | 0.6398 | 6:24:51 |
## How to Use
### As Masked Language Model
```python
from transformers import pipeline
pretrained_name = "w11wo/sundanese-roberta-base"
fill_mask = pipeline(
"fill-mask",
model=pretrained_name,
tokenizer=pretrained_name
)
fill_mask("Budi nuju <mask> di sakola.")
```
### Feature Extraction in PyTorch
```python
from transformers import RobertaModel, RobertaTokenizerFast
pretrained_name = "w11wo/sundanese-roberta-base"
model = RobertaModel.from_pretrained(pretrained_name)
tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)
prompt = "Budi nuju diajar di sakola."
encoded_input = tokenizer(prompt, return_tensors='pt')
output = model(**encoded_input)
```
## Disclaimer
Do consider the biases which came from all four datasets that may be carried over into the results of this model.
## Author
Sundanese RoBERTa Base was trained and evaluated by [Wilson Wongso](https://w11wo.github.io/).
## Citation Information
```bib
@article{rs-907893,
author = {Wongso, Wilson
and Lucky, Henry
and Suhartono, Derwin},
journal = {Journal of Big Data},
year = {2022},
month = {Feb},
day = {26},
abstract = {The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefits from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and do not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.},
issn = {2693-5015},
doi = {10.21203/rs.3.rs-907893/v1},
url = {https://doi.org/10.21203/rs.3.rs-907893/v1}
}
```