	---
	language: su
	tags:
	- sundanese-roberta-base
	license: mit
	datasets:
	- mc4
	- cc100
	- oscar
	- wikipedia
	widget:
	- text: "Budi nuju <mask> di sakola."
	---

	## Sundanese RoBERTa Base

	Sundanese RoBERTa Base is a masked language model based on the [RoBERTa](https://arxiv.org/abs/1907.11692) model. It was trained on four datasets: [OSCAR](https://hf.co/datasets/oscar)'s `unshuffled_deduplicated_su` subset, the Sundanese [mC4](https://hf.co/datasets/mc4) subset, the Sundanese [CC100](https://hf.co/datasets/cc100) subset, and Sundanese [Wikipedia](https://su.wikipedia.org/).

	10% of the combined dataset was held out for evaluation. The model was trained from scratch and reached an evaluation loss of 1.952 and an evaluation accuracy of 63.98%.

	This model was trained with Flax through Hugging Face Transformers. All scripts used for training can be found in the [Files and versions](https://hf.co/w11wo/sundanese-roberta-base/tree/main) tab, and the [Training metrics](https://hf.co/w11wo/sundanese-roberta-base/tensorboard) tab shows the metrics logged via TensorBoard.

	## Model

	| Model                    | #params | Arch.   | Training/Validation data (text)       |
	| ------------------------ | ------- | ------- | ------------------------------------- |
	| `sundanese-roberta-base` | 124M    | RoBERTa | OSCAR, mC4, CC100, Wikipedia (758 MB) |

	## Evaluation Results

	The model was trained for 50 epochs. The table below reports the final results at the end of training.

	| train loss | valid loss | valid accuracy | total time |
	| ---------- | ---------- | -------------- | ---------- |
	| 1.965      | 1.952      | 0.6398         | 6:24:51    |
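
For readers who prefer perplexity, the losses in the table above can be converted with a short calculation. This is a sketch that assumes the reported losses are mean per-token cross-entropy in nats (natural logarithm), which this model card does not state explicitly. The function name `perplexity` is introduced here for illustration only.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# Losses copied from the results table; the nats assumption is ours.
train_ppl = perplexity(1.965)
valid_ppl = perplexity(1.952)

print(f"train perplexity: {train_ppl:.2f}")
print(f"valid perplexity: {valid_ppl:.2f}")
```

Under that assumption, the validation perplexity comes out at roughly 7.04, slightly below the training perplexity.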

	## How to Use

	### As Masked Language Model

	```python
	from transformers import pipeline

	pretrained_name = "w11wo/sundanese-roberta-base"

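	# Load the model and its tokenizer into a fill-mask pipeline.
	fill_mask = pipeline(
	    "fill-mask",
	    model=pretrained_name,
	    tokenizer=pretrained_name,
	)

	# Returns a list of candidate fillers for <mask>, each with a score.
	fill_mask("Budi nuju <mask> di sakola.")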
	```
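
The fill-mask pipeline returns a list of dictionaries, one per candidate, with keys that include `score`, `token_str`, and `sequence`. As a minimal sketch, the hypothetical helper `summarize_fill_mask` below (not part of `transformers`) ranks those candidates by score and formats them for display.

```python
from typing import Dict, List


def summarize_fill_mask(predictions: List[Dict], top_k: int = 3) -> List[str]:
    """Format fill-mask candidates as 'token (score)' strings, best first."""
    # Sort defensively by score rather than relying on the input order.
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    # token_str may carry a leading space from the tokenizer, so strip it.
    return [
        f"{p['token_str'].strip()} ({p['score']:.3f})"
        for p in ranked[:top_k]
    ]
```

With the pipeline from the snippet above, this could be called as `summarize_fill_mask(fill_mask("Budi nuju <mask> di sakola."))`.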

	### Feature Extraction in PyTorch

	```python
	from transformers import RobertaModel, RobertaTokenizerFast

	pretrained_name = "w11wo/sundanese-roberta-base"
	model = RobertaModel.from_pretrained(pretrained_name)
	tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)

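	prompt = "Budi nuju diajar di sakola."
	encoded_input = tokenizer(prompt, return_tensors="pt")
	output = model(**encoded_input)

	# Per-token features, shape (batch_size, sequence_length, hidden_size).
	features = output.last_hidden_state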
	```
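
`output.last_hidden_state` holds one vector per token. One common way to reduce these to a single sentence vector is mean pooling over the non-padding tokens. That choice is an assumption about downstream use, not something this model card prescribes. The sketch below uses plain Python lists to stay dependency-free, and `mean_pool` is a hypothetical helper name.

```python
from typing import List, Sequence


def mean_pool(
    token_vectors: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average the token vectors whose attention-mask entry is non-zero."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    # zip(*kept) iterates over hidden dimensions, one column at a time.
    return [sum(column) / len(kept) for column in zip(*kept)]
```

With the objects from the snippet above, this could be called as `mean_pool(output.last_hidden_state[0].detach().tolist(), encoded_input["attention_mask"][0].tolist())`. For batches, the same averaging is usually done directly with tensor operations.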

	## Disclaimer

	Biases present in any of the four training datasets may carry over into this model's outputs, so take them into account when using its results.

	## Author

	Sundanese RoBERTa Base was trained and evaluated by [Wilson Wongso](https://w11wo.github.io/).

	## Citation Information

	```bib
	@article{rs-907893,
	author = {Wongso, Wilson
	and Lucky, Henry
	and Suhartono, Derwin},
	journal = {Journal of Big Data},
	year = {2022},
	month = {Feb},
	day = {26},
	abstract = {The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefits from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and do not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.},
	issn = {2693-5015},
	doi = {10.21203/rs.3.rs-907893/v1},
	url = {https://doi.org/10.21203/rs.3.rs-907893/v1}
	}
	```