HiliSenti-v1-model

Model Description

HiliSenti-v1-model is a fine-tuned XLM-RoBERTa-large model (about 560M parameters) for ternary sentiment classification (Negative, Neutral, Positive) of Hiligaynon text. It was trained on the HiliSenti v1 dataset, the first publicly available multi-domain sentiment analysis dataset for Hiligaynon.

On the held-out test split (2,242 sentences), the model achieves 93.5% accuracy and a macro F1 of 93.4%, with per-class F1 scores of 0.95 (Negative), 0.91 (Neutral), and 0.94 (Positive).

Intended Uses

Sentiment analysis of Hiligaynon customer feedback
Social media monitoring for Hiligaynon-speaking communities (Negros Occidental, Panay Island, Soccsksargen)
Educational and civic technology applications
Research on low-resource Philippine languages

Out-of-Scope Uses

Fine-grained emotion detection
Aspect-based sentiment analysis
High-stakes decision-making without human verification
Languages other than Hiligaynon (the multilingual base model may carry over some ability, but this model was neither tuned nor evaluated on them)

Training Data

The model was trained on the HiliSenti v1 dataset:

Split	Sentences	Negative	Neutral	Positive
Train	18,854	6,817	5,834	6,203
Validation	2,241	~810	~694	~737
Test	2,242	828	633	781
Total	23,337	~8,455	~7,161	~7,721

Data sources include:

Digicast Negros (news articles) – ~10,000 sentences
Facebook & Reddit (social media) – ~2,000 sentences
Cross-lingual translation (product reviews, student feedback) – ~5,600 sentences
Synthetic augmentation (edge cases: sarcasm, extreme sentiments) – ~5,600 sentences

Training Procedure

Base model: xlm-roberta-large (about 560M parameters)
Tokenizer: XLM-RoBERTa-large SentencePiece tokenizer with 43 custom Hiligaynon tokens added
Max sequence length: 128 tokens
Batch size: 16 per step, with gradient accumulation over 2 steps for an effective batch size of 32
Learning rate: 2e-5 with cosine schedule and 10% warm-up
Optimizer: AdamW (fused)
Epochs: up to 5 (early stopping with patience 3)
Label smoothing: 0.1
Class weighting: Balanced weights applied to cross-entropy loss
Mixed precision: FP16
Compute: Google Colab free tier (Tesla T4 GPU) with manual checkpoint pruning to fit 15 GB Google Drive storage
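
The hyperparameters above imply a few derived quantities that are worth making explicit. The sketch below works them out in plain Python from the training-split counts in the Training Data table. It assumes that "balanced" class weights follow the n_samples / (n_classes * class_count) formula (the convention behind scikit-learn's class_weight="balanced") and that one optimizer step consumes one effective batch of 32. Neither detail is confirmed by the list above, and the variable names are illustrative.

```python
import math

# Training-split class counts, taken from the Training Data table.
train_counts = {"Negative": 6817, "Neutral": 5834, "Positive": 6203}
n_train = sum(train_counts.values())  # 18,854 sentences

# Assumption: "balanced" means n_samples / (n_classes * class_count).
# The model card does not spell out the exact formula.
n_classes = len(train_counts)
class_weights = {
    label: n_train / (n_classes * count)
    for label, count in train_counts.items()
}

# Batch size 16 with an effective batch size of 32 implies 2 accumulation steps.
per_step_batch = 16
effective_batch = 32
accumulation_steps = effective_batch // per_step_batch

# Assumption: one optimizer step per effective batch, last partial batch kept.
steps_per_epoch = math.ceil(n_train / effective_batch)
max_epochs = 5
max_steps = steps_per_epoch * max_epochs  # upper bound; early stopping may end sooner
warmup_steps = round(0.10 * max_steps)    # 10% warm-up

for label, weight in class_weights.items():
    print(f"{label}: {weight:.3f}")
print(accumulation_steps, steps_per_epoch, max_steps, warmup_steps)
```

Under these assumptions the Neutral class, which has the fewest training sentences, receives the largest weight (roughly 1.08), and Negative the smallest (roughly 0.92). The weights stay close to 1 because the training split is only mildly imbalanced.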

Evaluation Results

Metric	Negative	Neutral	Positive	Overall
Precision	0.95	0.93	0.93	—
Recall	0.95	0.90	0.95	—
F1-Score	0.95	0.91	0.94	—
Accuracy	—	—	—	93.5%
Macro F1	—	—	—	93.4%
Balanced Accuracy	—	—	—	93.3%

At 93.5% test accuracy, the model exceeds the original project target of 80% accuracy by 13.5 percentage points.
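
The aggregate rows of the table can be cross-checked against the per-class rows. The sketch below recomputes per-class F1, macro F1, and balanced accuracy from the rounded precision and recall values in the table. Because the inputs are rounded to two decimals, the aggregates can differ from the reported figures in the last digit.

```python
# Per-class precision and recall as reported (rounded) in the table above.
precision = {"Negative": 0.95, "Neutral": 0.93, "Positive": 0.93}
recall = {"Negative": 0.95, "Neutral": 0.90, "Positive": 0.95}

def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

per_class_f1 = {
    label: f1_score(precision[label], recall[label]) for label in precision
}

# Macro F1: unweighted mean of the per-class F1 scores.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Balanced accuracy: unweighted mean of the per-class recalls.
balanced_accuracy = sum(recall.values()) / len(recall)

for label, score in per_class_f1.items():
    print(f"{label}: F1 = {score:.2f}")
print(f"macro F1 = {macro_f1:.3f}, balanced accuracy = {balanced_accuracy:.3f}")
```

The recomputed per-class F1 scores round to the 0.95, 0.91, and 0.94 shown in the table, and both aggregates land within rounding distance of the reported 93.4% macro F1 and 93.3% balanced accuracy.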

Limitations

Single annotator: Labels were assigned by a single native Hiligaynon speaker. A formal inter-annotator agreement study is planned for a future version.
Dialectal bias: The dataset is weighted toward Negros Occidental Hiligaynon; Panay and Soccsksargen varieties are under-represented.
Task scope: Only sentence-level ternary sentiment is supported.
Sarcasm: Despite synthetic augmentation, sarcasm detection remains imperfect.
No cross-lingual evaluation: Unlike HILIGAYNER, this model has not yet been evaluated on zero-shot transfer to Cebuano or Tagalog.

Ethical Considerations

All data was collected from publicly available sources (news websites, public social media posts, public Reddit communities) and existing publicly licensed datasets.
The dataset contains real names, locations, and descriptions of violence. A content warning is included in the dataset documentation.
Users are advised to apply their own anonymization or filtering pipelines if required by their institutional privacy regulations.
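
One possible starting point for such a filtering step is sketched below. It masks e-mail addresses, URLs, and @-mentions with regular expressions before text is stored or passed to the model. The patterns, the placeholder tokens, and the function name mask_identifiers are illustrative assumptions and are not part of the HiliSenti tooling. Pattern matching of this kind does not catch personal names or locations, which would need a dedicated named-entity step.

```python
import re

# Illustrative patterns only; they are deliberately simple and not exhaustive.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")

def mask_identifiers(text: str) -> str:
    """Replace e-mail addresses, URLs, and @-mentions with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)  # run first so e-mails are not split by the mention rule
    text = URL_RE.sub("[URL]", text)
    text = MENTION_RE.sub("[USER]", text)
    return text

sample = "Contact @juan or juan@example.com, details at https://example.com/post"
print(mask_identifiers(sample))
```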

How to Use

Via Transformers

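import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_id = "jjjardev/hilisenti-v1-model"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.eval()  # disable dropout for inference

# Example inference (inputs are truncated to the 128-token training length)
sentence = "Sobrang sarap ng pagkain dito sa restaurant na ito."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():  # no gradients are needed at inference time
    outputs = model(**inputs)
prediction = outputs.logits.argmax(dim=-1).item()

# Map the class index to its label
labels = ["Negative", "Neutral", "Positive"]
print(labels[prediction])  # Positive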
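
The argmax step above discards how confident the model is. A minimal sketch of turning the three logits into per-class probabilities with a softmax follows. It uses plain Python so that it runs without downloading the model; the example logits are made up for illustration, and the helper name logits_to_probabilities is hypothetical. The label order is the one given in the mapping above.

```python
import math

LABELS = ["Negative", "Neutral", "Positive"]  # same order as the mapping above

def logits_to_probabilities(logits: list[float]) -> dict[str, float]:
    """Apply a numerically stable softmax and pair each probability with its label."""
    shift = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(LABELS, exps)}

# Made-up logits standing in for outputs.logits[0].tolist()
example_logits = [-1.2, 0.3, 2.5]
probs = logits_to_probabilities(example_logits)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```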

Using the Pipeline API

from transformers import pipeline

classifier = pipeline("text-classification", model="jjjardev/hilisenti-v1-model")
result = classifier("Napakabagal ng internet connection namin ngayon.")
print(result)  # e.g. [{'label': 'Negative', 'score': 0.93}]
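
Because high-stakes use without human verification is out of scope, one option is to route low-confidence predictions to a human reviewer. The sketch below works on results in the list-of-dicts shape that the pipeline returns (label and score keys). The 0.80 threshold and the function name needs_review are hypothetical; a real cut-off would need tuning on validation data, particularly since training used label smoothing, which tends to lower peak scores.

```python
REVIEW_THRESHOLD = 0.80  # hypothetical cut-off, not a value from the model card

def needs_review(result: dict, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Return True when the top prediction's score falls below the threshold."""
    return result["score"] < threshold

# Hand-written results in the pipeline's output shape, for illustration only.
results = [
    {"label": "Negative", "score": 0.93},
    {"label": "Neutral", "score": 0.55},
]

auto_accepted = [r for r in results if not needs_review(r)]
for_review = [r for r in results if needs_review(r)]
print(len(auto_accepted), "accepted;", len(for_review), "sent to review")
```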

Citation

If you use this model or the HiliSenti dataset in your research, please cite:

@misc{jessie_james_jarder_2026,
  author    = {Jessie James Jarder},
  title     = {hilisenti-v1-model (Revision 6df512f)},
  year      = {2026},
  publisher = {Hugging Face},
  doi       = {10.57967/hf/9302},
  url       = {https://huggingface.co/jjjardev/hilisenti-v1-model}
}

Dataset Reference:

@dataset{jarder2026hilisenti,
  author    = {Jessie James T. Jarder},
  title     = {HiliSenti: A Multi-Domain Sentiment Analysis Dataset for Hiligaynon},
  year      = {2026},
  publisher = {Hugging Face},
  doi       = {10.57967/hf/8737},
  url       = {https://huggingface.co/datasets/jjjardev/hilisenti-v1}
}

Licenses

Component	License
Model Weights	CC BY-NC-SA 4.0
Training Code	MIT
Dataset	CC BY-NC-SA 4.0

Related Resources

Dataset: jjjardev/hilisenti-v1
Code: github.com/jjjardev/hilisenti
Dataset DOI: 10.57967/hf/8737
Model DOI: 10.57967/hf/9302

Contact

Jessie James T. Jarder — jj.jarder.dev@gmail.com
