HiliSenti-v1-model

Model Description

HiliSenti-v1-model is a fine-tuned XLM‑RoBERTa‑large model (about 560M parameters) for ternary sentiment classification (Negative, Neutral, Positive) of Hiligaynon text. It was trained on the HiliSenti v1 dataset, the first publicly available multi-domain sentiment analysis dataset for Hiligaynon.

The model achieves 93.5% test accuracy and a macro F1 of 93.4%, with per-class F1 scores of 0.95 (Negative), 0.91 (Neutral), and 0.94 (Positive).

Intended Uses

  • Sentiment analysis of Hiligaynon customer feedback
  • Social media monitoring for Hiligaynon-speaking communities (Negros Occidental, Panay Island, Soccsksargen)
  • Educational and civic technology applications
  • Research on low-resource Philippine languages

Out-of-Scope Uses

  • Fine-grained emotion detection
  • Aspect-based sentiment analysis
  • High-stakes decision-making without human verification
  • Languages other than Hiligaynon (the model may still work to some extent, but is not optimized for them)

Training Data

The model was trained on the HiliSenti v1 dataset:

Split        Sentences   Negative   Neutral   Positive
Train        18,854      6,817      5,834     6,203
Validation   2,241       ~810       ~694      ~737
Test         2,242       828        633       781
Total        23,337      ~8,455     ~7,161    ~7,721

Data sources include:

  • Digicast Negros (news articles) – ~10,000 sentences
  • Facebook & Reddit (social media) – ~2,000 sentences
  • Cross-lingual translation (product reviews, student feedback) – ~5,600 sentences
  • Synthetic augmentation (edge cases: sarcasm, extreme sentiments) – ~5,600 sentences

Training Procedure

  • Base model: xlm-roberta-large (about 560M parameters)
  • Tokenizer: XLM‑RoBERTa‑large SentencePiece tokenizer with 43 custom Hiligaynon tokens added
  • Max sequence length: 128 tokens
  • Batch size: 16 per step (effective batch size 32 via 2 gradient accumulation steps)
  • Learning rate: 2e-5 with cosine schedule and 10% warm-up
  • Optimizer: AdamW (fused)
  • Epochs: 5 (early stopping with patience 3)
  • Label smoothing: 0.1
  • Class weighting: Balanced weights applied to cross-entropy loss
  • Mixed precision: FP16
  • Compute: Google Colab free tier (Tesla T4 GPU) with manual checkpoint pruning to fit 15 GB Google Drive storage
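Two of the settings above can be made concrete with a little arithmetic. The sketch below assumes that the "balanced" weights follow the common n_samples / (n_classes * class_count) heuristic and that the cosine schedule decays to zero after a linear warm-up; the card confirms neither detail. The step count is derived from the training split size and the effective batch size, not taken from training logs, and early stopping may have ended training sooner. The function names are illustrative.

```python
import math

# Class counts of the training split, taken from the Training Data table.
TRAIN_COUNTS = {"Negative": 6817, "Neutral": 5834, "Positive": 6203}


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def cosine_lr_with_warmup(step, total_steps, base_lr=2e-5, warmup_ratio=0.10):
    """Linear warm-up to base_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


weights = balanced_class_weights(TRAIN_COUNTS)

# Assumed step count: 18,854 sentences, effective batch size 32, 5 full epochs.
steps_per_epoch = math.ceil(sum(TRAIN_COUNTS.values()) / 32)
total_steps = steps_per_epoch * 5

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
print("total optimizer steps:", total_steps)
print("learning rate at end of warm-up:", cosine_lr_with_warmup(total_steps // 10, total_steps))
```

Under these assumptions the Neutral class, the smallest in the training split, receives the largest weight (about 1.08), and the learning rate reaches its 2e-5 peak once the first 10% of steps have passed.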

Evaluation Results

Metric              Negative   Neutral   Positive   Overall
Precision           0.95       0.93      0.93       -
Recall              0.95       0.90      0.95       -
F1-Score            0.95       0.91      0.94       -
Accuracy            -          -         -          93.5%
Macro F1            -          -         -          93.4%
Balanced Accuracy   -          -         -          93.3%

At 93.5% test accuracy, the model exceeds the original project target of 80% by 13.5 percentage points.
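The aggregate metrics follow from the per-class figures: F1 is the harmonic mean of precision and recall, macro F1 is the unweighted mean of the per-class F1 scores, and balanced accuracy is the unweighted mean of the per-class recalls. The sketch below recomputes them from the reported values. Those inputs are rounded to two decimals, so the recomputed macro F1 can differ from the reported 93.4% in the last digit. The variable and function names are illustrative.

```python
# Per-class precision and recall as reported in the results table (rounded).
REPORTED = {
    "Negative": {"precision": 0.95, "recall": 0.95},
    "Neutral": {"precision": 0.93, "recall": 0.90},
    "Positive": {"precision": 0.93, "recall": 0.95},
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {c: f1_score(m["precision"], m["recall"]) for c, m in REPORTED.items()}

# Macro F1: unweighted mean of the per-class F1 scores.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Balanced accuracy: unweighted mean of the per-class recalls.
balanced_accuracy = sum(m["recall"] for m in REPORTED.values()) / len(REPORTED)

for label, f1 in per_class_f1.items():
    print(f"{label} F1: {f1:.2f}")
print(f"Macro F1: {macro_f1:.3f}")
print(f"Balanced accuracy: {balanced_accuracy:.3f}")
```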

Limitations

  • Single annotator: Labels were assigned by a single native Hiligaynon speaker. A formal inter-annotator agreement study is planned for a future version.
  • Dialectal bias: The dataset is weighted toward Negros Occidental Hiligaynon; Panay and Soccsksargen varieties are under-represented.
  • Task scope: Only sentence-level ternary sentiment is supported.
  • Sarcasm: Despite synthetic augmentation, sarcasm detection remains imperfect.
  • No cross-lingual evaluation: Unlike HILIGAYNER, this model has not yet been evaluated on zero-shot transfer to Cebuano or Tagalog.

Ethical Considerations

  • All data was collected from publicly available sources (news websites, public social media posts, public Reddit communities) and existing publicly licensed datasets.
  • The dataset contains real names, locations, and descriptions of violence. A content warning is included in the dataset documentation.
  • Users are advised to apply their own anonymization or filtering pipelines if required by their institutional privacy regulations.

How to Use

Via Transformers

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained("jjjardev/hilisenti-v1-model")
tokenizer = AutoTokenizer.from_pretrained("jjjardev/hilisenti-v1-model")

# Example inference; truncate to the 128-token length used in training
sentence = "Sobrang sarap ng pagkain dito sa restaurant na ito."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():  # no gradients are needed for inference
    outputs = model(**inputs)
prediction = outputs.logits.argmax(dim=-1).item()

# Map the class index to its label (order assumed to match model.config.id2label)
labels = ["Negative", "Neutral", "Positive"]
print(labels[prediction])  # Positive

Using the Pipeline API

from transformers import pipeline

classifier = pipeline("text-classification", model="jjjardev/hilisenti-v1-model")
result = classifier("Napakabagal ng internet connection namin ngayon.")
print(result)  # [{'label': 'Negative', 'score': 0.93}]
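For several sentences at once, a batched call avoids running the model one sentence at a time. The sketch below is one possible way to do this, not part of the released code: `softmax` and `classify_batch` are hypothetical helper names, the label names are assumed to be stored in the checkpoint's `config.id2label`, and the 128-token limit mirrors the training setup listed above.

```python
import math


def softmax(logits):
    """Convert one row of raw logits into probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_batch(sentences, model, tokenizer):
    """Return a (label, confidence) pair for each sentence in the list."""
    # Imported lazily so softmax() stays usable without PyTorch installed.
    import torch

    # Pad to the longest sentence and truncate at the 128-token training length.
    inputs = tokenizer(sentences, return_tensors="pt", padding=True,
                       truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits

    results = []
    for row in logits.tolist():
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        # id2label is read from the checkpoint config (assumed to hold class names).
        results.append((model.config.id2label[best], probs[best]))
    return results
```

With `model` and `tokenizer` loaded as in the Transformers example, `classify_batch(["...", "..."], model, tokenizer)` returns one pair per input sentence. Loading the model once and reusing it across calls keeps repeated inference cheap.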

Citation

If you use this model or the HiliSenti dataset in your research, please cite:

@misc{jessie_james_jarder_2026,
  author    = {Jessie James Jarder},
  title     = {hilisenti-v1-model (Revision 6df512f)},
  year      = {2026},
  publisher = {Hugging Face},
  doi       = {10.57967/hf/9302},
  url       = {https://huggingface.co/jjjardev/hilisenti-v1-model}
}

Dataset Reference:

@dataset{jarder2026hilisenti,
  author    = {Jessie James T. Jarder},
  title     = {HiliSenti: A Multi-Domain Sentiment Analysis Dataset for Hiligaynon},
  year      = {2026},
  publisher = {Hugging Face},
  doi       = {10.57967/hf/8737},
  url       = {https://huggingface.co/datasets/jjjardev/hilisenti-v1}
}

Licenses

Component       License
Model Weights   CC BY-NC-SA 4.0
Training Code   MIT
Dataset         CC BY-NC-SA 4.0

Related Resources

Contact

Jessie James T. Jarder — jj.jarder.dev@gmail.com

