TsekTxt: RoBERTa-Tagalog Taglish Misinformation Classifier

This model is one of three transformer models fine-tuned and evaluated side-by-side as part of TsekTxt, a thesis and capstone project on detecting misinformation in Taglish (Tagalog-English code-switched) social media text.

Part of the TsekTxt model family:

Model                          Base Architecture               Hugging Face Repo
XLM-RoBERTa                    Multilingual, 100 languages     chimsio/tsektxt-xlmr
RoBERTa-Tagalog (this model)   Filipino-specific pretraining   chimsio/tsektxt-roberta-tagalog
mBERT                          Multilingual BERT baseline      chimsio/tsektxt-mbert

Live application: TsekTxt web app. This model family is served via a FastAPI backend and used to classify user-submitted Taglish text and screenshots as Suspicious or Not Suspicious.

Training pipeline / research repo: tsektxt-model-training, containing the full data preprocessing, training, and comparative evaluation code for all three models.


Model Details

Model Description

This model is a fine-tuned version of jcblaise/roberta-tagalog-base for binary text classification, distinguishing Suspicious (potentially fake/misinformation) from Not Suspicious (credible) Taglish text. It represents the "language-specific depth" arm of a three-way comparison against multilingual models (XLM-RoBERTa, mBERT), testing whether Filipino-focused pretraining improves detection of code-switched misinformation compared to broader multilingual pretraining.

  • Developed by: Hans Jio Arca, as part of a capstone/thesis project
  • Model type: Transformer encoder, sequence classification (2 labels)
  • Language(s): Tagalog, English, and Taglish code-switched text
  • License: CC-BY-NC-4.0 (academic/research use)
  • Finetuned from model: jcblaise/roberta-tagalog-base

Model Sources


Uses

Direct Use

Classifying short-to-medium Taglish social media text as Suspicious or Not Suspicious, as part of the TsekTxt credibility-checking pipeline, alongside Integrated Gradients token attributions for explainability.
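For readers unfamiliar with Integrated Gradients, the following is a minimal toy sketch of the idea, not the TsekTxt attribution code. It assumes a hypothetical three-feature logistic model with made-up weights; in a transformer setting the same path integral would typically be taken over token embeddings rather than raw features. The sketch checks the completeness property, namely that the attributions sum to the difference between the model output at the input and at the baseline.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy classifier: probability of the positive class for feature vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def gradient(x, w, b):
    """Analytic gradient of the toy model output with respect to each input feature."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate IG with a midpoint Riemann sum along the straight path baseline -> x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grads = gradient(point, w, b)
        total = [t + g for t, g in zip(total, grads)]
    # Scale the averaged gradient by the input-minus-baseline difference
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]

# Hypothetical weights, input, and all-zero baseline (illustration only)
w = [1.5, -2.0, 0.5]
b = 0.1
x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, b)
gap = model(x, w, b) - model(baseline, w, b)
print("attributions:", attributions)
print("sum of attributions:", sum(attributions), "output gap:", gap)
```

A feature with a negative weight and a positive value receives a negative attribution here, which is the behaviour one would expect token-level attributions to show for tokens that push a prediction away from a class.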

Out-of-Scope Use

  • Not intended as a sole/automated fact-checking authority.
  • Not evaluated on formal news articles, long-form documents, or non-Filipino contexts.
  • Not intended for moderation decisions with legal or reputational consequences without human review.

Bias, Risks, and Limitations

This model was trained on the same data as its sibling models (see Training Data), so it shares the same domain skew toward entertainment and political content present in the source datasets. Because it is a Filipino-specific model, its performance on heavily English-dominant code-switched text has not been isolated separately in evaluation; see the comparative analysis notebook for how this model's errors differ from those of the multilingual models on English-heavy inputs.


How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "chimsio/tsektxt-roberta-tagalog"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def predict(text):
    # Tokenize a single string, truncating to the 256-token length used in training
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
    # Softmax over the two class logits; [0] selects the only example in the batch
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    label = torch.argmax(probs).item()
    # NOTE: confirm label convention (0 = Suspicious, 1 = Not Suspicious) against your dataset
    return {
        "label": "Not Suspicious" if label == 1 else "Suspicious",
        "confidence": round(probs[label].item() * 100, 2),  # percentage, 0-100
    }

print(predict("Napatunayan na ang bakuna ay nagdudulot ng microchip sa katawan!"))

Training Details

Training Data

This model uses the same dataset, cleaning steps, and stratified 80/10/10 split (fixed seed 42) as the sibling XLM-RoBERTa and mBERT models, which keeps the three-way comparison fair. The combined dataset contains ~25,400 labeled Taglish samples drawn from the Fake News Filipino Dataset (Cruz, Tan & Cheng) and the Philippine Fake News Corpus (Fernandez).
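The cleaning and splitting steps can be illustrated with a small standard-library sketch. This is not the code from the training repo: the function names, the (text, label) row layout, and the synthetic data are assumptions made for illustration, and the actual pipeline may rely on a library splitter instead.

```python
import random
from collections import defaultdict

def clean(rows):
    """Drop rows with a missing text or label, then drop exact duplicate texts."""
    seen, kept = set(), []
    for text, label in rows:
        if text is None or label is None or not text.strip():
            continue  # null-drop
        key = text.strip()
        if key in seen:
            continue  # deduplicate, keeping the first occurrence
        seen.add(key)
        kept.append((key, label))
    return kept

def stratified_split(rows, percents=(80, 10, 10), seed=42):
    """Split each label group separately so all three splits keep the class ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = len(group) * percents[0] // 100
        n_val = len(group) * percents[1] // 100
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test split
    return train, val, test

# Synthetic stand-in data: 120 rows labelled 1, 80 labelled 0,
# plus one duplicate and one null row that clean() should remove
raw = [(f"sample {i}", 1 if i % 5 < 3 else 0) for i in range(200)]
raw += [("sample 0", 1), (None, 0)]

train, val, test = stratified_split(clean(raw))
print(len(train), len(val), len(test))
```

Fixing the seed makes the shuffle, and therefore the split, reproducible across runs, which is what allows all three models to be evaluated on the same held-out samples.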

Training Procedure

  • Preprocessing: Deduplicated, null-dropped, tokenized with roberta-tagalog-base's native tokenizer, max sequence length 256.
  • Class imbalance handling: Weighted cross-entropy loss.
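The card does not state how the class weights were derived, so the sketch below assumes the common "balanced" heuristic (total samples divided by the number of classes times the class count). The label counts reuse the test-split supports purely as illustrative proportions, and the 0 = Suspicious, 1 = Not Suspicious mapping follows the convention noted in the quick-start snippet. The loss is normalised by the sum of the per-sample weights, which is how torch.nn.CrossEntropyLoss behaves with a weight tensor and mean reduction.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c); an assumed heuristic, not the card's."""
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (len(counts) * k) for c, k in counts.items()}

def weighted_cross_entropy(logits_batch, labels, weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    total, weight_sum = 0.0, 0.0
    for logits, y in zip(logits_batch, labels):
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        nll = log_z - logits[y]
        total += weights[y] * nll
        weight_sum += weights[y]
    return total / weight_sum

# Illustrative label counts (assumed): 1,631 Suspicious (0) and 909 Not Suspicious (1)
labels = [0] * 1631 + [1] * 909
weights = balanced_class_weights(labels)
print(weights)

# Two hypothetical predictions, one per class, both confidently wrong
logits_batch = [[-2.0, 2.0], [2.0, -2.0]]
print(weighted_cross_entropy(logits_batch, [0, 1], weights))
```

Under this heuristic the minority class receives the larger weight, so mistakes on Not Suspicious samples contribute more to the loss than mistakes on Suspicious samples.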

Training Hyperparameters

  • Learning rate: 2e-5
  • Batch size: 16 (train), 32 (eval)
  • Epochs: 4
  • Weight decay: 0.01
  • Hardware: NVIDIA T4 GPU (Google Colab)
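As a rough worked example, the hyperparameters above imply the following number of optimizer steps. The key names are illustrative rather than taken from the training repo, the dataset size is the approximate ~25,400 figure, and the calculation assumes a single GPU with no gradient accumulation, which the card does not confirm.

```python
import math

# Values copied from the hyperparameter list; key names are illustrative
config = {
    "learning_rate": 2e-5,
    "train_batch_size": 16,
    "eval_batch_size": 32,
    "num_epochs": 4,
    "weight_decay": 0.01,
    "max_length": 256,   # from the preprocessing step
    "split_seed": 42,    # from the stratified split
}

approx_total_samples = 25_400                    # approximate size of the combined dataset
train_size = approx_total_samples * 80 // 100    # 80% training split
steps_per_epoch = math.ceil(train_size / config["train_batch_size"])
total_steps = steps_per_epoch * config["num_epochs"]

print(f"train samples ~{train_size}, steps/epoch ~{steps_per_epoch}, total steps ~{total_steps}")
```

Because the dataset size is approximate, the step counts are estimates; they are mainly useful for sizing a learning-rate schedule or warmup period.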

Evaluation

Testing Data

Held-out stratified test split (2,540 samples), identical to the split used for sibling models.

Results

Class            Precision   Recall   F1-score   Support
Not Suspicious   0.95        0.96     0.96       909
Suspicious       0.98        0.97     0.98       1,631
Accuracy                              0.97       2,540
Macro avg        0.96        0.97     0.97       2,540
Weighted avg     0.97        0.97     0.97       2,540
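The summary rows can be sanity-checked from the per-class rows. The sketch below recomputes the macro and weighted averages from the rounded per-class values, so it can only agree with the reported figures to within rounding; the dictionary layout is an illustrative choice.

```python
# Per-class metrics as reported in the results table
rows = {
    "Not Suspicious": {"precision": 0.95, "recall": 0.96, "f1": 0.96, "support": 909},
    "Suspicious": {"precision": 0.98, "recall": 0.97, "f1": 0.98, "support": 1631},
}
total = sum(r["support"] for r in rows.values())

def macro(metric):
    """Unweighted mean over classes."""
    return sum(r[metric] for r in rows.values()) / len(rows)

def weighted(metric):
    """Mean over classes, weighted by each class's support."""
    return sum(r[metric] * r["support"] for r in rows.values()) / total

for metric in ("precision", "recall", "f1"):
    print(f"{metric}: macro={macro(metric):.3f} weighted={weighted(metric):.3f}")
```

Support-weighted recall equals overall accuracy by definition, so the recomputed value should land close to the reported 0.97. The macro precision and recall both fall on the rounding boundary when computed from rounded inputs, which would be consistent with the table having been averaged from unrounded per-class values.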

See the training repo's comparative analysis notebook for the full three-model comparison, including whether Filipino-specific pretraining outperforms multilingual pretraining on this task.


Environmental Impact

  • Hardware Type: NVIDIA T4 GPU
  • Hours used: ~1 hour
  • Cloud Provider: Google (Colab)
  • Compute Region: Unknown (Colab-assigned)

Technical Specifications

  • Model Architecture: RoBERTa-Tagalog-base, sequence classification head (2 labels)
  • Compute Infrastructure: Google Colab, single T4 GPU
  • Software: transformers, datasets, accelerate, PyTorch

Citation

@misc{cruz2021robertatagalog,
  title={RoBERTa Tagalog: Pretrained Language Model for Filipino},
  author={Cruz, Jan Christian Blaise and others},
}

Dataset citations: Cruz, Tan & Cheng (Fake News Filipino Dataset); Fernandez (Philippine Fake News Corpus).


Model Card Contact

Hans Jio Arca (https://github.com/hansjio)
