TsekTxt — XLM-RoBERTa Taglish Misinformation Classifier

This model is one of three transformer models fine-tuned and evaluated side-by-side as part of TsekTxt, a thesis and capstone project on detecting misinformation in Taglish (Tagalog-English code-switched) social media text.

Part of the TsekTxt model family:

Model | Base Architecture | Hugging Face Repo
XLM-RoBERTa (this model) | Multilingual, 100 languages | chimsio/tsektxt-xlmr
RoBERTa-Tagalog | Filipino-specific pretraining | chimsio/tsektxt-roberta-tagalog
mBERT | Multilingual BERT baseline | chimsio/tsektxt-mbert

Live application: TsekTxt web app — this model is served via a FastAPI backend and used to classify user-submitted Taglish text/screenshots as Suspicious or Not Suspicious.

Training pipeline / research repo: tsektxt-model-training — contains the full data preprocessing, training, and comparative evaluation code for all three models.


Model Details

Model Description

This model is a fine-tuned version of xlm-roberta-base for binary text classification, distinguishing Suspicious (potentially fake/misinformation) from Not Suspicious (credible) Taglish text. It is one of three models trained under identical conditions (same dataset, splits, hyperparameters) to compare how multilingual breadth (XLM-R, mBERT) versus language-specific depth (RoBERTa-Tagalog) affects misinformation detection in code-switched text.

  • Developed by: Hans Jio Arca, as part of a capstone/thesis project
  • Model type: Transformer encoder, sequence classification (2 labels)
  • Language(s): Tagalog, English, and Taglish code-switched text
  • License: CC-BY-NC-4.0 (academic/research use)
  • Finetuned from model: xlm-roberta-base

Model Sources


Uses

Direct Use

Classifying short-to-medium Taglish social media text (posts, captions, forwarded messages) as Suspicious or Not Suspicious. Intended for use as the classification backend of the TsekTxt credibility-checking application, where predictions are paired with Integrated Gradients token attributions for explainability.
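To make the attribution idea concrete, here is a minimal sketch of Integrated Gradients on a toy logistic model, in plain Python. It is not the application's implementation, which is not shown in this card. The weights, inputs, and all-zeros baseline are hypothetical, and a real setup would attribute over the transformer's token embeddings rather than three scalar features.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy stand-in for the classifier: probability from a single logistic unit."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def gradient(x, w, b):
    """Analytic gradient of the toy model's output with respect to each input."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate IG with a midpoint Riemann sum along the baseline -> x path."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w, b)):
            totals[i] += g
    # Scale the averaged gradient by the input's distance from the baseline.
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, totals)]

w, b = [1.5, -2.0, 0.5], 0.1        # hypothetical model weights and bias
x = [1.0, 0.5, 2.0]                 # hypothetical per-"token" features
baseline = [0.0, 0.0, 0.0]          # assumed all-zeros reference input

attributions = integrated_gradients(x, baseline, w, b)
# Completeness check: attributions should sum to roughly F(x) - F(baseline).
print(sum(attributions), model(x, w, b) - model(baseline, w, b))
```

Positive attributions push the score toward the predicted class and negative ones push against it. A per-token highlight relies on that same property.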

Out-of-Scope Use

  • Not intended as a sole/automated fact-checking authority — outputs should support human judgment, not replace it.
  • Not evaluated on formal news articles, long-form documents, or languages/dialects outside Tagalog-English code-switching.
  • Not intended for moderation decisions with legal or reputational consequences without human review.

Bias, Risks, and Limitations

The training data originates from existing Filipino fake-news datasets, which may reflect the topics, time periods, and political contexts in which they were collected (e.g. entertainment gossip, showbiz news, and political content are overrepresented relative to other domains). Performance on domains/topics underrepresented in training data (e.g. health misinformation, scientific claims) has not been separately validated. Users should treat predictions as a supporting signal, not a definitive verdict.


How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "chimsio/tsektxt-xlmr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    label = torch.argmax(probs, dim=-1).item()  # argmax over the class dimension
    # NOTE: label 0 = Suspicious, label 1 = Not Suspicious (confirm against your dataset's convention)
    return {
        "label": "Not Suspicious" if label == 1 else "Suspicious",
        "confidence": round(probs[0][label].item() * 100, 2)
    }

print(predict("Napatunayan na ang bakuna ay nagdudulot ng microchip sa katawan!"))

Training Details

Training Data

Combined dataset of ~25,400 labeled Taglish text samples, drawn from the Fake News Filipino Dataset (Cruz, Tan & Cheng) and the Philippine Fake News Corpus (Fernandez), after deduplication and cleaning. Class distribution: ~64% "Suspicious," ~36% "Not Suspicious." The data was split 80/10/10 (train/val/test), stratified by label, with a fixed random seed (42) shared across all three models so that the comparison is fair.
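A stratified 80/10/10 split with a fixed seed can be sketched in plain Python as follows. This is an illustration of the idea, not the project's splitting code (that lives in the training repo). The function name `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists that preserve label proportions."""
    rng = random.Random(seed)            # fixed seed -> same split on every run
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_label):       # sorted for a deterministic class order
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * ratios[0])
        n_val = round(len(idxs) * ratios[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]   # remainder goes to the test split
    return train, val, test

# Toy labels mirroring the ~64/36 class ratio (0 = Suspicious, 1 = Not Suspicious).
toy_labels = [0] * 640 + [1] * 360
train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```

Because each class is split separately, every subset keeps the same class ratio as the full toy set.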

Training Procedure

  • Preprocessing: Deduplicated, null-dropped, tokenized with the model's native tokenizer, max sequence length 256.
  • Class imbalance handling: Weighted cross-entropy loss (weights inversely proportional to class frequency).
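The class weighting described above can be sketched as below. The card only states that weights are inversely proportional to class frequency, so the exact normalization used here, `n_samples / (n_classes * class_count)`, is an assumption, and the helper name and toy labels are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels, num_classes=2):
    """Weight each class by n_samples / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return [total / (num_classes * counts[c]) for c in range(num_classes)]

# Toy labels with the ~64/36 ratio (0 = Suspicious, 1 = Not Suspicious).
toy_labels = [0] * 64 + [1] * 36
weights = inverse_frequency_weights(toy_labels)
print([round(w, 3) for w in weights])  # [0.781, 1.389]
```

The minority class gets the larger weight, so its errors contribute more to the loss. In PyTorch, a list like this can be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.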

Training Hyperparameters

  • Learning rate: 2e-5
  • Batch size: 16 (train), 32 (eval)
  • Epochs: 4
  • Weight decay: 0.01
  • Optimizer/scheduler: Hugging Face Trainer defaults (AdamW with a linear learning-rate schedule)
  • Hardware: NVIDIA T4 GPU (Google Colab)
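As a rough sketch, the listed values map onto Hugging Face `TrainingArguments` as shown below. Only the five values from the list come from this card. The `output_dir` value is a hypothetical placeholder, all other settings are left at library defaults, and the weighted loss is not covered here.

```python
# Values copied from the hyperparameter list; everything else stays at Trainer defaults.
HYPERPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "num_train_epochs": 4,
    "weight_decay": 0.01,
}

def build_training_args(output_dir="tsektxt-xlmr-run"):
    """Build TrainingArguments; output_dir is a hypothetical placeholder path."""
    # Imported lazily so the dict above can be inspected without transformers installed.
    from transformers import TrainingArguments
    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```

Keeping the values in one dictionary makes it straightforward to reuse identical settings across the three models being compared.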

Evaluation

Testing Data

Held-out stratified test split (2,540 samples), never seen during training or validation.

Results

Class | Precision | Recall | F1-score | Support
Not Suspicious | 0.98 | 0.93 | 0.95 | 909
Suspicious | 0.96 | 0.99 | 0.97 | 1,631
Accuracy | - | - | 0.97 | 2,540
Macro avg | 0.97 | 0.96 | 0.96 | 2,540
Weighted avg | 0.97 | 0.97 | 0.97 | 2,540
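For readers unfamiliar with how the columns in a report like this are derived, the sketch below computes per-class precision, recall, F1, accuracy, and macro-averaged F1 from label lists. The function name is hypothetical and the labels are a toy example, not the model's actual predictions.

```python
def classification_metrics(y_true, y_pred, labels=(0, 1)):
    """Per-class precision/recall/F1 plus accuracy and macro-averaged F1."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[c] = {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}
    report["accuracy"] = sum(t == p for t, p in pairs) / len(pairs)
    # Macro average: unweighted mean over classes, regardless of support.
    report["macro_f1"] = sum(report[c]["f1"] for c in labels) / len(labels)
    return report

# Toy data only: 0 = Suspicious, 1 = Not Suspicious.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 0, 0]
report = classification_metrics(y_true, y_pred)
print(report)
```

The macro average treats both classes equally, while the weighted average scales each class by its support. That is why the two rows can differ on an imbalanced test set.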

See the training repo's comparative analysis notebook for side-by-side results against RoBERTa-Tagalog and mBERT, including Integrated Gradients and SHAP attribution comparisons.


Environmental Impact

  • Hardware Type: NVIDIA T4 GPU
  • Hours used: ~1.3 hours
  • Cloud Provider: Google (Colab)
  • Compute Region: Unknown (Colab-assigned)

Technical Specifications

  • Model Architecture: XLM-RoBERTa-base (270M parameters), sequence classification head (2 labels)
  • Compute Infrastructure: Google Colab, single T4 GPU
  • Software: transformers, datasets, accelerate, PyTorch

Citation

If referencing this work academically, cite the underlying architecture and datasets:

@misc{conneau2020xlmr,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and others},
  year={2020}
}

Dataset citations: Cruz, Tan & Cheng (Fake News Filipino Dataset); Fernandez (Philippine Fake News Corpus).


Model Card Contact

Hans Jio Arca — https://github.com/hansjio
