TsekTxt — XLM-RoBERTa Taglish Misinformation Classifier
This model is one of three transformer models fine-tuned and evaluated side-by-side as part of TsekTxt, a thesis and capstone project on detecting misinformation in Taglish (Tagalog-English code-switched) social media text.
Part of the TsekTxt model family:
| Model | Base Architecture | Hugging Face Repo |
|---|---|---|
| XLM-RoBERTa (this model) | Multilingual, 100 languages | chimsio/tsektxt-xlmr |
| RoBERTa-Tagalog | Filipino-specific pretraining | chimsio/tsektxt-roberta-tagalog |
| mBERT | Multilingual BERT baseline | chimsio/tsektxt-mbert |
Live application: TsekTxt web app — this model is served via a FastAPI backend and used to classify user-submitted Taglish text/screenshots as Suspicious or Not Suspicious.
Training pipeline / research repo: tsektxt-model-training — contains the full data preprocessing, training, and comparative evaluation code for all three models.
Model Details
Model Description
This model is a fine-tuned version of xlm-roberta-base for binary text classification, distinguishing Suspicious (potentially fake/misinformation) from Not Suspicious (credible) Taglish text. It is one of three models trained under identical conditions (same dataset, splits, hyperparameters) to compare how multilingual breadth (XLM-R, mBERT) versus language-specific depth (RoBERTa-Tagalog) affects misinformation detection in code-switched text.
- Developed by: Hans Jio Arca, as part of a capstone/thesis project
- Model type: Transformer encoder, sequence classification (2 labels)
- Language(s): Tagalog, English, and Taglish code-switched text
- License: CC-BY-NC-4.0 (academic and research use)
- Finetuned from model: xlm-roberta-base
Model Sources
- Training code: tsektxt-model-training
- Application using this model: tsektxt-app
- Sibling models: tsektxt-roberta-tagalog, tsektxt-mbert
Uses
Direct Use
Classifying short-to-medium Taglish social media text (posts, captions, forwarded messages) as Suspicious or Not Suspicious. Intended for use as the classification backend of the TsekTxt credibility-checking application, where predictions are paired with Integrated Gradients token attributions for explainability.
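As a rough illustration of what Integrated Gradients computes, the sketch below applies it to a toy logistic scorer in plain Python. It is not the application's implementation: the weights, input, and all-zero baseline are hypothetical, and in a transformer the same procedure would typically run over token embeddings, with the per-dimension attributions summed per token.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy differentiable scorer: a logistic unit over a feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def grad(x, w, b):
    """Analytic gradient of the toy model with respect to each input feature."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint-rule approximation of Integrated Gradients along the straight path."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, g in enumerate(grad(point, w, b)):
            total[i] += g
    # Scale the averaged gradient by the input's distance from the baseline.
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]

w, b = [1.5, -2.0, 0.5], 0.1   # hypothetical weights and bias
x = [1.0, 0.5, 2.0]            # hypothetical input features
baseline = [0.0, 0.0, 0.0]     # all-zero baseline (an assumption)

attributions = integrated_gradients(x, baseline, w, b)
print(attributions)
# Completeness check: the attributions should sum to model(x) - model(baseline).
print(sum(attributions), model(x, w, b) - model(baseline, w, b))
```

The completeness check at the end is a useful sanity test for any attribution pipeline: if the two printed numbers diverge, the number of interpolation steps is probably too small.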
Out-of-Scope Use
- Not intended as a sole/automated fact-checking authority — outputs should support human judgment, not replace it.
- Not evaluated on formal news articles, long-form documents, or languages/dialects outside Tagalog-English code-switching.
- Not intended for moderation decisions with legal or reputational consequences without human review.
Bias, Risks, and Limitations
The training data originates from existing Filipino fake-news datasets, which may reflect the topics, time periods, and political contexts in which they were collected (e.g. entertainment gossip, showbiz news, and political content are overrepresented relative to other domains). Performance on domains/topics underrepresented in training data (e.g. health misinformation, scientific claims) has not been separately validated. Users should treat predictions as a supporting signal, not a definitive verdict.
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "chimsio/tsektxt-xlmr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode: disables dropout

def predict(text):
    # Tokenize a single string, truncating to the 256-token length used in training.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    label = torch.argmax(probs, dim=-1).item()
    # NOTE: label 0 = Suspicious, label 1 = Not Suspicious (confirm against your dataset's convention)
    return {
        "label": "Not Suspicious" if label == 1 else "Suspicious",
        "confidence": round(probs[0][label].item() * 100, 2),  # percentage
    }

# Sample claim, roughly: "It has been proven that the vaccine causes a microchip in the body!"
print(predict("Napatunayan na ang bakuna ay nagdudulot ng microchip sa katawan!"))
Training Details
Training Data
The combined dataset contains ~25,400 labeled Taglish text samples drawn from the Fake News Filipino Dataset (Cruz, Tan & Cheng) and the Philippine Fake News Corpus (Fernandez), after deduplication and cleaning. Class distribution: ~64% Suspicious, ~36% Not Suspicious. The data is split 80/10/10 (train/val/test), stratified by label, with a fixed random seed (42) shared across all three models so that the comparison is fair.
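A stratified 80/10/10 split with a fixed seed can be sketched in plain Python as follows. This is only an illustration of the idea; the training repo's actual splitting code is not reproduced here and may well rely on a library helper. The function name `stratified_split` and the toy label list are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists that preserve label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)  # shuffle within each class, then slice
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test

# Toy labels mirroring the reported ~64/36 class balance (not the real dataset).
labels = ["Suspicious"] * 640 + ["Not Suspicious"] * 360
train_idx, val_idx, test_idx = stratified_split(labels)

for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(name, len(idx), dict(Counter(labels[i] for i in idx)))
```

Because the shuffle happens per class before slicing, each split keeps the same class ratio as the full dataset, and reusing the seed reproduces the same partition for every model being compared.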
Training Procedure
- Preprocessing: Deduplicated, rows with null values dropped, tokenized with the model's native tokenizer, max sequence length 256.
- Class imbalance handling: Weighted cross-entropy loss (weights inversely proportional to class frequency).
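To make the weighting concrete, here is a minimal pure-Python sketch of inverse-frequency class weights and a weighted cross-entropy. The exact formula used in training is not stated above; this sketch assumes the common "balanced" convention, n_samples / (n_classes * class_count), and normalises the loss by the sum of the target weights, which is how PyTorch's `CrossEntropyLoss` behaves with a `weight` argument and mean reduction. The counts and probabilities are illustrative only.

```python
import math

def inverse_frequency_weights(class_counts):
    """Weight each class by n_samples / (n_classes * count), so rarer classes weigh more."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: total / (n_classes * n) for c, n in class_counts.items()}

def weighted_cross_entropy(probs, targets, weights):
    """Weighted mean of -log p(target), normalised by the sum of the target weights."""
    num = sum(-weights[t] * math.log(p[t]) for p, t in zip(probs, targets))
    den = sum(weights[t] for t in targets)
    return num / den

# Illustrative counts following the reported ~64/36 ratio (not the real class counts).
counts = {"Suspicious": 64, "Not Suspicious": 36}
weights = inverse_frequency_weights(counts)

# Two hypothetical predictions, each a distribution over both classes.
probs = [
    {"Suspicious": 0.9, "Not Suspicious": 0.1},
    {"Suspicious": 0.3, "Not Suspicious": 0.7},
]
targets = ["Suspicious", "Not Suspicious"]
loss = weighted_cross_entropy(probs, targets, weights)
print(weights, loss)
```

With these weights, a mistake on the minority Not Suspicious class contributes more to the loss than an equally confident mistake on the majority class.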
Training Hyperparameters
- Learning rate: 2e-5
- Batch size: 16 (train), 32 (eval)
- Epochs: 4
- Weight decay: 0.01
- Optimizer/scheduler: Hugging Face Trainer defaults (AdamW)
- Hardware: NVIDIA T4 GPU (Google Colab)
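The hyperparameters listed above map onto keyword arguments of `transformers.TrainingArguments` roughly as sketched below. The `output_dir` value is a hypothetical path, and the step count is an approximation that assumes ~25,400 samples, an 80% training split, a single GPU, and no gradient accumulation.

```python
import math

# Keyword arguments named after transformers.TrainingArguments fields.
training_kwargs = {
    "output_dir": "tsektxt-xlmr-checkpoints",  # hypothetical path
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "num_train_epochs": 4,
    "weight_decay": 0.01,
}
# With transformers installed: args = TrainingArguments(**training_kwargs)

# Approximate number of optimizer steps under the assumptions stated above.
approx_train_samples = round(0.8 * 25_400)
steps_per_epoch = math.ceil(
    approx_train_samples / training_kwargs["per_device_train_batch_size"]
)
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]
print(approx_train_samples, steps_per_epoch, total_steps)
```

Under those assumptions the run comes to about 1,270 steps per epoch, or roughly 5,080 optimizer steps in total.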
Evaluation
Testing Data
Held-out stratified test split (2,540 samples), never seen during training or validation.
Results
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Not Suspicious | 0.98 | 0.93 | 0.95 | 909 |
| Suspicious | 0.96 | 0.99 | 0.97 | 1,631 |
| Accuracy | | | 0.97 | 2,540 |
| Macro avg | 0.97 | 0.96 | 0.96 | 2,540 |
| Weighted avg | 0.97 | 0.97 | 0.97 | 2,540 |
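The two averaged rows differ only in how the per-class values are combined: the macro average is an unweighted mean across classes, while the weighted average scales each class by its support. A small sketch using the rounded per-class figures from the table:

```python
# Per-class rows copied from the table: (precision, recall, f1, support).
rows = {
    "Not Suspicious": (0.98, 0.93, 0.95, 909),
    "Suspicious": (0.96, 0.99, 0.97, 1631),
}

def macro_avg(rows, col):
    """Unweighted mean of one metric column across classes."""
    return sum(r[col] for r in rows.values()) / len(rows)

def weighted_avg(rows, col):
    """Mean of one metric column, weighted by each class's support."""
    total = sum(r[3] for r in rows.values())
    return sum(r[col] * r[3] for r in rows.values()) / total

for name, col in [("precision", 0), ("recall", 1), ("f1", 2)]:
    print(f"{name}: macro={macro_avg(rows, col):.3f} weighted={weighted_avg(rows, col):.3f}")
```

Because the inputs here are already rounded to two decimals, a recomputed value can differ from the reported one in the last digit; the weighted F1, for instance, comes out near 0.96 from these rounded inputs. The weighted recall is mathematically the same quantity as overall accuracy.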
See the training repo's comparative analysis notebook for side-by-side results against RoBERTa-Tagalog and mBERT, including Integrated Gradients and SHAP attribution comparisons.
Environmental Impact
- Hardware Type: NVIDIA T4 GPU
- Hours used: ~1.3 hours
- Cloud Provider: Google (Colab)
- Compute Region: Unknown (Colab-assigned)
Technical Specifications
- Model Architecture: XLM-RoBERTa-base (270M parameters), sequence classification head (2 labels)
- Compute Infrastructure: Google Colab, single T4 GPU
- Software: transformers, datasets, accelerate, PyTorch
Citation
If referencing this work academically, cite the underlying architecture and datasets:
@misc{conneau2020xlmr,
title={Unsupervised Cross-lingual Representation Learning at Scale},
author={Conneau, Alexis and others},
year={2020}
}
Dataset citations: Cruz, Tan & Cheng (Fake News Filipino Dataset); Fernandez (Philippine Fake News Corpus).
Model Card Contact
Hans Jio Arca — https://github.com/hansjio