TsekTxt — XLM-RoBERTa Taglish Misinformation Classifier

This model is one of three transformer models fine-tuned and evaluated side-by-side as part of TsekTxt, a thesis and capstone project on detecting misinformation in Taglish (Tagalog-English code-switched) social media text.

Part of the TsekTxt model family:

Model	Base Architecture	Hugging Face Repo
XLM-RoBERTa (this model)	Multilingual, 100 languages	`chimsio/tsektxt-xlmr`
RoBERTa-Tagalog	Filipino-specific pretraining	`chimsio/tsektxt-roberta-tagalog`
mBERT	Multilingual BERT baseline	`chimsio/tsektxt-mbert`

Live application: TsekTxt web app — this model is served via a FastAPI backend and used to classify user-submitted Taglish text/screenshots as Suspicious or Not Suspicious.

Training pipeline / research repo: tsektxt-model-training — contains the full data preprocessing, training, and comparative evaluation code for all three models.

Model Details

Model Description

This model is a fine-tuned version of xlm-roberta-base for binary text classification, distinguishing Suspicious (potentially fake/misinformation) from Not Suspicious (credible) Taglish text. It is one of three models trained under identical conditions (same dataset, splits, hyperparameters) to compare how multilingual breadth (XLM-R, mBERT) versus language-specific depth (RoBERTa-Tagalog) affects misinformation detection in code-switched text.

Developed by: Hans Jio Arca, as part of a capstone/thesis project
Model type: Transformer encoder, sequence classification (2 labels)
Language(s): Tagalog, English, and Taglish code-switched text
License: CC-BY-NC-4.0 (academic/research use)
Finetuned from model: xlm-roberta-base

Model Sources

Training code: tsektxt-model-training
Application using this model: tsektxt-app
Sibling models: tsektxt-roberta-tagalog, tsektxt-mbert

Uses

Direct Use

Classifying short-to-medium Taglish social media text (posts, captions, forwarded messages) as Suspicious or Not Suspicious. Intended for use as the classification backend of the TsekTxt credibility-checking application, where predictions are paired with Integrated Gradients token attributions for explainability.
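The Integrated Gradients attributions mentioned above can be illustrated on a toy model. The following is a minimal sketch, not the application's implementation: `toy_score`, its weights, and the all-zeros baseline are hypothetical stand-ins for the fine-tuned classifier and its token embeddings. It approximates the path integral with a midpoint Riemann sum and checks the completeness property, namely that the attributions sum to the difference between the score at the input and the score at the baseline.

```python
import math

# Hypothetical toy model: a logistic scorer over three input features.
# The weights are invented for illustration and are unrelated to TsekTxt.
WEIGHTS = [1.5, -2.0, 0.5]

def toy_score(x):
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))

def toy_gradient(x):
    # Analytic gradient of the sigmoid score: s * (1 - s) * w_i.
    s = toy_score(x)
    return [s * (1.0 - s) * w for w in WEIGHTS]

def integrated_gradients(x, baseline, steps=200):
    # Average the gradient at midpoints of a straight path from the
    # baseline to the input, then scale by (input - baseline).
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, toy_gradient(point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

# Completeness: attributions should sum to score(x) - score(baseline).
gap = abs(sum(attributions) - (toy_score(x) - toy_score(baseline)))
print(attributions, gap)
```

For a transformer, the same idea is usually applied to the embedding layer, with per-token attributions obtained by summing over the embedding dimension. Libraries such as Captum offer this for PyTorch models; whether TsekTxt uses one is not stated here.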

Out-of-Scope Use

Not intended as a sole/automated fact-checking authority — outputs should support human judgment, not replace it.
Not evaluated on formal news articles, long-form documents, or languages/dialects outside Tagalog-English code-switching.
Not intended for moderation decisions with legal or reputational consequences without human review.

Bias, Risks, and Limitations

The training data originates from existing Filipino fake-news datasets, which may reflect the topics, time periods, and political contexts in which they were collected (e.g. entertainment gossip, showbiz news, and political content are overrepresented relative to other domains). Performance on domains/topics underrepresented in training data (e.g. health misinformation, scientific claims) has not been separately validated. Users should treat predictions as a supporting signal, not a definitive verdict.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "chimsio/tsektxt-xlmr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def predict(text):
    # Tokenize a single string, truncating to the 256-token training length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax over the two classes; [0] selects the only item in the batch.
    probs = torch.softmax(logits, dim=-1)[0]
    label = int(torch.argmax(probs))
    # NOTE: label 0 = Suspicious, label 1 = Not Suspicious (confirm against your dataset's convention)
    return {
        "label": "Not Suspicious" if label == 1 else "Suspicious",
        "confidence": round(probs[label].item() * 100, 2),  # percent
    }

print(predict("Napatunayan na ang bakuna ay nagdudulot ng microchip sa katawan!"))

Training Details

Training Data

The combined dataset contains ~25,400 labeled Taglish text samples drawn from the Fake News Filipino Dataset (Cruz, Tan & Cheng) and the Philippine Fake News Corpus (Fernandez), after deduplication and cleaning. Class distribution is ~64% Suspicious and ~36% Not Suspicious. The data is split 80/10/10 (train/validation/test), stratified by label, with a fixed random seed (42) shared across all three models so that the comparison is fair.
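To make the stratified 80/10/10 split concrete, here is a minimal standard-library sketch. It is an illustration, not the training repo's code: the function name `stratified_split` and the synthetic label list are hypothetical, and the real pipeline may well use a library routine such as scikit-learn's `train_test_split` with `stratify=`. The label convention (0 = Suspicious) follows the note in the usage snippet and should be confirmed against the dataset.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists that preserve label proportions."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test

# Synthetic labels mimicking the ~64% / ~36% class balance.
labels = [0] * 640 + [1] * 360
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Because each class is split separately, every partition keeps roughly the same Suspicious/Not Suspicious ratio as the full dataset, which matters when the classes are imbalanced.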

Training Procedure

Preprocessing: Duplicates and null rows removed; text tokenized with the model's native tokenizer at a maximum sequence length of 256.
Class imbalance handling: Weighted cross-entropy loss (weights inversely proportional to class frequency).
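As an illustration of weights inversely proportional to class frequency, the sketch below uses the common "balanced" heuristic, weight = N / (K × n_c), where N is the number of samples, K the number of classes, and n_c the count for class c. This normalization is an assumption; the exact formula used in training is not stated here, and the helper name `inverse_frequency_weights` is hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Map each class to N / (K * n_c), so rarer classes get larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in sorted(counts.items())}

# Synthetic labels with the ~64% / ~36% balance (0 = Suspicious, assumed).
labels = [0] * 64 + [1] * 36
weights = inverse_frequency_weights(labels)
print(weights)
```

In PyTorch, weights like these can be passed as a tensor to the `weight` argument of `torch.nn.CrossEntropyLoss`, so that errors on the minority class contribute more to the loss.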

Training Hyperparameters

Learning rate: 2e-5
Batch size: 16 (train), 32 (eval)
Epochs: 4
Weight decay: 0.01
Optimizer/scheduler: Hugging Face Trainer defaults (AdamW optimizer, linear learning-rate schedule)
Hardware: NVIDIA T4 GPU (Google Colab)
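The hyperparameters above map directly onto `transformers.TrainingArguments` parameter names. The sketch below collects them in a plain dictionary and works out the approximate number of optimizer steps they imply. The `output_dir` value is a hypothetical placeholder, and the step count assumes a single device with no gradient accumulation and a training split of 80% of ~25,400 samples.

```python
import math

# Values from the list above, keyed by TrainingArguments parameter names.
training_kwargs = {
    "output_dir": "tsektxt-xlmr-run",  # hypothetical path, not from this card
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "num_train_epochs": 4,
    "weight_decay": 0.01,
}

# Approximate training-split size: 80% of ~25,400 samples.
n_train = round(0.8 * 25_400)

# One optimizer step per batch (assumes no gradient accumulation).
steps_per_epoch = math.ceil(n_train / training_kwargs["per_device_train_batch_size"])
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]
print(steps_per_epoch, total_steps)
```

With `transformers` installed, the dictionary can be unpacked as `TrainingArguments(**training_kwargs)`. Under these assumptions the run comes to roughly 1,270 steps per epoch, or about 5,080 in total.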

Evaluation

Testing Data

Held-out stratified test split (2,540 samples), never seen during training or validation.

Results

Class	Precision	Recall	F1-score	Support
Not Suspicious	0.98	0.93	0.95	909
Suspicious	0.96	0.99	0.97	1,631
Accuracy			0.97	2,540
Macro avg	0.97	0.96	0.96	2,540
Weighted avg	0.97	0.97	0.97	2,540
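The table's figures can be cross-checked against each other. The sketch below recomputes each class's F1-score as the harmonic mean of its precision and recall, and recovers overall accuracy as the support-weighted average of per-class recall. It works from the rounded values in the table, so results agree only to about two decimal places.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-class rows copied from the results table (rounded values).
rows = {
    "Not Suspicious": {"precision": 0.98, "recall": 0.93, "support": 909},
    "Suspicious": {"precision": 0.96, "recall": 0.99, "support": 1631},
}

f1 = {name: f1_score(r["precision"], r["recall"]) for name, r in rows.items()}
total = sum(r["support"] for r in rows.values())

# recall * support estimates the correctly classified samples per class,
# so the support-weighted recall equals overall accuracy.
accuracy = sum(r["recall"] * r["support"] for r in rows.values()) / total

macro_recall = sum(r["recall"] for r in rows.values()) / len(rows)

for name, value in f1.items():
    print(f"{name}: F1 = {value:.2f}")
print(f"accuracy = {accuracy:.2f}, macro recall = {macro_recall:.2f}, n = {total}")
```

The recall of 0.93 on Not Suspicious means about 7% of credible posts in the test split are flagged as Suspicious, which is the main error mode to keep in mind when showing predictions to users.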

See the training repo's comparative analysis notebook for side-by-side results against RoBERTa-Tagalog and mBERT, including Integrated Gradients and SHAP attribution comparisons.

Environmental Impact

Hardware Type: NVIDIA T4 GPU
Hours used: ~1.3 hours
Cloud Provider: Google (Colab)
Compute Region: Unknown (Colab-assigned)

Technical Specifications

Model Architecture: XLM-RoBERTa-base (270M parameters), sequence classification head (2 labels)
Compute Infrastructure: Google Colab, single T4 GPU
Software: transformers, datasets, accelerate, PyTorch

Citation

If referencing this work academically, cite the underlying architecture and datasets:

@misc{conneau2020xlmr,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and others},
  year={2020}
}

Dataset citations: Cruz, Tan & Cheng (Fake News Filipino Dataset); Fernandez (Philippine Fake News Corpus).

Model Card Contact

Hans Jio Arca — https://github.com/hansjio
