Fine-tuned bert-base-cased model for claims checkworthiness binary classification
The task is formulated as binary classification: deciding whether a claim (text) is worth fact-checking.
This model is a fine-tuned version of the BERT base cased model. The model was fine-tuned on the ClaimBuster dataset (http://doi.org/10.5281/zenodo.3609356). For training we used only the labels 0 and 1, which correspond to the No and Yes decisions, respectively, on whether the claim is considered check-worthy.
After the evaluation, the model was trained on the full dataset.
Usage
BertForSequenceClassification
from transformers import BertTokenizer, BertForSequenceClassification
# load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = BertForSequenceClassification.from_pretrained("yevhenkost/claimbuster-yesno-binary-bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("yevhenkost/claimbuster-yesno-binary-bert-base-cased")
text_inputs = ["The water is wet"]
# tokenize the texts into PyTorch tensors
model_inputs = tokenizer(text_inputs, return_tensors="pt")
# the forward pass returns a regular SequenceClassifierOutput
model_output = model(**model_inputs)
# model_output.logits: tensor([[-0.2657, 0.0749]])
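The logits in the example output above are unnormalized scores. Below is a minimal sketch of turning them into probabilities and a decision, using only the standard library so it runs without downloading the model. It assumes that index 0 corresponds to No and index 1 to Yes, mirroring the 0/1 labels used for training; `LABELS` and `softmax` are illustrative names, not part of the model's API.

```python
import math

# Assumption: index 0 = "No", index 1 = "Yes" (check-worthy), mirroring the 0/1 training labels.
LABELS = ("No", "Yes")

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the example output above
logits = [-0.2657, 0.0749]
probs = softmax(logits)
predicted = LABELS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

When working with the tensors directly, `torch.softmax(model_output.logits, dim=-1)` performs the same normalization.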
Pipeline
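A minimal sketch of running the same inference through the `pipeline` helper from transformers. The label-name format (`LABEL_0`, `LABEL_1`) is the library default when a model config defines no custom label names and is an assumption here, as is the mapping of index 1 to Yes. `build_classifier`, `to_decision`, and `classify` are hypothetical helper names.

```python
MODEL_ID = "yevhenkost/claimbuster-yesno-binary-bert-base-cased"

def build_classifier():
    """Create a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the helpers below carry no heavy dependency
    from transformers import pipeline
    return pipeline("text-classification", model=MODEL_ID)

def to_decision(result):
    """Map one pipeline result to "Yes" or "No".

    Assumption: labels look like "LABEL_0"/"LABEL_1" and index 1 means check-worthy.
    """
    index = int(result["label"].rsplit("_", 1)[-1])
    return "Yes" if index == 1 else "No"

def classify(texts, classifier=None):
    """Return a (decision, score) pair for each input text."""
    if classifier is None:
        classifier = build_classifier()
    return [(to_decision(r), r["score"]) for r in classifier(texts)]
```

Calling `classify(["The water is wet"])` downloads the model on first use and returns a list of `(decision, score)` pairs. Passing an existing pipeline as `classifier` avoids reloading the model on every call.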
Training Process
Data Preparation
The files were downloaded from the ClaimBuster URL given above. The dataset was prepared as follows:
import pandas as pd
from sklearn.model_selection import train_test_split
# read data
gt_df = pd.read_csv("groundtruth.csv")
cs_df = pd.read_csv("crowdsourced.csv")
# concatenate both files and keep only the rows labelled 0 or 1
total_df = pd.concat([cs_df, gt_df])
total_df = total_df[total_df["Verdict"].isin([0, 1])]
# split into train and test sets
train_df, test_df = train_test_split(total_df, test_size=0.2, random_state=2)
Test Result
              precision    recall  f1-score   support
          No       0.74      0.57      0.65       485
         Yes       0.83      0.91      0.87      1139
    accuracy                           0.81      1624
   macro avg       0.79      0.74      0.76      1624
weighted avg       0.81      0.81      0.81      1624
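The summary rows follow from the per-class rows: the macro average is the unweighted mean over classes, the weighted average weights each class by its support, and accuracy equals the support-weighted recall. The sketch below recomputes them from the rounded values in the table above. Because the inputs are rounded to two decimals, the results can differ from the reported rows by about 0.01. The dictionary layout and variable names are illustrative.

```python
# Per-class values copied from the table: (precision, recall, f1-score, support)
per_class = {
    "No": (0.74, 0.57, 0.65, 485),
    "Yes": (0.83, 0.91, 0.87, 1139),
}
metrics = ("precision", "recall", "f1-score")
total_support = sum(row[3] for row in per_class.values())

# Macro average: plain mean over classes, ignoring class size
macro = {
    name: sum(row[i] for row in per_class.values()) / len(per_class)
    for i, name in enumerate(metrics)
}

# Weighted average: each class contributes in proportion to its support
weighted = {
    name: sum(row[i] * row[3] for row in per_class.values()) / total_support
    for i, name in enumerate(metrics)
}

# Accuracy is the fraction of correct predictions, i.e. the support-weighted recall
accuracy = weighted["recall"]

print(total_support)
print({k: round(v, 3) for k, v in macro.items()})
print({k: round(v, 3) for k, v in weighted.items()})
```

The gap between the macro and weighted rows reflects the class imbalance: the Yes class has more than twice the support of the No class, so it dominates the weighted figures.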