Model Card for T5-base for Claim Summarization

This model summarizes noisy social media claims into clean, concise claims that can be used for downstream tasks in a fact-checking pipeline.

Model Details

This is the T5-base model fine-tuned with the 'Pre-processed with Mention and Hashtag Run Removed (P-MRR-HRR)' preprocessing strategy, which is detailed in Table 2 of the paper.

Model Description

Developed by: Varad Bhatnagar, Diptesh Kanojia and Kameswari Chebrolu
Model type: Summarization
Language(s) (NLP): English
Finetuned from model: https://huggingface.co/t5-base

Model Sources

Repository: https://github.com/varadhbhatnagar/FC-Claim-Det
Paper: https://aclanthology.org/2022.coling-1.259/

Tokenizer

Same as T5-base

Uses

Direct Use

English-to-English summarization of noisy, check-worthy claims found on social media.

Downstream Use

Can be used for other tasks in a fact-checking pipeline such as claim matching and evidence retrieval.

Bias, Risks, and Limitations

Because the Google Fact Check Explorer is an ever-growing and evolving system, current Retrieval@k results may not exactly match those reported in the paper, whose experiments were conducted in April and May 2022.

Training Details

Training Data

Data

Training Procedure

The pretrained T5-base model was fine-tuned on the 567 pairs released in our paper.

Preprocessing

Pre-processed with Mention and Hashtag Run Removed (P-MRR-HRR). Apply this strategy to the input text before feeding it to the model for summarization.
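The exact P-MRR-HRR rules are defined in the paper and the repository, not in this card. As a rough illustration only, the sketch below assumes that a "run" means two or more consecutive tokens that start with `@` (mentions) or with `#` (hashtags), and that such runs are dropped while isolated mentions and hashtags are kept. The function name `remove_mention_and_hashtag_runs` and both regular expressions are hypothetical and are not taken from the authors' code.

```python
import re

# Assumption: a "run" is two or more consecutive mentions (or hashtags),
# optionally separated by whitespace. The paper may define it differently.
MENTION_RUN = re.compile(r"(?:@\w+\s*){2,}")
HASHTAG_RUN = re.compile(r"(?:#\w+\s*){2,}")


def remove_mention_and_hashtag_runs(text: str) -> str:
    """Drop runs of mentions and runs of hashtags, then tidy whitespace."""
    text = MENTION_RUN.sub(" ", text)
    text = HASHTAG_RUN.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    noisy = "@a @b vaccines cause harm #covid #fake #news"
    print(remove_mention_and_hashtag_runs(noisy))
```

Under these assumptions, a single mention or hashtag embedded in a sentence is left untouched, since it may carry part of the claim. For reproducing the reported results, use the preprocessing code from the repository linked above.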

Evaluation

Retrieval@5 and Mean Reciprocal Rank (MRR) scores are reported.

Results

Retrieval@5 = 28.75
MRR = 0.25

Further details can be found in the paper.
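For readers unfamiliar with these metrics, the following is a minimal sketch of how Retrieval@k and MRR are commonly computed from the rank at which the relevant fact-check appears for each query. It assumes Retrieval@k is expressed as a percentage of queries and that a query with no relevant result contributes zero to MRR; the paper's exact protocol may differ. The function names and the `ranks` input format are hypothetical.

```python
from typing import Optional, Sequence


def retrieval_at_k(ranks: Sequence[Optional[int]], k: int = 5) -> float:
    """Percentage of queries whose relevant item is ranked within the top k.

    Each entry of `ranks` is the 1-based rank of the relevant item,
    or None when it was not retrieved at all.
    """
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return 100.0 * hits / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Mean of 1/rank over all queries; unretrieved items count as 0."""
    total = sum(1.0 / r for r in ranks if r is not None)
    return total / len(ranks)


if __name__ == "__main__":
    # Toy ranks for four queries; the third query found nothing relevant.
    ranks = [1, 2, None, 4]
    print(retrieval_at_k(ranks, k=5))   # 75.0
    print(mean_reciprocal_rank(ranks))  # 0.4375
```

The toy ranks are illustrative and are unrelated to the scores reported above.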

Other Models from the Same Work

DBART

DPEGASUS

Citation

BibTeX:

@inproceedings{bhatnagar-etal-2022-harnessing,
    title = "Harnessing Abstractive Summarization for Fact-Checked Claim Detection",
    author = "Bhatnagar, Varad  and
      Kanojia, Diptesh  and
      Chebrolu, Kameswari",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.259",
    pages = "2934--2945",
    abstract = "Social media platforms have become new battlegrounds for anti-social elements, with misinformation being the weapon of choice. Fact-checking organizations try to debunk as many claims as possible while staying true to their journalistic processes but cannot cope with its rapid dissemination. We believe that the solution lies in partial automation of the fact-checking life cycle, saving human time for tasks which require high cognition. We propose a new workflow for efficiently detecting previously fact-checked claims that uses abstractive summarization to generate crisp queries. These queries can then be executed on a general-purpose retrieval system associated with a collection of previously fact-checked claims. We curate an abstractive text summarization dataset comprising noisy claims from Twitter and their gold summaries. It is shown that retrieval performance improves 2x by using popular out-of-the-box summarization models and 3x by fine-tuning them on the accompanying dataset compared to verbatim querying. Our approach achieves Recall@5 and MRR of 35{\%} and 0.3, compared to baseline values of 10{\%} and 0.1, respectively. Our dataset, code, and models are available publicly: https://github.com/varadhbhatnagar/FC-Claim-Det/.",
}

Model Card Authors

Varad Bhatnagar

Model Card Contact

Email: varadhbhatnagar@gmail.com

How to Get Started with the Model

Use the code below to get started with the model.

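import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# Run on a GPU when one is available, otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

hft = T5TokenizerFast.from_pretrained('varadhbhatnagar/fc-claim-det-T5-base')
hfm = T5ForConditionalGeneration.from_pretrained('varadhbhatnagar/fc-claim-det-T5-base').to(device)

# A noisy claim from social media.
row = 'hi satya my name is arman today i got this video which is being spread in whatsapp and it is being said that the all old age covid 19 patients are being killed in the government hospital kindly check the facts'

# Tokenize the claim and move the input IDs to the same device as the model.
tokenized_text = hft.encode(row, return_tensors="pt").to(device)

# Generate a short summary with beam search.
summary_ids = hfm.generate(tokenized_text,
                           num_beams=6,
                           no_repeat_ngram_size=2,
                           min_length=5,
                           max_length=15,
                           early_stopping=True)

# Decode the generated token IDs back into text.
output = hft.decode(summary_ids[0], skip_special_tokens=True)
print(output)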