Full-text search
416 results
factckbr
README.md
dataset
4 matches
tags:
task_categories:text-classification, task_ids:fact-checking, annotations_creators:expert-generated, language_creators:found, multilinguality:monolingual, size_categories:1K<n<10K, source_datasets:original, language:pt, license:mit, croissant, region:us
…respective fact check and classification.
The data is collected from ClaimReview, a structured data schema used by fact-checking agencies to share their results with search engines, enabling data collection in real time.
The FACTCK.BR dataset contains 1,309 claims with their corresponding labels.
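A minimal loading sketch, assuming the dataset id shown in this listing resolves on the Hub; the split name is also an assumption:

```python
from datasets import load_dataset

# Minimal sketch: the id "factckbr" is taken from this listing and the
# "train" split name is an assumption.
ds = load_dataset("factckbr", split="train")
print(ds.num_rows)  # the README states 1,309 claims
```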
### Supported Tasks and Leaderboards
cl-nagoya / auto-wiki-qa
README.md
dataset
2 matches
sem_eval_2020_task_11
README.md
dataset
2 matches
tags:
task_categories:text-classification, task_categories:token-classification, annotations_creators:expert-generated, language_creators:found, multilinguality:monolingual, size_categories:n<1K, source_datasets:original, language:en, license:unknown, propaganda-span-identification, propaganda-technique-classification, arxiv:2009.02696, region:us
…Bias/Fact Check,³
and we retrieved articles from these sources. We deduplicated the articles on the basis of word n-gram matching (Barrón-Cedeño and Rosso, 2009) and we discarded faulty entries (e.g., empty entries from blocking websites).
fake-news-UFG / FactChecksbr
README.md
dataset
10 matches
Cofacts / line-msg-fact-check-tw
README.md
dataset
7 matches
tags:
task_categories:text-classification, task_categories:question-answering, size_categories:100K<n<1M, language:zh, license:cc-by-sa-4.0, fact-checking, crowd-sourcing, croissant, region:us
…Crowdsourced Fact-Check Replies
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qdE-OMJTi6ZO68J6KdzGdxNdheW4ct6T?usp=sharing)
The Cofacts dataset encompasses instant messages that have been reported by users of the [Cofacts chatbot](https://line.me/R/ti/p/@cofacts) and the replies provided by the [Cofacts crowd-sourced fact-checking community](https://www.facebook.com/groups/cofacts/).
akozlova / RuFacts
README.md
dataset
4 matches
tags:
task_categories:text-classification, size_categories:1K<n<10K, language:ru, license:cc-by-4.0, fact-checking, croissant, region:us
…internal fact-checking for the Russian language. The dataset contains tagged examples labeled consistent and inconsistent.
For inconsistent examples, ranges containing violations of facts in the source text and the generated text are also collected and presented on the [Kaggle competition page](https://www.kaggle.com/competitions/internal-fact-checking-for-the-russian-language).
Various data sources and approaches to data generation were used to create the training and test sets for the fact-checking task. The data consists of single sentences and short texts: the average text length is 198 characters, the minimum 10, and the maximum 3,402.
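A minimal sketch of checking the quoted length statistics; the Hub id comes from this listing, while the split name and the "text" column are assumptions:

```python
from datasets import load_dataset

# Minimal sketch: the "train" split and the "text" column name are
# assumptions, not documented in this snippet.
ds = load_dataset("akozlova/RuFacts", split="train")

lengths = [len(t) for t in ds["text"]]
print(min(lengths), sum(lengths) / len(lengths), max(lengths))
# the README quotes min 10, mean 198, max 3,402 characters
```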
ctu-aic / csfever_v2
README.md
dataset
5 matches
tags:
task_categories:text-classification, task_categories:text-retrieval, task_ids:natural-language-inference, task_ids:document-retrieval, multilinguality:monolingual, size_categories:100K<n<1M, source_datasets:fever, language:cs, license:cc-by-sa-3.0, Fact-checking, arxiv:2201.11115, region:us
…Czech fact-checking developed as part of a bachelor thesis at the Artificial Intelligence Center of the Faculty of Electrical Engineering of the Czech Technical University in Prague. The dataset consists of an **original** subset, which is an iteration of CsFEVER with new data and better processing, and **f1**, **precision**, and **07** subsets filtered using an NLI model and optimized threshold values. The **wiki_pages** subset is a processed Wikipedia dump from August 2022 with correct revids; it should be used to map evidence from the datasets to Wikipedia texts. Additionally, preprocessed subsets **original_nli**, **f1_nli**, **precision_nli**, and **07_nli** for training NLI models are included.
Gameselo / monolingual-wideNLI
README.md
dataset
3 matches
tags:
task_categories:text-classification, size_categories:100M<n<1B, language:en, natural-language-inference, fact-checking, croissant, region:us
…Fact-Checking oriented.
The dev split is designed to teach the model to handle pure NLI (ANLI is well designed for this task) and to test its general knowledge (fact-checking skills) with VitaminC, which is known for its robustness on this task.
It contains:
- 14.5k examples for the dev split, of which:
kundank / usb
README.md
dataset
1 matches
tags:
task_categories:summarization, size_categories:1K<n<10K, language:en, license:apache-2.0, factchecking, summarization, nli, region:us
# USB: A Unified Summarization Benchmark Across Tasks and Domains
This benchmark contains labeled datasets for 8 text-summarization-based tasks, given below.
The labeled datasets are created by collecting manual annotations on top of Wikipedia articles from 6 different domains.
copenlu / spanex
README.md
dataset
2 matches
tags:
task_categories:text-classification, size_categories:1K<n<10K, language:en, license:mit, rationale-extraction, reasoning, nli, fact-checking, explainability, croissant, region:us
…such as fact-checking (FC), machine reading comprehension (MRC), or natural language inference (NLI). However, existing highlight-based explanations primarily focus on identifying individual important features or interactions only between adjacent tokens or tuples of tokens. Most notably, there is a lack of annotations capturing the human decision-making process with respect to the necessary interactions for informed decision-making in such tasks. To bridge this gap, we introduce SpanEx, a multi-annotator dataset of human span-interaction explanations for two NLU tasks: NLI and FC. We then investigate the decision-making processes of multiple fine-tuned large language models in terms of the employed connections between spans in separate parts of the input and compare them to the human reasoning processes. Finally, we present a novel community-detection-based unsupervised method to extract such interaction explanations. We make the code and the dataset available on [GitHub](https://github.com/copenlu/spanex). The dataset is also available on [Hugging Face datasets](https://huggingface.co/datasets/copenlu/spanex).",
}
```
SEACrowd / x_fact
README.md
dataset
3 matches
tags:
language:ara, language:aze, language:ben, language:deu, language:spa, language:fas, language:fra, language:guj, language:hin, language:ind, language:ita, language:kat, language:mar, language:nor, language:nld, language:pan, language:pol, language:por, language:ron, language:rus, language:sin, language:srp, language:sqi, language:tam, language:tur, license:mit, fact-checking, region:us
…Multilingual Fact Checking}},
author={Gupta, Ashim and Srikumar, Vivek},
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2021",
ctu-aic / csfever
README.md
dataset
3 matches
tags:
license:cc-by-sa-3.0, croissant, arxiv:1803.05355, arxiv:2201.11115, region:us
…experimental Fact-Checking dataset
Czech dataset for fact verification localized from the data points of [FEVER](https://arxiv.org/abs/1803.05355) using the localization scheme described in the [CTKFacts: Czech Datasets for Fact Verification](https://arxiv.org/abs/2201.11115) paper, which is currently being revised for publication in the LREV journal.
The version you are looking at was reformatted into *Claim*-*Evidence* string pairs for the specific task of NLI; a more general, document-retrieval-ready interpretation of our data points, usable for training and evaluating DR models over the June 2016 Wikipedia snapshot, can be found in the [data_dr]() folder in the JSON Lines format.
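A minimal sketch of reading the Claim-Evidence pairs described above; the Hub id comes from this listing, while the split and the "claim"/"evidence"/"label" column names are assumptions:

```python
from datasets import load_dataset

# Minimal sketch: the "train" split and the "claim"/"evidence"/"label"
# column names are assumptions, not shown in this snippet.
ds = load_dataset("ctu-aic/csfever", split="train")
claim, evidence, label = ds[0]["claim"], ds[0]["evidence"], ds[0]["label"]
print(label, claim, evidence)
```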
amanrangapur / Fin-Fact
README.md
dataset
16 matches
clu-ling / clupubhealth
README.md
dataset
1 matches
tags:
task_categories:summarization, size_categories:1K<n<10K, size_categories:10K<n<100K, language:en, license:apache-2.0, medical, region:us
…the [PUBHEALTH fact-checking dataset](https://github.com/neemakot/Health-Fact-Checking).
The PUBHEALTH dataset contains claims, explanations, and main texts. The explanations function as vetted summaries of the main texts. The CLUPubhealth dataset repurposes these fields into summaries and texts for use in training summarization models such as Facebook's BART.
There are currently four dataset configs, each with three splits (see Usage); a minimal loading sketch follows:
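The config name below is a hypothetical stand-in, since the snippet cuts off before listing the four configs:

```python
from datasets import load_dataset

# Minimal sketch: "base" is a hypothetical config name; substitute one
# of the four configs listed in the README's Usage section.
ds = load_dataset("clu-ling/clupubhealth", "base")
print(ds)  # each config has three splits per the README
```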
justinqbui / covid_fact_checked_google_api
README.md
dataset
6 matches
tags:
croissant, region:us
…[Google Fact Checker API](https://toolbox.google.com/factcheck/explorer), using an automatic web scraper. 10,000 fact checks were pulled, but for the sake of simplicity only those whose rating was the single word "false" or "true" were kept, which filtered the set down to ~3,000 fact checks, with about 90% of the facts being false.
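A hedged sketch of the filtering step just described; `records` and the "textual_rating" key are hypothetical stand-ins for whatever the scraper produced:

```python
# Keep only fact checks whose rating is exactly the single word
# "false" or "true"; the "textual_rating" key is a hypothetical stand-in.
def keep_binary_ratings(records):
    kept = []
    for record in records:
        rating = record.get("textual_rating", "").strip().lower()
        if rating in ("false", "true"):
            kept.append(record)
    return kept
```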
eduagarcia / FactNews
README.md
dataset
2 matches
tags:
task_categories:text-classification, annotations_creators:expert-generated, language_creators:found, multilinguality:monolingual, size_categories:1K<n<10K, language:pt, language:por, license:unknown, subjectivity, mediabias, media-bias, croissant, region:us
…the FactCheck dataset on Hugging Face; the original data is made available by Vargas et al., 2023 and can be downloaded from the link: https://github.com/franciellevargas/FactNews*
*Modifications:*
- *The "original" subset contains the unmodified original CSV*
- *The subsets for the "bias_prediction" and "factuality_prediction" tasks were split into train (70%) and test (30%) by randomly selecting
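A hedged sketch of the 70/30 split described above; the "original" subset name comes from the snippet, while the source split and the seed are assumptions:

```python
from datasets import load_dataset

# Hedged sketch: loads the unmodified "original" subset and splits it
# 70/30; the seed is arbitrary and the "train" split name is an
# assumption.
ds = load_dataset("eduagarcia/FactNews", "original", split="train")
splits = ds.train_test_split(test_size=0.3, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
```

Note the dataset card already ships the split subsets; the sketch only mirrors the described procedure.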
lytang / LLM-AggreFact
README.md
dataset
5 matches
tags:
size_categories:10K<n<100K, language:en, license:cc-by-nd-4.0, croissant, arxiv:2404.10774, arxiv:2402.13249, arxiv:2402.00559, arxiv:2311.09000, arxiv:2309.07852, arxiv:2310.12150, region:us
…LLM-AggreFact is a fact verification benchmark from the work ([GitHub Repo](https://github.com/Liyan06/MiniCheck)):
📃 **MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents** ([link](https://arxiv.org/pdf/2404.10774.pdf))
It aggregates 10 of the most up-to-date publicly available datasets on factual consistency evaluation across
ctu-aic / ctkfacts_nli
README.md
dataset
2 matches
tags:
croissant, arxiv:2201.11115, region:us
…of fact-checking experiments concluded and described in the CsFEVER and [CTKFacts: Czech Datasets for Fact Verification](https://arxiv.org/abs/2201.11115) paper, currently being revised for publication in the LREV journal.
## Document retrieval version
Can be found at https://huggingface.co/datasets/ctu-aic/ctkfacts
ctu-aic / ctkfacts
README.md
dataset
2 matches
tags:
license:cc-by-sa-3.0, croissant, arxiv:2201.11115, region:us
…of fact-checking experiments concluded and described in the [CsFEVER and CTKFacts: Acquiring Czech Data for Fact Verification](https://arxiv.org/abs/2201.11115) paper, currently being revised for publication in the LREV journal.
## NLI version
Can be found at https://huggingface.co/datasets/ctu-aic/ctkfacts_nli