Chenxi Whitehouse committed on
Commit
ed1749d
·
1 Parent(s): 17da9c4
README.md CHANGED
@@ -7,10 +7,51 @@ license: apache-2.0
7
 
8
  Data, knowledge store and source code to reproduce the baseline experiments for the [AVeriTeC](https://arxiv.org/abs/2305.13117) dataset, which will be used for the 7th [FEVER](https://fever.ai/) workshop co-hosted at EMNLP 2024.
9
 
10
 
11
- ### Set up environment
12
 
13
- ```
14
  conda create -n averitec python=3.11
15
  conda activate averitec
16
 
@@ -21,25 +62,56 @@ python -m nltk.downloader wordnet
21
  conda install pytorch pytorch-cuda=11.8 -c pytorch -c nvidia
22
  ```
23
 
24
- ### Scrape text from the URLs obtained by searching queries with the Google API
25
 
26
- We provide up to 1000 URLs for each claim returned from a Google API search using different queries. This is a courtesy aimed at reducing the cost of using the Google Search API for participants of the shared task. The URL files can be found [here](https://huggingface.co/chenxwh/AVeriTeC/tree/main/data_store/urls).
27
 
28
- You can use your own scraping tool to extract sentences from the URLs. Alternatively, we have included a scraping tool for this purpose, which can be executed as follows. The processed files are also provided and can be found [here](https://huggingface.co/chenxwh/AVeriTeC/tree/main/data_store/knowledge_store).
29
 
30
- ```
31
  bash script/scraper.sh <split> <start_idx> <end_idx>
32
  # e.g., bash script/scraper.sh dev 0 500
33
  ```
34
 
35
- ### Rank the sentences in the knowledge store with BM25, keep top 100 sentences for each claim
36
- See [bm25_sentences.py](https://huggingface.co/chenxwh/AVeriTeC/blob/main/src/reranking/bm25_sentences.py) for more argument options.
37
- ```
 
38
  python -m src.reranking.bm25_sentences
39
  ```
40
 
41
- ### Generate questions for each evidence sentence
42
- We use [BLOOM](https://huggingface.co/bigscience/bloom-7b1) to generate questions for each evidence sentence using the closest examples from the training set. See [question_generation_top_sentences.py](https://huggingface.co/chenxwh/AVeriTeC/blob/main/src/reranking/question_generation_top_sentences.py) for more argument options.
43
- ```
44
  python -m src.reranking.question_generation_top_sentences
45
  ```
 
7
 
8
  Data, knowledge store and source code to reproduce the baseline experiments for the [AVeriTeC](https://arxiv.org/abs/2305.13117) dataset, which will be used for the 7th [FEVER](https://fever.ai/) workshop co-hosted at EMNLP 2024.
9
 
10
+ ## Dataset
11
+ The training and dev datasets can be found under [data](https://huggingface.co/chenxwh/AVeriTeC/tree/main/data). Test data will be released at a later date. Each claim has the following structure:
12
+ ```json
13
+ {
14
+ "claim": "The claim text itself",
15
+ "required_reannotation": "True or False. Denotes that the claim received a second round of QG-QA and quality control annotation.",
16
+ "label": "The annotated verdict for the claim",
17
+ "justification": "A textual justification explaining how the verdict was reached from the question-answer pairs.",
18
+ "claim_date": "Our best estimate for the date the claim first appeared",
19
+ "speaker": "The person or organization that made the claim, e.g. Barack Obama, The Onion.",
20
+ "original_claim_url": "If the claim first appeared on the internet, a url to the original location",
21
+ "cached_original_claim_url": "Where possible, an archive.org link to the original claim url",
22
+ "fact_checking_article": "The fact-checking article we extracted the claim from",
23
+ "reporting_source": "The website or organization that first published the claim, e.g. Facebook, CNN.",
24
+ "location_ISO_code": "The location most relevant for the claim. Highly useful for search.",
25
+ "claim_types": [
26
+ "The types of the claim"
27
+ ],
28
+ "fact_checking_strategies": [
29
+ "The strategies employed in the fact-checking article"
30
+ ],
31
+ "questions": [
32
+ {
33
+ "question": "A fact-checking question for the claim",
34
+ "answers": [
35
+ {
36
+ "answer": "The answer to the question",
37
+ "answer_type": "Whether the answer was abstractive, extractive, boolean, or unanswerable",
38
+ "source_url": "The source url for the answer",
39
+ "cached_source_url": "An archive.org link for the source url",
39
+ "source_medium": "The medium the answer appeared in, e.g. web text, a pdf, or an image."
41
+ }
42
+ ]
43
+ }]
44
+ }
45
+ ```
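As a rough illustration of the structure above, the following sketch walks one claim record and flattens its question-answer pairs. The sample record and the helper name `flatten_qa_pairs` are made up for this example; only the field names come from the schema above.

```python
import json

# Hypothetical record that follows the schema above; all values are invented.
SAMPLE = json.loads("""
[
  {
    "claim": "Example claim text",
    "label": "Refuted",
    "questions": [
      {
        "question": "Example question?",
        "answers": [
          {"answer": "Example answer", "answer_type": "Extractive",
           "source_url": "https://example.com"}
        ]
      }
    ]
  }
]
""")


def flatten_qa_pairs(record):
    """Return (question, answer, answer_type) tuples for one claim record."""
    pairs = []
    for question in record.get("questions", []):
        for answer in question.get("answers", []):
            pairs.append(
                (question["question"], answer["answer"], answer.get("answer_type"))
            )
    return pairs


for record in SAMPLE:
    print(record["claim"], "->", record["label"])
    for q, a, a_type in flatten_qa_pairs(record):
        print(f"  Q: {q}\n  A: {a} ({a_type})")
```

To try this on the released files, `SAMPLE` would be replaced with the result of `json.load` on a file from the data directory; the exact file names are not stated here, so check the directory listing first.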
46
 
47
+ ## Reproduce the baseline
48
 
49
+ Below are the steps to reproduce the baseline results. The main difference from the results reported in the paper is that, instead of requiring direct access to the paid Google Search API, we provide the search results (up to 1000 URLs per claim, retrieved using different queries) together with the scraped text, which serves as the knowledge store for retrieval for each claim. This is intended to reduce the cost of participating in the Shared Task.
50
+
51
+
52
+ ### 0. Set up environment
53
+
54
+ ```bash
55
  conda create -n averitec python=3.11
56
  conda activate averitec
57
 
 
62
  conda install pytorch pytorch-cuda=11.8 -c pytorch -c nvidia
63
  ```
64
 
65
+ ### 1. Scrape text from the URLs obtained by searching queries with the Google API
66
 
67
+ The URLs of the search results and queries used for each claim can be found [here](https://huggingface.co/chenxwh/AVeriTeC/tree/main/data_store/urls).
68
 
69
+ Next, we scrape the text from the URLs and split it into sentences. The processed files are also provided and can be found [here](https://huggingface.co/chenxwh/AVeriTeC/tree/main/data_store/knowledge_store). Alternatively, you can use your own scraping tool to extract sentences from the URLs.
70
 
71
+ ```bash
72
  bash script/scraper.sh <split> <start_idx> <end_idx>
73
  # e.g., bash script/scraper.sh dev 0 500
74
  ```
75
 
76
+ ### 2. Rank the sentences in the knowledge store with BM25
77
+ Then, we rank the scraped sentences for each claim using BM25 (based on the similarity to the claim), keeping the top 100 sentences per claim.
78
+ See [bm25_sentences.py](https://huggingface.co/chenxwh/AVeriTeC/blob/main/src/reranking/bm25_sentences.py) for more argument options. We provide the output file for this step on the dev set [here]().
79
+ ```bash
80
  python -m src.reranking.bm25_sentences
81
  ```
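To make the ranking step concrete, here is a minimal, dependency-free sketch of Okapi BM25 scoring of sentences against a claim. It is not the code in `bm25_sentences.py`: the function names, the whitespace tokenizer, and the `k1`/`b` defaults are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for a proper tokenizer."""
    return text.lower().split()


def bm25_top_k(claim, sentences, k=100, k1=1.5, b=0.75):
    """Rank sentences against the claim with Okapi BM25 and return the top k."""
    docs = [tokenize(s) for s in sentences]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    # Fall back to 1.0 so empty sentences cannot cause a division by zero.
    avg_len = (sum(len(d) for d in docs) / n_docs) or 1.0
    # Document frequency: number of sentences containing each term.
    df = Counter(term for d in docs for term in set(d))

    def idf(term):
        # Smoothed IDF that stays non-negative for very common terms.
        return math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))

    query = tokenize(claim)
    scored = []
    for sentence, doc in zip(sentences, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf(term) * tf[term] * (k1 + 1) / norm
        scored.append((score, sentence))
    # Highest score first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:k]]


sentences = [
    "The cat sat on the mat",
    "Vaccines were approved in 2020 by regulators",
    "Stock prices fell sharply",
]
print(bm25_top_k("vaccines approved 2020", sentences, k=1))
```

With `k=100` the function mirrors the "top 100 sentences per claim" cut described above; for a real knowledge store, a proper tokenizer and an optimized BM25 library would likely be preferable.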
82
 
83
+ ### 3. Generate question-answer pairs for the top sentences
84
+ We use [BLOOM](https://huggingface.co/bigscience/bloom-7b1) to generate QA pairs for each of the top 100 sentences, providing the 10 closest claim-QA pairs from the training set as in-context examples. See [question_generation_top_sentences.py](https://huggingface.co/chenxwh/AVeriTeC/blob/main/src/reranking/question_generation_top_sentences.py) for more argument options. We provide the output file for this step on the dev set [here]().
85
+ ```bash
86
  python -m src.reranking.question_generation_top_sentences
87
+ ```
88
+
89
+ ### 4. Rerank the QA pairs
90
+ Using [a pre-trained BERT model](https://huggingface.co/chenxwh/AVeriTeC/blob/main/pretrained_models/bert_dual_encoder.ckpt), we rerank the QA pairs and keep the top 3 QA pairs as evidence. We provide the output file for this step on the dev set [here]().
91
+ ```bash
92
+ ```
93
+
94
+
95
+ ### 5. Veracity prediction
96
+ Finally, given a claim and its 3 QA pairs as evidence, we use [another pre-trained BERT model](https://huggingface.co/chenxwh/AVeriTeC/blob/main/pretrained_models/bert_veracity.ckpt) to predict the veracity label. The pre-trained model is provided. We provide the prediction file for this step on the dev set [here]().
97
+ ```bash
98
+ ```
99
+ The results will be presented as follows:
100
+ ```bash
101
+ ```
102
+
103
+ We recommend using 0.25 as the cut-off score for evaluating the relevance of the evidence. The results for the dev and test sets are given below.
104
+
105
+
106
+
107
+ ## Citation
108
+ If you find AVeriTeC useful for your research and applications, please cite us using this BibTeX:
109
+ ```bibtex
110
+ @article{schlichtkrull2024averitec,
111
+ title={Averitec: A dataset for real-world claim verification with evidence from the web},
112
+ author={Schlichtkrull, Michael and Guo, Zhijiang and Vlachos, Andreas},
113
+ journal={Advances in Neural Information Processing Systems},
114
+ volume={36},
115
+ year={2024}
116
+ }
117
  ```
src/reranking/question_generation_top_sentences.py CHANGED
@@ -98,7 +98,7 @@ if __name__ == "__main__":
98
  offload_folder="./offload",
99
  )
100
 
101
- with open(args.output_questions, "a", encoding="utf-8") as output_file:
102
  with open(args.top_k_target_knowledge, "r", encoding="utf-8") as json_file:
103
  for i, line in enumerate(json_file):
104
  data = json.loads(line)
 
98
  offload_folder="./offload",
99
  )
100
 
101
+ with open(args.output_questions, "w", encoding="utf-8") as output_file:
102
  with open(args.top_k_target_knowledge, "r", encoding="utf-8") as json_file:
103
  for i, line in enumerate(json_file):
104
  data = json.loads(line)
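The change from `"a"` to `"w"` affects what happens when the script is run more than once against the same output path. The following standalone sketch (the `run_once` helper and the file names are hypothetical, not part of the repository) shows the difference: append mode accumulates the output of every run, while write mode truncates the file first.

```python
import os
import tempfile


def run_once(path, mode):
    """Simulate one run of a script that writes two output lines."""
    with open(path, mode, encoding="utf-8") as output_file:
        output_file.write("question 1\n")
        output_file.write("question 2\n")


def count_lines(path):
    """Count the lines currently stored in the file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


with tempfile.TemporaryDirectory() as tmp:
    appended = os.path.join(tmp, "appended.jsonl")
    overwritten = os.path.join(tmp, "overwritten.jsonl")
    for _ in range(2):  # run the "script" twice
        run_once(appended, "a")
        run_once(overwritten, "w")
    append_count = count_lines(appended)        # second run added to the first
    overwrite_count = count_lines(overwritten)  # second run replaced the first
    print(append_count, overwrite_count)
```

One consequence of `"w"` is that an interrupted run can no longer be resumed by simply rerunning the script, since earlier output is discarded; whether that trade-off matters depends on how long generation takes.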