Upload 15 files
- alignscore/LICENSE +21 -0
- alignscore/README.md +216 -0
- alignscore/alignscore_fig.png +0 -0
- alignscore/baselines.py +704 -0
- alignscore/benchmark.py +494 -0
- alignscore/evaluate.py +1793 -0
- alignscore/generate_training_data.py +1519 -0
- alignscore/pyproject.toml +41 -0
- alignscore/requirements.txt +9 -0
- alignscore/src/alignscore/__init__.py +1 -0
- alignscore/src/alignscore/alignscore.py +16 -0
- alignscore/src/alignscore/dataloader.py +610 -0
- alignscore/src/alignscore/inference.py +293 -0
- alignscore/src/alignscore/model.py +308 -0
- alignscore/train.py +144 -0
alignscore/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 yuh-zha

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
alignscore/README.md
ADDED
@@ -0,0 +1,216 @@
# AlignScore
This is the repository for AlignScore, a metric for automatic factual consistency evaluation of text pairs, introduced in \
[AlignScore: Evaluating Factual Consistency with a Unified Alignment Function](https://arxiv.org/abs/2305.16739) \
Yuheng Zha, Yichi Yang, Ruichen Li and Zhiting Hu \
ACL 2023

**Factual consistency evaluation** checks whether all the information in **b** is contained in **a** (i.e., **b** does not contradict **a**). For example, this pair is factually inconsistent:

* **a**: Children smiling and waving at camera.
* **b**: The kids are frowning.

And this pair is factually consistent:

* **a**: The NBA season of 1975 -- 76 was the 30th season of the National Basketball Association.
* **b**: The 1975 -- 76 season of the National Basketball Association was the 30th season of the NBA.

Factual consistency evaluation applies to many tasks, such as summarization, paraphrasing and dialog. For example, large language models often hallucinate when summarizing documents, and we want to know whether the generated text is factually consistent with its original context.
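As a concrete illustration, the two pairs above can be scored with the `AlignScore` interface described later in this README. This is a minimal sketch: the checkpoint path is a placeholder, and the exact numbers depend on the checkpoint you download.

```python
from alignscore import AlignScore

# Placeholder path -- download AlignScore-base or AlignScore-large as described below.
scorer = AlignScore(model='roberta-base', batch_size=32, device='cuda:0',
                    ckpt_path='/path/to/AlignScore-base.ckpt', evaluation_mode='nli_sp')

contexts = [
    "Children smiling and waving at camera.",
    "The NBA season of 1975 -- 76 was the 30th season of the National Basketball Association.",
]
claims = [
    "The kids are frowning.",  # contradicts its context
    "The 1975 -- 76 season of the National Basketball Association was the 30th season of the NBA.",  # consistent
]

# Expect a clearly lower score for the first (inconsistent) pair than for the second.
print(scorer.score(contexts=contexts, claims=claims))
```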

# Leaderboards
We introduce two leaderboards that compare AlignScore with similar-sized metrics and with LLM-based metrics, respectively.
## Leaderboard --- compare with similar-sized metrics

We list the performance of AlignScore and other metrics on the SummaC benchmark (6 datasets) and the TRUE benchmark (11 datasets), as well as on other popular factual consistency datasets (6 datasets).

| Rank | Metrics | SummaC* | TRUE** | Other Datasets*** | Average**** | Paper | Code |
| ---- | :--------------- | :-----: | :----: | :------------: | :-----: | :---: | :--: |
| 1 | **AlignScore-large** | 88.6 | 83.8 | 49.3 | 73.9 | [:page\_facing\_up:(Zha et al. 2023)](https://arxiv.org/pdf/2305.16739.pdf) | [:octocat:](https://github.com/yuh-zha/AlignScore) |
| 2 | **AlignScore-base** | 87.4 | 82.5 | 44.9 | 71.6 | [:page\_facing\_up:(Zha et al. 2023)](https://arxiv.org/pdf/2305.16739.pdf) | [:octocat:](https://github.com/yuh-zha/AlignScore) |
| 3 | QAFactEval | 83.8 | 79.4 | 42.4 | 68.5 | [:page\_facing\_up:(Fabbri et al. 2022)](https://arxiv.org/abs/2112.08542) | [:octocat:](https://github.com/salesforce/QAFactEval) |
| 4 | UniEval | 84.6 | 78.0 | 41.5 | 68.0 | [:page\_facing\_up:(Zhong et al. 2022)](https://arxiv.org/abs/2210.07197) | [:octocat:](https://github.com/maszhongming/UniEval) |
| 5 | SummaC-CONV | 81.0 | 78.7 | 34.2 | 64.6 | [:page\_facing\_up:(Laban et al. 2022)](https://arxiv.org/abs/2111.09525) | [:octocat:](https://github.com/tingofurro/summac) |
| 6 | BARTScore | 80.9 | 73.4 | 34.8 | 63.0 | [:page\_facing\_up:(Yuan et al. 2022)](https://arxiv.org/abs/2106.11520) | [:octocat:](https://github.com/neulab/BARTScore) |
| 7 | CTC | 81.2 | 72.4 | 35.3 | 63.0 | [:page\_facing\_up:(Deng et al. 2022)](https://arxiv.org/abs/2109.06379) | [:octocat:](https://github.com/tanyuqian/ctc-gen-eval) |
| 8 | SummaC-ZS | 79.0 | 78.2 | 30.4 | 62.5 | [:page\_facing\_up:(Laban et al. 2022)](https://arxiv.org/abs/2111.09525) | [:octocat:](https://github.com/tingofurro/summac) |
| 9 | ROUGE-2 | 78.1 | 72.4 | 27.9 | 59.5 | [:page\_facing\_up:(Lin 2004)](https://aclanthology.org/W04-1013/) | [:octocat:](https://github.com/pltrdy/rouge) |
| 10 | ROUGE-1 | 77.4 | 72.0 | 28.6 | 59.3 | [:page\_facing\_up:(Lin 2004)](https://aclanthology.org/W04-1013/) | [:octocat:](https://github.com/pltrdy/rouge) |
| 11 | ROUGE-L | 77.3 | 71.8 | 28.3 | 59.1 | [:page\_facing\_up:(Lin 2004)](https://aclanthology.org/W04-1013/) | [:octocat:](https://github.com/pltrdy/rouge) |
| 12 | QuestEval | 72.5 | 71.4 | 25.0 | 56.3 | [:page\_facing\_up:(Scialom et al. 2021)](https://arxiv.org/abs/2103.12693) | [:octocat:](https://github.com/ThomasScialom/QuestEval) |
| 13 | BLEU | 76.3 | 67.3 | 24.6 | 56.1 | [:page\_facing\_up:(Papineni et al. 2002)](https://aclanthology.org/P02-1040/) | [:octocat:](https://www.nltk.org/_modules/nltk/translate/bleu_score.html) |
| 14 | DAE | 66.8 | 65.7 | 35.1 | 55.8 | [:page\_facing\_up:(Goyal and Durrett 2020)](https://aclanthology.org/2020.findings-emnlp.322/) | [:octocat:](https://github.com/tagoyal/dae-factuality) |
| 15 | BLEURT | 69.2 | 71.9 | 24.9 | 55.4 | [:page\_facing\_up:(Sellam et al. 2020)](https://arxiv.org/abs/2004.04696) | [:octocat:](https://github.com/google-research/bleurt) |
| 16 | BERTScore | 72.1 | 68.6 | 21.9 | 54.2 | [:page\_facing\_up:(Zhang et al. 2020)](https://arxiv.org/abs/1904.09675) | [:octocat:](https://github.com/Tiiiger/bert_score) |
| 17 | SimCSE | 67.4 | 70.3 | 23.8 | 53.8 | [:page\_facing\_up:(Gao et al. 2021)](https://arxiv.org/abs/2104.08821) | [:octocat:](https://github.com/princeton-nlp/SimCSE) |
| 18 | FactCC | 68.8 | 62.7 | 21.2 | 50.9 | [:page\_facing\_up:(Kryscinski et al. 2020)](https://arxiv.org/abs/1910.12840) | [:octocat:](https://github.com/salesforce/factCC) |
| 19 | BLANC | 65.1 | 64.0 | 14.4 | 47.8 | [:page\_facing\_up:(Vasilyev et al. 2020)](https://arxiv.org/abs/2002.09836) | [:octocat:](https://github.com/PrimerAI/blanc) |
| 20 | NER-Overlap | 60.4 | 59.3 | 18.9 | 46.2 | [:page\_facing\_up:(Laban et al. 2022)](https://arxiv.org/abs/2111.09525) | [:octocat:](https://github.com/tingofurro/summac) |
| 21 | MNLI | 47.9 | 60.4 | 3.1 | 37.2 | [:page\_facing\_up:(Williams et al. 2018)](https://arxiv.org/abs/1704.05426) | [:octocat:](https://github.com/nyu-mll/multiNLI) |
| 22 | FEQA | 48.3 | 52.2 | -1.9 | 32.9 | [:page\_facing\_up:(Durmus et al. 2020)](https://arxiv.org/abs/2005.03754) | [:octocat:](https://github.com/esdurmus/feqa) |

\* SummaC Benchmark: [\[Paper\]](https://arxiv.org/abs/2111.09525) \| [\[Github\]](https://github.com/tingofurro/summac). We report AUC ROC on the SummaC benchmark.

** TRUE Benchmark: [\[Paper\]](https://arxiv.org/abs/2204.04991) \| [\[Github\]](https://github.com/google-research/true). We report AUC ROC on the TRUE benchmark.

*** Besides the SummaC and TRUE benchmarks, we also include other popular factual consistency evaluation datasets: [XSumFaith](https://doi.org/10.18653/v1/2020.acl-main.173), [SummEval](https://doi.org/10.1162/tacl_a_00373), [QAGS-XSum](https://doi.org/10.18653/v1/2020.acl-main.450), [QAGS-CNNDM](https://doi.org/10.18653/v1/2020.acl-main.450), [FRANK-XSum](https://doi.org/10.18653/v1/2021.naacl-main.383), [FRANK-CNNDM](https://doi.org/10.18653/v1/2021.naacl-main.383) and [SamSum](https://doi.org/10.18653/v1/D19-5409). Following common practice, we compute the Spearman correlation coefficients between the human-annotated scores and the metric-predicted scores.

**** To rank these metrics, we simply average their performance on SummaC, TRUE and the other datasets.

## Leaderboard --- compare with LLM-based metrics

We also show a performance comparison with metrics based on large language models below. The rank is based on the average Spearman correlation coefficients on the SummEval, QAGS-XSum and QAGS-CNNDM datasets.*

| Rank | Metrics | Base Model | SummEval | QAGS-XSUM | QAGS-CNNDM | Average | Paper | Code |
| :--- | :-------------------- | :----------------------------------------------------------- | :------: | :-------: | :--------: | :--: | :----------------------------------------------------------: | :----------------------------------------------------------: |
| 1 | **AlignScore-large** | RoBERTa-l (355M) | 46.6 | 57.2 | 73.9 | 59.3 | [:page\_facing\_up:(Zha et al. 2023)](https://arxiv.org/pdf/2305.16739.pdf) | [:octocat:](https://github.com/yuh-zha/AlignScore) |
| 2 | G-EVAL-4 | GPT4 | 50.7 | 53.7 | 68.5 | 57.6 | [:page\_facing\_up:(Liu et al. 2023)](https://arxiv.org/pdf/2303.16634.pdf) | [:octocat:](https://github.com/nlpyang/geval) |
| 3 | **AlignScore-base** | RoBERTa-b (125M) | 43.4 | 51.9 | 69.0 | 54.8 | [:page\_facing\_up:(Zha et al. 2023)](https://arxiv.org/pdf/2305.16739.pdf) | [:octocat:](https://github.com/yuh-zha/AlignScore) |
| 4 | FActScore (modified)** | GPT3.5-d03 + GPT3.5-turbo | 52.6 | 51.2 | 57.6 | 53.8 | [:page\_facing\_up:(Min et al. 2023)](https://arxiv.org/pdf/2305.14251.pdf) | [:octocat:](https://github.com/shmsw25/FActScore)* |
| 5 | ChatGPT (Chen et al. 2023) | GPT3.5-turbo | 42.7 | 53.3 | 52.7 | 49.6 | [:page\_facing\_up:(Yi Chen et al. 2023)](https://arxiv.org/pdf/2305.14069.pdf) | [:octocat:](https://github.com/SJTU-LIT/llmeval_sum_factual) |
| 6 | GPTScore | GPT3.5-d03 | 45.9 | 22.7 | 64.4 | 44.3 | [:page\_facing\_up:(Fu et al. 2023)](https://arxiv.org/pdf/2302.04166.pdf) | [:octocat:](https://github.com/jinlanfu/GPTScore) |
| 7 | GPTScore | GPT3-d01 | 46.1 | 22.3 | 63.9 | 44.1 | [:page\_facing\_up:(Fu et al. 2023)](https://arxiv.org/pdf/2302.04166.pdf) | [:octocat:](https://github.com/jinlanfu/GPTScore) |
| 8 | G-EVAL-3.5 | GPT3.5-d03 | 38.6 | 40.6 | 51.6 | 43.6 | [:page\_facing\_up:(Liu et al. 2023)](https://arxiv.org/pdf/2303.16634.pdf) | [:octocat:](https://github.com/nlpyang/geval) |
| 9 | ChatGPT (Gao et al. 2023) | GPT3.5-turbo | 41.6 | 30.4 | 48.9 | 40.3 | [:page\_facing\_up:(Gao et al. 2023)](https://arxiv.org/pdf/2304.02554.pdf) | - |

\* Evaluating factual consistency with GPT-based models is expensive and slow, and human labor is needed to map the responses (generally free text) to numerical scores. We therefore benchmark only on 3 popular factual consistency evaluation datasets: SummEval, QAGS-XSum and QAGS-CNNDM.

*\* We use a modified version of FActScore `retrieval+ChatGPT` where we skip the retrieval stage and use the context documents in SummEval, QAGS-XSUM, and QAGS-CNNDM directly. As samples in these datasets do not have "topics", we make a small modification to the original FActScore prompt and do not mention the `topic` when it is not available. See [our fork of FActScore](https://github.com/yichi-yang/FActScore) for more details.

# Introduction

The AlignScore metric is an automatic factual consistency evaluation metric built from the following parts:

* A unified information alignment function between two arbitrary text pieces: it is trained on 4.7 million examples from 7 well-established tasks (NLI, QA, paraphrasing, fact verification, information retrieval, semantic textual similarity and summarization).

* A chunk-sentence splitting method: the input context is split into chunks (roughly 350 tokens each) and the input claim is split into sentences. The alignment function then scores each (chunk, sentence) pair; we take the maximum alignment score for each sentence and average these maxima to get the example-level factual consistency score (AlignScore).

<div align=center>
<img src="./alignscore_fig.png" alt="alignscore_fig" width="500px" />
</div>

The metric takes two inputs, `context` and `claim`, and evaluates whether the `claim` is factually consistent with the `context`. The output of AlignScore is a single numerical value indicating the degree of factual consistency.
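To make the aggregation concrete, here is a minimal sketch of the chunk-sentence scoring described above. The `align` argument is a hypothetical stand-in for the trained alignment function (it should return an alignment score for a context chunk and a claim sentence); the actual implementation in this repository additionally handles chunking, batching and the classification heads.

```python
from typing import Callable, List

def alignscore_aggregate(
    context_chunks: List[str],
    claim_sentences: List[str],
    align: Callable[[str, str], float],  # hypothetical stand-in for the trained alignment model
) -> float:
    """Max over context chunks for each claim sentence, then mean over sentences."""
    per_sentence_max = []
    for sentence in claim_sentences:
        # How well is this claim sentence supported by its best-matching context chunk?
        per_sentence_max.append(max(align(chunk, sentence) for chunk in context_chunks))
    # Example-level AlignScore: average of the per-sentence maxima.
    return sum(per_sentence_max) / len(per_sentence_max)
```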
# Installation

Our models are trained and evaluated with PyTorch 1.12.1. We recommend using this version to reproduce the results.

1. Please install the matching version of PyTorch before installing `alignscore`.
2. Install `alignscore` by cloning this repository and running `pip install .`.
3. After installing `alignscore`, run `python -m spacy download en_core_web_sm` to install the required spaCy model (we use spaCy for sentence splitting).

# Evaluating Factual Consistency
To evaluate the factual consistency of a `claim` w.r.t. a `context`, simply use the `score` method of `AlignScore`.
```python
from alignscore import AlignScore

scorer = AlignScore(model='roberta-base', batch_size=32, device='cuda:0', ckpt_path='/path/to/checkpoint', evaluation_mode='nli_sp')
score = scorer.score(contexts=['hello world.'], claims=['hello world.'])
```
`model`: the backbone model of the metric. Currently we only provide metrics trained on RoBERTa.

`batch_size`: the batch size used at inference time.

`device`: the device on which to run the metric.

`ckpt_path`: the path to the checkpoint.

`evaluation_mode`: one of `'nli_sp', 'nli', 'bin_sp', 'bin'`. `nli` and `bin` refer to the 3-way and binary classification heads, respectively, and `sp` indicates that the chunk-sentence splitting method is used. `nli_sp` is the default setting of AlignScore.
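Continuing the example above, `score` returns one consistency value per (context, claim) pair, with higher values indicating better support. A small usage sketch follows; the threshold is arbitrary and only for illustration (choose a cutoff on your own validation data), and it assumes `score` returns an iterable of per-pair values for list inputs, as in the example above.

```python
contexts = [
    "The cat sat on the mat in the kitchen.",
    "The cat sat on the mat in the kitchen.",
]
claims = [
    "A cat was sitting on a mat.",   # supported by the context
    "The dog slept in the garden.",  # not supported by the context
]

scores = scorer.score(contexts=contexts, claims=claims)

THRESHOLD = 0.5  # illustrative only, not an official cutoff
for claim, value in zip(claims, scores):
    flag = "consistent" if value >= THRESHOLD else "check manually"
    print(f"{value:.3f}  {flag}  {claim}")
```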

# Checkpoints
We provide two versions of the AlignScore checkpoint: `AlignScore-base` and `AlignScore-large`. The `-base` model is based on RoBERTa-base and has 125M parameters. The `-large` model is based on RoBERTa-large and has 355M parameters.

**AlignScore-base**:
https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-base.ckpt

**AlignScore-large**:
https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-large.ckpt
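Since the checkpoints are hosted on the Hugging Face Hub, one convenient option (a plain download of the URLs above works just as well) is to fetch them with the `huggingface_hub` package and pass the cached file path as `ckpt_path`. A minimal sketch, assuming `huggingface_hub` is installed:

```python
from huggingface_hub import hf_hub_download
from alignscore import AlignScore

# Download (and cache) the checkpoint file from the yzha/AlignScore repository.
ckpt_path = hf_hub_download(repo_id="yzha/AlignScore", filename="AlignScore-base.ckpt")

scorer = AlignScore(model='roberta-base', batch_size=32, device='cuda:0',
                    ckpt_path=ckpt_path, evaluation_mode='nli_sp')
```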
# Training
You can use the above checkpoints directly for factual consistency evaluation. However, if you wish to train an alignment model from scratch or on your own data, use `train.py`.
```bash
python train.py --seed 2022 --batch-size 32 \
--num-epoch 3 --devices 0 1 2 3 \
--model-name roberta-large --ckpt-save-path ./ckpt/ \
--data-path ./data/training_sets/ \
--max-samples-per-dataset 500000
```

`--seed`: the random seed for initialization

`--batch-size`: the batch size for training

`--num-epoch`: the number of training epochs

`--devices`: the devices to train on, given as a list of GPU ids

`--model-name`: the backbone model name of the metric, default RoBERTa-large

`--ckpt-save-path`: the path where checkpoints are saved

`--training-datasets`: the names of the training datasets

`--data-path`: the path to the training datasets

`--max-samples-per-dataset`: the maximum number of samples drawn from each dataset

# Benchmarking
Our benchmark includes the TRUE and SummaC benchmarks as well as several other popular factual consistency evaluation datasets.

To run the benchmark, a few additional dependencies are required; they can be installed with `pip install -r requirements.txt`.
Additionally, some dependencies are not available as packages and need to be downloaded manually (please see `python benchmark.py --help` for instructions).

Note that installing `summac` may cause dependency conflicts with `alignscore`. Please reinstall `alignscore` to force the correct dependency versions.

The relevant arguments for evaluating AlignScore are:

`--alignscore`: evaluate the AlignScore metric

`--alignscore-model`: the name of the backbone model (either 'roberta-base' or 'roberta-large')

`--alignscore-ckpt`: the path to the saved checkpoint

`--alignscore-eval-mode`: the evaluation mode, defaults to `nli_sp`

`--device`: the device on which to run the metric, defaults to `cuda:0`

`--tasks`: the tasks to benchmark, e.g., SummEval, QAGS-CNNDM, ...

For the baselines, please see `python benchmark.py --help` for details.

## Training datasets download
Most datasets are downloadable from Hugging Face (refer to [`generate_training_data.py`](https://github.com/yuh-zha/AlignScore/blob/main/generate_training_data.py)). Some datasets that previously had to be imported manually are now also available on Hugging Face (see [this issue](https://github.com/yuh-zha/AlignScore/issues/6#issuecomment-1695448614)).

## Evaluation datasets download

The following table shows the links to the evaluation datasets mentioned in the paper.

| Benchmark/Dataset | Link |
| ----------------- | ------------------------------------------------------------ |
| SummaC | https://github.com/tingofurro/summac |
| TRUE | https://github.com/google-research/true |
| XSumFaith | https://github.com/google-research-datasets/xsum_hallucination_annotations |
| SummEval | https://github.com/tanyuqian/ctc-gen-eval/blob/master/train/data/summeval.json |
| QAGS-Xsum | https://github.com/tanyuqian/ctc-gen-eval/blob/master/train/data/qags_xsum.json |
| QAGS-CNNDM | https://github.com/tanyuqian/ctc-gen-eval/blob/master/train/data/qags_cnndm.json |
| FRANK-XSum | https://github.com/artidoro/frank |
| FRANK-CNNDM | https://github.com/artidoro/frank |
| SamSum | https://github.com/skgabriel/GoFigure/blob/main/human_eval/samsum.jsonl |

# Citation
If you find the metric and this repo helpful, please consider citing:
```
@inproceedings{zha-etal-2023-alignscore,
    title = "{A}lign{S}core: Evaluating Factual Consistency with A Unified Alignment Function",
    author = "Zha, Yuheng  and
      Yang, Yichi  and
      Li, Ruichen  and
      Hu, Zhiting",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.634",
    pages = "11328--11348",
    abstract = "Many text generation applications require the generated text to be factually consistent with input information. Automatic evaluation of factual consistency is challenging. Previous work has developed various metrics that often depend on specific functions, such as natural language inference (NLI) or question answering (QA), trained on limited data. Those metrics thus can hardly assess diverse factual inconsistencies (e.g., contradictions, hallucinations) that occur in varying inputs/outputs (e.g., sentences, documents) from different tasks. In this paper, we propose AlignScore, a new holistic metric that applies to a variety of factual inconsistency scenarios as above. AlignScore is based on a general function of information alignment between two arbitrary text pieces. Crucially, we develop a unified training framework of the alignment function by integrating a large diversity of data sources, resulting in 4.7M training examples from 7 well-established tasks (NLI, QA, paraphrasing, fact verification, information retrieval, semantic similarity, and summarization). We conduct extensive experiments on large-scale benchmarks including 22 evaluation datasets, where 19 of the datasets were never seen in the alignment training. AlignScore achieves substantial improvement over a wide range of previous metrics. Moreover, AlignScore (355M parameters) matches or even outperforms metrics based on ChatGPT and GPT-4 that are orders of magnitude larger.",
}
```
alignscore/alignscore_fig.png
ADDED
alignscore/baselines.py
ADDED
@@ -0,0 +1,704 @@
from logging import warning
import torch
import torch.nn as nn
import numpy as np
from tqdm import tqdm
import spacy
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import sent_tokenize
import json
import openai  # used by the ChatGPT-based scorers below

class CTCScorer():
    def __init__(self, model_type) -> None:
        self.model_type = model_type
        import nltk
        nltk.download('stopwords')

        from ctc_score import StyleTransferScorer, SummarizationScorer, DialogScorer
        if model_type == 'D-cnndm':
            self.scorer = SummarizationScorer(align='D-cnndm')
        elif model_type == 'E-roberta':
            self.scorer = SummarizationScorer(align='E-roberta')
        elif model_type == 'R-cnndm':
            self.scorer = SummarizationScorer(align='R-cnndm')
    def score(self, premise: list, hypo: list):
        assert len(premise) == len(hypo), "Premise and hypothesis should have the same length"

        output_scores = []
        for one_pre, one_hypo in tqdm(zip(premise, hypo), total=len(premise), desc="Evaluating by ctc"):
            score_for_this_example = self.scorer.score(doc=one_pre, refs=[], hypo=one_hypo, aspect='consistency')
            if score_for_this_example is not None:
                output_scores.append(score_for_this_example)
            else:
                output_scores.append(1e-8)
        output = None, torch.tensor(output_scores), None

        return output

class SimCSEScorer():
    def __init__(self, model_type, device) -> None:
        self.model_type = model_type
        self.device = device
        from transformers import AutoModel, AutoTokenizer

        # refer to https://github.com/princeton-nlp/SimCSE for the list of models
        self.tokenizer = AutoTokenizer.from_pretrained(model_type)
        self.model = AutoModel.from_pretrained(model_type).to(self.device)
        self.spacy = spacy.load('en_core_web_sm')

        self.batch_size = 64

    def score(self, premise: list, hypo: list):
        assert len(premise) == len(hypo)

        output_scores = []
        premise_sents = []
        premise_index = [0]
        hypo_sents = []
        hypo_index = [0]

        for one_pre, one_hypo in tqdm(zip(premise, hypo), desc="Sentenizing", total=len(premise)):
            premise_sent = sent_tokenize(one_pre)  # [each.text for each in self.spacy(one_pre).sents]
            hypo_sent = sent_tokenize(one_hypo)  # [each.text for each in self.spacy(one_hypo).sents]
            premise_sents.extend(premise_sent)
            premise_index.append(len(premise_sents))

            hypo_sents.extend(hypo_sent)
            hypo_index.append(len(hypo_sents))

        all_sents = premise_sents + hypo_sents
        embeddings = []
        with torch.no_grad():
            for batch in tqdm(self.chunks(all_sents, self.batch_size), total=int(len(all_sents)/self.batch_size), desc="Evaluating by SimCSE"):
                inputs = self.tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to(self.device)
                embeddings.append(self.model(**inputs, output_hidden_states=True, return_dict=True).pooler_output)
            embeddings = torch.cat(embeddings)

        assert len(premise_index) == len(hypo_index)
        for i in range(len(premise_index)-1):
            premise_embeddings = embeddings[premise_index[i]: premise_index[i+1]]
            hypo_embeddings = embeddings[len(premise_sents)+hypo_index[i]:len(premise_sents)+hypo_index[i+1]]
            cos_sim = cosine_similarity(premise_embeddings.cpu(), hypo_embeddings.cpu())
            score_p = cos_sim.max(axis=0).mean()
            score_r = cos_sim.max(axis=1).mean()
            score_f = 2 * score_p * score_r / (score_p + score_r)
            output_scores.append(score_f)

        return torch.Tensor(output_scores), torch.Tensor(output_scores), None

    def chunks(self, lst, n):
        """Yield successive n-sized chunks from lst."""
        for i in range(0, len(lst), n):
            yield lst[i:i + n]

class BleurtScorer():
    def __init__(self, checkpoint) -> None:
        self.checkpoint = checkpoint

        from bleurt import score
        # BLEURT-20 can also be switched to other checkpoints to reduce runtime
        # No available API to specify the CUDA device
        self.model = score.BleurtScorer(self.checkpoint)

    def scorer(self, premise: list, hypo: list):
        assert len(premise) == len(hypo)

        output_scores = self.model.score(references=premise, candidates=hypo, batch_size=8)
        output_scores = [s for s in output_scores]
        return torch.Tensor(output_scores), torch.Tensor(output_scores), torch.Tensor(output_scores)

class BertScoreScorer():
    def __init__(self, model_type, metric, device, batch_size) -> None:
        self.model_type = model_type
        self.device = device
        self.metric = metric
        self.batch_size = batch_size

        from bert_score import score
        self.model = score

    def scorer(self, premise: list, hypo: list):
        assert len(premise) == len(hypo)

        precision, recall, f1 = self.model(premise, hypo, model_type=self.model_type, lang='en', rescale_with_baseline=True, verbose=True, device=self.device, batch_size=self.batch_size)

        f1 = [f for f in f1]
        precision = [p for p in precision]
        recall = [r for r in recall]

        if self.metric == 'f1':
            return torch.Tensor(f1), torch.Tensor(f1), None
        elif self.metric == 'precision':
            return torch.Tensor(precision), torch.Tensor(precision), None
        elif self.metric == 'recall':
            return torch.Tensor(recall), torch.Tensor(recall), None
        else:
            raise ValueError("metric type not in f1, precision or recall.")

class BartScoreScorer():
    def __init__(self, checkpoint, device) -> None:
        self.checkpoint = checkpoint
        self.device = device
        import os, sys
        sys.path.append('baselines/BARTScore')
        from bart_score import BARTScorer
        self.model = BARTScorer(device=self.device, checkpoint=self.checkpoint)

    def scorer(self, premise: list, hypo: list):
        assert len(premise) == len(hypo)

        output_scores = self.model.score(premise, hypo, batch_size=4)
        normed_score = torch.exp(torch.Tensor(output_scores))

        return normed_score, normed_score, normed_score

### Below are baselines in SummaC
### MNLI, NER, FactCC, DAE, FEQA, QuestEval, SummaC-ZS, SummaC-Conv
class MNLIScorer():
    def __init__(self, model="roberta-large-mnli", device='cuda:0', batch_size=32) -> None:
        from transformers import AutoTokenizer, AutoModelForSequenceClassification
        self.tokenizer = AutoTokenizer.from_pretrained(model)
        self.model = AutoModelForSequenceClassification.from_pretrained(model).to(device)
        self.device = device
        self.softmax = nn.Softmax(dim=-1)
        self.batch_size = batch_size

    def scorer(self, premise: list, hypo: list):
        if isinstance(premise, str) and isinstance(hypo, str):
            premise = [premise]
            hypo = [hypo]

        batch = self.batch_tokenize(premise, hypo)
        output_score_tri = []

        for mini_batch in tqdm(batch, desc="Evaluating MNLI"):
        # for mini_batch in batch:
            mini_batch = mini_batch.to(self.device)
            with torch.no_grad():
                model_output = self.model(**mini_batch)
                model_output_tri = model_output.logits
                model_output_tri = self.softmax(model_output_tri).cpu()

            output_score_tri.append(model_output_tri[:,2])

        output_score_tri = torch.cat(output_score_tri)

        return output_score_tri, output_score_tri, output_score_tri

    def batch_tokenize(self, premise, hypo):
        """
        input premise and hypos are lists
        """
        assert isinstance(premise, list) and isinstance(hypo, list)
        assert len(premise) == len(hypo), "premise and hypo should be in the same length."

        batch = []
        for mini_batch_pre, mini_batch_hypo in zip(self.chunks(premise, self.batch_size), self.chunks(hypo, self.batch_size)):
            try:
                mini_batch = self.tokenizer(mini_batch_pre, mini_batch_hypo, truncation='only_first', padding='max_length', max_length=self.tokenizer.model_max_length, return_tensors='pt')
            except:
                warning('text_b too long...')
                mini_batch = self.tokenizer(mini_batch_pre, mini_batch_hypo, truncation=True, padding='max_length', max_length=self.tokenizer.model_max_length, return_tensors='pt')
            batch.append(mini_batch)

        return batch

    def chunks(self, lst, n):
        """Yield successive n-sized chunks from lst."""
        for i in range(0, len(lst), n):
            yield lst[i:i + n]

class NERScorer():
    def __init__(self) -> None:
        import os, sys
        sys.path.append('baselines/summac/summac')
        from model_guardrails import NERInaccuracyPenalty
        self.ner = NERInaccuracyPenalty()

    def scorer(self, premise, hypo):
        score_return = self.ner.score(premise, hypo)['scores']
        oppo_score = [float(not each) for each in score_return]

        tensor_score = torch.tensor(oppo_score)

        return tensor_score, tensor_score, tensor_score
class UniEvalScorer():
    def __init__(self, task='fact', device='cuda:0') -> None:
        import os, sys
        sys.path.append('baselines/UniEval')
        from metric.evaluator import get_evaluator

        self.evaluator = get_evaluator(task, device=device)

    def scorer(self, premise, hypo):
        from utils import convert_to_json
        # Prepare data for pre-trained evaluators
        data = convert_to_json(output_list=hypo, src_list=premise)
        # Initialize evaluator for a specific task

        # Get factual consistency scores
        eval_scores = self.evaluator.evaluate(data, print_result=True)
        score_list = [each['consistency'] for each in eval_scores]

        return torch.tensor(score_list), torch.tensor(score_list), torch.tensor(score_list)

class FEQAScorer():
    def __init__(self) -> None:
        import os, sys
        sys.path.append('baselines/feqa')
        import benepar
        import nltk

        benepar.download('benepar_en3')
        nltk.download('stopwords')

        from feqa import FEQA
        self.feqa_model = FEQA(squad_dir=os.path.abspath('baselines/feqa/qa_models/squad1.0'), bart_qa_dir=os.path.abspath('baselines/feqa/bart_qg/checkpoints/'), use_gpu=True)

    def scorer(self, premise, hypo):
        eval_score = self.feqa_model.compute_score(premise, hypo, aggregate=False)

        return torch.tensor(eval_score), torch.tensor(eval_score), torch.tensor(eval_score)


class QuestEvalScorer():
    def __init__(self) -> None:
        import os, sys
        sys.path.append('baselines/QuestEval')
        from questeval.questeval_metric import QuestEval
        self.questeval = QuestEval(no_cuda=False)

    def scorer(self, premise, hypo):
        score = self.questeval.corpus_questeval(
            hypothesis=hypo,
            sources=premise
        )
        final_score = score['ex_level_scores']

        return torch.tensor(final_score), torch.tensor(final_score), torch.tensor(final_score)

class QAFactEvalScorer():
    def __init__(self, model_folder, device='cuda:0') -> None:
        import os, sys
        sys.path.append('baselines/QAFactEval')
        sys.path.append(os.path.abspath('baselines/qaeval/'))
        from qafacteval import QAFactEval
        kwargs = {"cuda_device": int(device.split(':')[-1]), "use_lerc_quip": True, \
                "verbose": True, "generation_batch_size": 32, \
                "answering_batch_size": 32, "lerc_batch_size": 8}

        self.metric = QAFactEval(
            lerc_quip_path=f"{model_folder}/quip-512-mocha",
            generation_model_path=f"{model_folder}/generation/model.tar.gz",
            answering_model_dir=f"{model_folder}/answering",
            lerc_model_path=f"{model_folder}/lerc/model.tar.gz",
            lerc_pretrained_model_path=f"{model_folder}/lerc/pretraining.tar.gz",
            **kwargs
        )
    def scorer(self, premise, hypo):
        results = self.metric.score_batch_qafacteval(premise, [[each] for each in hypo], return_qa_pairs=True)
        score = [result[0]['qa-eval']['lerc_quip'] for result in results]
        return torch.tensor(score), torch.tensor(score), torch.tensor(score)

class MoverScorer():
    def __init__(self) -> None:
        pass

class BERTScoreFFCIScorer():
    def __init__(self) -> None:
        pass

class DAEScorer():
    def __init__(self, model_dir, device=0) -> None:
        import os, sys
        sys.path.insert(0, "baselines/factuality-datasets/")
        from evaluate_generated_outputs import daefact
        self.dae = daefact(model_dir, model_type='electra_dae', gpu_device=device)

    def scorer(self, premise, hypo):
        return_score = torch.tensor(self.dae.score_multi_doc(premise, hypo))

        return return_score, return_score, return_score

class SummaCScorer():
    def __init__(self, summac_type='conv', device='cuda:0') -> None:
        self.summac_type = summac_type
        import os, sys
        sys.path.append("baselines/summac")
        from summac.model_summac import SummaCZS, SummaCConv

        if summac_type == 'conv':
            self.model = SummaCConv(models=["vitc"], bins='percentile', granularity="sentence", nli_labels="e", device=device, start_file="default", agg="mean")
        elif summac_type == 'zs':
            self.model = SummaCZS(granularity="sentence", model_name="vitc", device=device)  # If you have a GPU: switch to: device="cuda"

    def scorer(self, premise, hypo):
        assert len(premise) == len(hypo)
        scores = self.model.score(premise, hypo)['scores']
        return_score = torch.tensor(scores)

        return return_score, return_score, return_score

class FactCCScorer():
    def __init__(self, script_path, test_data_path, result_path) -> None:
        self.script_path = script_path
        self.result_path = result_path
        self.test_data_path = test_data_path
    def scorer(self, premise, hypo):
        import subprocess
        import pickle

        self.generate_json_file(premise, hypo)
        subprocess.call(f"sh {self.script_path}", shell=True)
        print("Finishing FactCC")
        results = pickle.load(open(self.result_path, 'rb'))
        results = [-each+1 for each in results]

        return torch.tensor(results), torch.tensor(results), torch.tensor(results)

    def generate_json_file(self, premise, hypo):
        output = []
        assert len(premise) == len(hypo)
        i = 0
        for one_premise, one_hypo in zip(premise, hypo):
            example = dict()
            example['id'] = i
            example['text'] = one_premise
            example['claim'] = one_hypo
            example['label'] = 'CORRECT'

            i += 1
            output.append(example)
        with open(self.test_data_path, 'w', encoding='utf8') as f:
            for each in output:
                json.dump(each, f, ensure_ascii=False)
                f.write('\n')

class BLANCScorer():
    def __init__(self, device='cuda', batch_size=64) -> None:
        from blanc import BlancHelp, BlancTune
        self.blanc_help = BlancHelp(device=device, inference_batch_size=batch_size)


    def scorer(self, premise, hypo):
        score = self.blanc_help.eval_pairs(premise, hypo)

        return_score = torch.tensor(score)

        return return_score, return_score, return_score


class BLEUScorer():
    def __init__(self, n_grams=1) -> None:
        self.n_grams = n_grams
        self.n_gram_map = {
            1: (1,0,0,0),
            2: (0.5,0.5,0,0),
            3: (1./3,1./3,1./3,0),
            4: (0.25,0.25,0.25,0.25)
        }

    def scorer(self, premise, hypo):
        from nltk.translate.bleu_score import sentence_bleu
        assert len(premise) == len(hypo), "premise and hypothesis should be the same length!"

        output_score = []

        for one_pre, one_hypo in tqdm(zip(premise, hypo), desc=f"Evaluating BLEU-{self.n_grams}", total=len(premise)):
            scores = []
            pre_sents = sent_tokenize(one_pre)
            references = [[each for each in sent.split()] for sent in pre_sents]
            for hypo_sent in sent_tokenize(one_hypo):
                hypothesis = [each for each in hypo_sent.split()]
                scores.append(sentence_bleu(references=references, hypothesis=hypothesis, weights=self.n_gram_map[self.n_grams]))
            output_score.append(sum(scores)/len(scores) if len(scores)>0 else 0.)

        return torch.tensor(output_score), torch.tensor(output_score), torch.tensor(output_score)

class ROUGEScorer():
|
419 |
+
def __init__(self, rouge_type='1') -> None:
|
420 |
+
from rouge import Rouge
|
421 |
+
self.rouge = Rouge()
|
422 |
+
self.rouge_type = rouge_type
|
423 |
+
|
424 |
+
def scorer(self, premise, hypo):
|
425 |
+
|
426 |
+
assert len(premise) == len(hypo), "premise and hypothesis should be the same length!"
|
427 |
+
|
428 |
+
output_score = []
|
429 |
+
|
430 |
+
for one_pre, one_hypo in tqdm(zip(premise, hypo), desc=f"Evaluating ROUGE-{self.rouge_type}", total=len(premise)):
|
431 |
+
scores = []
|
432 |
+
for pre_sent in sent_tokenize(one_pre):
|
433 |
+
for hypo_sent in sent_tokenize(one_hypo):
|
434 |
+
try:
|
435 |
+
scores.append(self.rouge.get_scores(pre_sent, hypo_sent)[0][f"rouge-{self.rouge_type}"]['f'])
|
436 |
+
except:
|
437 |
+
if len(pre_sent.strip()) == 0:
|
438 |
+
print('premise sent is empty')
|
439 |
+
elif len(hypo_sent.strip()) == 0:
|
440 |
+
print('hypo sent is empty')
|
441 |
+
scores.append(0.0)
|
442 |
+
scores = np.array(scores)
|
443 |
+
scores = scores.reshape((len(sent_tokenize(one_pre)), len(sent_tokenize(one_hypo))))
|
444 |
+
scores = scores.max(axis=0).mean()
|
445 |
+
output_score.append(scores.item())
|
446 |
+
|
447 |
+
return torch.tensor(output_score), torch.tensor(output_score), torch.tensor(output_score)
|
448 |
+
|
449 |
+
|
450 |
+
class GPTScoreScorer():
|
451 |
+
def __init__(self, api_key, gpt_model='davinci003') -> None:
|
452 |
+
import os, sys
|
453 |
+
sys.path.append('../BaselineForNLGEval/GPTScore')
|
454 |
+
from gpt3_score import gpt3score
|
455 |
+
|
456 |
+
self.gpt3score = gpt3score
|
457 |
+
self.api_key = api_key
|
458 |
+
self.gpt_model = gpt_model
|
459 |
+
|
460 |
+
self.consistency_prefix = "Generate factually consistent summary for the following text: "
|
461 |
+
self.consistency_suffix = " \n\nTl;dr "
|
462 |
+
|
463 |
+
|
464 |
+
def scorer(self, premise: list, hypothesis: list):
|
465 |
+
assert len(premise) == len(hypothesis)
|
466 |
+
output_score = []
|
467 |
+
for p, h in tqdm(zip(premise, hypothesis), total=len(premise), desc="Evaluating GPTScore"):
|
468 |
+
score = self.gpt3score(input=self.consistency_prefix + p + self.consistency_suffix, output=h, gpt3model=self.gpt_model, api_key=self.api_key)
|
469 |
+
output_score.append(score)
|
470 |
+
|
471 |
+
output_score = torch.tensor(output_score)
|
472 |
+
|
473 |
+
return None, output_score, None
|
474 |
+
|
475 |
+
class ChatGPTLuo2023Scorer():
|
476 |
+
def __init__(self, task, api_key, chat_model='gpt-3.5-turbo') -> None:
|
477 |
+
openai.api_key = api_key
|
478 |
+
assert isinstance(task, list) and len(task) == 1
|
479 |
+
|
480 |
+
self.task = task[0]
|
481 |
+
self.chat_model = chat_model
|
482 |
+
self.instruct = """Score the following summary given the corresponding article with respect to consistency from 1 to 10. Note that consistency measures how much information included in the summary is present in the source article. 10 points indicate the summary contains only statements that are entailed by the source document."""
|
483 |
+
|
484 |
+
def scorer(self, premise: list, hypothesis: list):
|
485 |
+
import time
|
486 |
+
assert len(premise) == len(hypothesis)
|
487 |
+
output_score = []
|
488 |
+
i = -1
|
489 |
+
|
490 |
+
for p, h in tqdm(zip(premise, hypothesis), total=len(premise), desc="Evaluating ChatGPTLuo2023"):
|
491 |
+
i += 1
|
492 |
+
if i <= -1: continue
|
493 |
+
|
494 |
+
attempt = 0
|
495 |
+
max_attempt = 5
|
496 |
+
while attempt < max_attempt:
|
497 |
+
try:
|
498 |
+
response = openai.ChatCompletion.create(
|
499 |
+
model=self.chat_model,
|
500 |
+
messages=[
|
501 |
+
# {"role": "system", "content": "You are a helpful assistant."},
|
502 |
+
{"role": "user", "content": f"""Score the following summary given the corresponding article with respect to consistency from 1 to 10. Note that consistency measures how much information included in the summary is present in the source article. 10 points indicate the summary contains only statements that are entailed by the source document.
|
503 |
+
|
504 |
+
Summary: {h}
|
505 |
+
|
506 |
+
Article: {p} """},
|
507 |
+
],
|
508 |
+
temperature=0,
|
509 |
+
max_tokens=10
|
510 |
+
)
|
511 |
+
res_content = response['choices'][0]['message']['content']
|
512 |
+
break
|
513 |
+
except:
|
514 |
+
attempt += 1
|
515 |
+
print("openai api failed")
|
516 |
+
if max_attempt == attempt:
|
517 |
+
print("maximum failed attempts reached. exiting...")
|
518 |
+
exit()
|
519 |
+
json.dump({i: res_content}, open(f'exp_results/nlg_eval_fact/baselines/ChatGPTLuo2023-output/{self.task}.json', 'a'))
|
520 |
+
with open(f'exp_results/nlg_eval_fact/baselines/ChatGPTLuo2023-output/{self.task}.json', 'a') as f:
|
521 |
+
f.write('\n')
|
522 |
+
|
523 |
+
try:
|
524 |
+
score = int(res_content)
|
525 |
+
except:
|
526 |
+
print("unknown score")
|
527 |
+
score = 0.0
|
528 |
+
output_score.append(score)
|
529 |
+
# time.sleep(1)
|
530 |
+
|
531 |
+
output_score = torch.tensor(output_score)
|
532 |
+
|
533 |
+
return None, output_score, None
|
534 |
+
|
535 |
+
class ChatGPTGao2023Scorer():
|
536 |
+
def __init__(self, task, api_key, chat_model='gpt-3.5-turbo') -> None:
|
537 |
+
openai.api_key = api_key
|
538 |
+
assert isinstance(task, list) and len(task) == 1
|
539 |
+
|
540 |
+
self.task = task[0]
|
541 |
+
self.chat_model = chat_model
|
542 |
+
|
543 |
+
def scorer(self, premise: list, hypothesis: list):
|
544 |
+
import time
|
545 |
+
assert len(premise) == len(hypothesis)
|
546 |
+
output_score = []
|
547 |
+
i = -1
|
548 |
+
|
549 |
+
for p, h in tqdm(zip(premise, hypothesis), total=len(premise), desc="Evaluating ChatGPTGao2023"):
|
550 |
+
i += 1
|
551 |
+
if i <= -1: continue
|
552 |
+
|
553 |
+
attempt = 0
|
554 |
+
max_attempt = 5
|
555 |
+
while attempt < max_attempt:
|
556 |
+
try:
|
557 |
+
response = openai.ChatCompletion.create(
|
558 |
+
model=self.chat_model,
|
559 |
+
messages=[
|
560 |
+
# {"role": "system", "content": "You are a human annotator that rates the quality of summaries"},
|
561 |
+
# {"role": "user", "content": f"""Imagine you are a human annotator now. You will evaluate the quality of summaries written for a news article. Please follow these steps:\n\n 1. Carefully read the news article, and be aware of the information it contains.\n 2. Read the proposed summary.\n 3. Rate the summary on four dimensions: relevance, consistency, fluency, and coherence. You should rate on a scale from 1 (worst) to 5 (best).\n\n Definitions are as follows:\n Relevance: The rating measures how well the summary captures the key points of the article. Consider whether all and only the important aspects are contained in the summary.\n Consistency: The rating measures whether the facts in the summary are consistent with the facts in the original article. Consider whether the summary does reproduce all facts accurately and does not make up untrue information.\n Fluency: This rating measures the quality of individual sentences, whether they are well-written and grammatically correct. Consider the quality of individual sentences.\n Coherence: The rating measures the quality of all sentences collectively, to fit together and sound natural. Consider the quality of the summary as a whole.\n\n The article and the summary are given below:\n Article: {p}\n Summary: {h}"""},
|
562 |
+
{"role": "user", "content": f"""Evaluate the quality of summaries written for a news article. Rate each summary on four dimensions: relevance, faithfulness, fluency, and coherence. You should rate on a scale from 1 (worst) to 5 (best).\n\n Article: {p}\n Summary: {h}"""},
|
563 |
+
],
|
564 |
+
temperature=0,
|
565 |
+
# max_tokens=10
|
566 |
+
)
|
567 |
+
res_content = response['choices'][0]['message']['content']
|
568 |
+
break
|
569 |
+
except:
|
570 |
+
attempt += 1
|
571 |
+
print("openai api failed")
|
572 |
+
if max_attempt == attempt:
|
573 |
+
print("maximum failed attempts reached. exiting...")
|
574 |
+
exit()
|
575 |
+
json.dump({i: res_content}, open(f'exp_results/nlg_eval_fact/baselines/ChatGPTGao2023-output/{self.task}.json', 'a'))
|
576 |
+
with open(f'exp_results/nlg_eval_fact/baselines/ChatGPTGao2023-output/{self.task}.json', 'a') as f:
|
577 |
+
f.write('\n')
|
578 |
+
|
579 |
+
try:
|
580 |
+
score = int(res_content)
|
581 |
+
except:
|
582 |
+
print("unknown score")
|
583 |
+
score = 0.0
|
584 |
+
output_score.append(score)
|
585 |
+
# time.sleep(1)
|
586 |
+
|
587 |
+
output_score = torch.tensor(output_score)
|
588 |
+
|
589 |
+
return None, output_score, None
|
590 |
+
|
591 |
+
class ChatGPTYiChen2023Scorer():
|
592 |
+
def __init__(self, task, api_key, chat_model='gpt-3.5-turbo') -> None:
|
593 |
+
### Explicit score by ChatGPT
|
594 |
+
openai.api_key = api_key
|
595 |
+
assert isinstance(task, list) and len(task) == 1
|
596 |
+
|
597 |
+
self.task = task[0]
|
598 |
+
self.chat_model = chat_model
|
599 |
+
|
600 |
+
def scorer(self, premise: list, hypothesis: list):
|
601 |
+
import time
|
602 |
+
assert len(premise) == len(hypothesis)
|
603 |
+
output_score = []
|
604 |
+
i = -1
|
605 |
+
|
606 |
+
for p, h in tqdm(zip(premise, hypothesis), total=len(premise), desc="Evaluating ChatGPTYiChen2023"):
|
607 |
+
i += 1
|
608 |
+
if i <= -1: continue
|
609 |
+
|
610 |
+
attempt = 0
|
611 |
+
max_attempt = 5
|
612 |
+
while attempt < max_attempt:
                try:
                    response = openai.ChatCompletion.create(
                        model=self.chat_model,
                        messages=[
                            # {"role": "system", "content": "You are a human annotator that rates the quality of summaries"},
                            # {"role": "user", "content": f"""Imagine you are a human annotator now. You will evaluate the quality of summaries written for a news article. Please follow these steps:\n\n 1. Carefully read the news article, and be aware of the information it contains.\n 2. Read the proposed summary.\n 3. Rate the summary on four dimensions: relevance, consistency, fluency, and coherence. You should rate on a scale from 1 (worst) to 5 (best).\n\n Definitions are as follows:\n Relevance: The rating measures how well the summary captures the key points of the article. Consider whether all and only the important aspects are contained in the summary.\n Consistency: The rating measures whether the facts in the summary are consistent with the facts in the original article. Consider whether the summary does reproduce all facts accurately and does not make up untrue information.\n Fluency: This rating measures the quality of individual sentences, whether they are well-written and grammatically correct. Consider the quality of individual sentences.\n Coherence: The rating measures the quality of all sentences collectively, to fit together and sound natural. Consider the quality of the summary as a whole.\n\n The article and the summary are given below:\n Article: {p}\n Summary: {h}"""},
                            {"role": "user", "content": f"""Score the following storyline given the beginning of the story on a continual scale from 0 (worst) to 100 (best), where score of 0 means "The storyline makes no sense and is totally not understandable" and score of 100 means "The storyline is perfect-written and highly consistent with the given beginning of the story". \n\n The beginning of the story: {p} \n\n Storyline: {h} \n\n Score: """},
                        ],
                        temperature=0,
                        # max_tokens=10
                    )
                    res_content = response['choices'][0]['message']['content']
                    break
                except Exception:
                    attempt += 1
                    print("openai api failed")
                    if max_attempt == attempt:
                        print("maximum failed attempts reached. exiting...")
                        exit()
            # append the raw response (plus a newline) to the per-task output file
            with open(f'exp_results/nlg_eval_fact/baselines/ChatGPTYiChen2023-output/{self.task}.json', 'a') as f:
                json.dump({i: res_content}, f)
                f.write('\n')

            try:
                score = int(res_content)
            except Exception:
                print("unknown score")
                score = 0.0
            output_score.append(score)
            # time.sleep(1)

        output_score = torch.tensor(output_score)

        return None, output_score, None


class ChatGPTShiqiChen2023Scorer():
    def __init__(self, task, api_key, chat_model='gpt-3.5-turbo') -> None:
        ### Explicit score by ChatGPT
        openai.api_key = api_key
        assert isinstance(task, list) and len(task) == 1

        self.task = task[0]
        self.chat_model = chat_model

    def scorer(self, premise: list, hypothesis: list):
        import time
        assert len(premise) == len(hypothesis)
        output_score = []
        i = -1

        for p, h in tqdm(zip(premise, hypothesis), total=len(premise), desc="Evaluating ChatGPTShiqiChen2023"):
            i += 1
            if i <= -1: continue
            hypo_sents = sent_tokenize(h)
            hypo_sents = ' \n '.join([f"{i+1}. "+each for i, each in enumerate(hypo_sents)])
            attempt = 0
            max_attempt = 5
            while attempt < max_attempt:
                try:
                    response = openai.ChatCompletion.create(
                        model=self.chat_model,
                        messages=[
                            # {"role": "system", "content": "You are a human annotator that rates the quality of summaries"},
                            # {"role": "user", "content": f"""Imagine you are a human annotator now. You will evaluate the quality of summaries written for a news article. Please follow these steps:\n\n 1. Carefully read the news article, and be aware of the information it contains.\n 2. Read the proposed summary.\n 3. Rate the summary on four dimensions: relevance, consistency, fluency, and coherence. You should rate on a scale from 1 (worst) to 5 (best).\n\n Definitions are as follows:\n Relevance: The rating measures how well the summary captures the key points of the article. Consider whether all and only the important aspects are contained in the summary.\n Consistency: The rating measures whether the facts in the summary are consistent with the facts in the original article. Consider whether the summary does reproduce all facts accurately and does not make up untrue information.\n Fluency: This rating measures the quality of individual sentences, whether they are well-written and grammatically correct. Consider the quality of individual sentences.\n Coherence: The rating measures the quality of all sentences collectively, to fit together and sound natural. Consider the quality of the summary as a whole.\n\n The article and the summary are given below:\n Article: {p}\n Summary: {h}"""},
                            {"role": "user", "content": f"""Source Document: \n {p} \n\n Q: Can the following statement be inferred from the above document? Yes or No?\n {hypo_sents} \n A: 1. """},
                        ],
                        temperature=0,
                        # max_tokens=10
                    )
                    res_content = response['choices'][0]['message']['content']
                    break
                except Exception:
                    attempt += 1
                    print("openai api failed")
                    if max_attempt == attempt:
                        print("maximum failed attempts reached. exiting...")
                        exit()
            # append the raw response (plus a newline) to the per-task output file
            with open(f'exp_results/nlg_eval_fact/baselines/ChatGPTShiqiChen2023-output/{self.task}.json', 'a') as f:
                json.dump({i: res_content}, f)
                f.write('\n')

            try:
                score = int(res_content)
            except Exception:
                print("unknown score")
                score = 0.0
            output_score.append(score)
            # time.sleep(1)

        output_score = torch.tensor(output_score)

        return None, output_score, None
alignscore/benchmark.py
ADDED
@@ -0,0 +1,494 @@
from logging import warning
from evaluate import Evaluator, ALL_TASKS
from baselines import *
from alignscore.inference import Inferencer
import time
import json
import os
from argparse import ArgumentParser

SAVE_ALL_TABLES = True
SAVE_AND_PRINT_TIMER = False

class Timer():
    def __init__(self) -> None:
        self.t0 = time.time()
        self.save_path = 'exp_results/time.json'

    def finish(self, display_name):
        t1 = time.time()
        time_pass = t1 - self.t0
        if SAVE_AND_PRINT_TIMER:
            print(f"Evaluator {display_name} finished in {time_pass} secs.")
            with open(self.save_path, 'a', encoding='utf8') as f:
                json.dump({display_name: time_pass}, f)
                f.write('\n')


def eval_ctc(model_type, tasks=ALL_TASKS):
    ctc_scorer = CTCScorer(model_type)
    evaluator = Evaluator(eval_tasks=tasks, align_func=ctc_scorer.score, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/CTC-{model_type}"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"CTC-{model_type}")

def eval_simcse(model_type, device, tasks=ALL_TASKS):
    simcse_scorer = SimCSEScorer(model_type, device)
    evaluator = Evaluator(eval_tasks=tasks, align_func=simcse_scorer.score, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/{model_type.split('/')[-1]}_f"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"{model_type.split('/')[-1]}_f")

def eval_bleurt(checkpoint, tasks=ALL_TASKS):
    bleurt_scorer = BleurtScorer(checkpoint)
    evaluator = Evaluator(eval_tasks=tasks, align_func=bleurt_scorer.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/BLEURT"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"BLEURT")

def eval_bertscore(model_type, device, batch_size, tasks=ALL_TASKS):
    bertscore_scorer = BertScoreScorer(model_type=model_type, metric='f1', device=device, batch_size=batch_size)
    evaluator = Evaluator(eval_tasks=tasks, align_func=bertscore_scorer.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/bertscore_{model_type.replace('/', '-')}_f"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"bertscore_{model_type.replace('/', '-')}_f")

def eval_bartscore(checkpoint, device, tasks=ALL_TASKS):
    bartscore_scorer = BartScoreScorer(checkpoint, device)
    evaluator = Evaluator(eval_tasks=tasks, align_func=bartscore_scorer.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/bartscore-{checkpoint.replace('/','-')}"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"bartscore-{checkpoint.replace('/','-')}")

### Below are Baselines for SummaC
def eval_mnli(model="roberta-large-mnli", device='cuda:0', tasks=ALL_TASKS):
    mnli_scorer = MNLIScorer(model=model, device=device)
    evaluator = Evaluator(eval_tasks=tasks, align_func=mnli_scorer.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/mnli-{model}"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"mnli-{model}")

def eval_ner(tasks=ALL_TASKS):
    ner_scorer = NERScorer()
    evaluator = Evaluator(eval_tasks=tasks, align_func=ner_scorer.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/NER"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"NER")

def eval_unieval(tasks=ALL_TASKS, device='cuda:0'):
    unieval = UniEvalScorer(task='fact', device=device)
    evaluator = Evaluator(eval_tasks=tasks, align_func=unieval.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/UniEval"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"UniEval")

def eval_feqa(tasks=ALL_TASKS):
    feqa = FEQAScorer()
    evaluator = Evaluator(eval_tasks=tasks, align_func=feqa.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/FEQA"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"FEQA")

def eval_questeval(tasks=ALL_TASKS):
    questeval = QuestEvalScorer()
    evaluator = Evaluator(eval_tasks=tasks, align_func=questeval.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/QuestEval"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"QuestEval")

def eval_qafacteval(tasks=ALL_TASKS, device='cuda:0'):
    import os, sys
    warning("using conda env qaeval!!!")
    qafacteval = QAFactEvalScorer(device=device, model_folder=os.path.abspath('../BaselineForNLGEval/QAFactEval/models'))
    evaluator = Evaluator(eval_tasks=tasks, align_func=qafacteval.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/QAFactEval"
    evaluator.evaluate()

def eval_dae(tasks=ALL_TASKS, model_dir=None, device=0):
    dae = DAEScorer(model_dir=model_dir, device=device)
    evaluator = Evaluator(eval_tasks=tasks, align_func=dae.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/DAE"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"DAE")

def eval_bleu(tasks=ALL_TASKS, n_grams=1):
    bleu = BLEUScorer(n_grams=n_grams)
    evaluator = Evaluator(eval_tasks=tasks, align_func=bleu.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/BLEU-{n_grams}"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"BLEU-{n_grams}")

def eval_rouge(tasks=ALL_TASKS, rouge_type='1'):
    rouge = ROUGEScorer(rouge_type=rouge_type)
    evaluator = Evaluator(eval_tasks=tasks, align_func=rouge.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/ROUGE-{rouge_type}"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"ROUGE-{rouge_type}")

def eval_factcc(script_path, test_data_path, result_path, tasks=ALL_TASKS):
    factcc = FactCCScorer(script_path=script_path, test_data_path=test_data_path, result_path=result_path)
    evaluator = Evaluator(eval_tasks=tasks, align_func=factcc.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/FactCC"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"FactCC")

def eval_blanc(tasks=ALL_TASKS, device='cuda:0', batch_size=64):
    blanc = BLANCScorer(device=device, batch_size=batch_size)
    evaluator = Evaluator(eval_tasks=tasks, align_func=blanc.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/BLANC"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"BLANC")

def eval_summac(tasks=ALL_TASKS, summac_type='conv', device='cuda:0'):
    summac = SummaCScorer(summac_type=summac_type, device=device)
    evaluator = Evaluator(eval_tasks=tasks, align_func=summac.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"baselines/SummaC-{summac_type}"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(f"SummaC-{summac_type}")

def eval_align_nlg(ckpt_path, comment='', base_model='roberta-large', batch_size=32, device='cuda:0', tasks=ALL_TASKS, nlg_eval_mode='nli_sp'):
    align = Inferencer(ckpt_path=ckpt_path, model=base_model, batch_size=batch_size, device=device)
    if 'smart' in nlg_eval_mode:
        align.smart_type = nlg_eval_mode
    else:
        align.nlg_eval_mode = nlg_eval_mode

    evaluator = Evaluator(eval_tasks=tasks, align_func=align.nlg_eval, save_all_tables=SAVE_ALL_TABLES)
    name = f'AlignScore-{nlg_eval_mode}-{base_model}'
    if comment:
        name += '_' + comment
    evaluator.result_save_name = f"align_eval/{name}"

    timer = Timer()
    evaluator.evaluate()
    timer.finish(name)

def eval_gptscore(api_key, gpt_model='davinci003', tasks=ALL_TASKS):
    gptscore = GPTScoreScorer(api_key=api_key, gpt_model=gpt_model)
    evaluator = Evaluator(eval_tasks=tasks, align_func=gptscore.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"nlg_eval_fact/baselines/GPTScore-{gpt_model}"
    evaluator.evaluate()

def eval_chatgptluo2023(api_key, chat_model='gpt-3.5-turbo', tasks=['qags_cnndm']):
    chatgpt = ChatGPTLuo2023Scorer(task=tasks, api_key=api_key, chat_model=chat_model)
    evaluator = Evaluator(eval_tasks=tasks, align_func=chatgpt.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"nlg_eval_fact/baselines/ChatGPTLuo2023-{chat_model}"
    evaluator.evaluate()

def eval_chatgptgao2023(api_key, chat_model='gpt-3.5-turbo', tasks=['qags_cnndm']):
    chatgpt = ChatGPTGao2023Scorer(task=tasks, api_key=api_key, chat_model=chat_model)
    evaluator = Evaluator(eval_tasks=tasks, align_func=chatgpt.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"nlg_eval_fact/baselines/ChatGPTGao2023-{chat_model}"
    evaluator.evaluate()

def eval_chatgptyichen2023(api_key, chat_model='gpt-3.5-turbo', tasks=['qags_cnndm']):
    chatgpt = ChatGPTYiChen2023Scorer(task=tasks, api_key=api_key, chat_model=chat_model)
    evaluator = Evaluator(eval_tasks=tasks, align_func=chatgpt.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"nlg_eval_fact/baselines/ChatGPTYiChen2023-{chat_model}"
    evaluator.evaluate()

def eval_chatgptshiqichen2023(api_key, chat_model='gpt-3.5-turbo', tasks=['qags_cnndm']):
    chatgpt = ChatGPTShiqiChen2023Scorer(task=tasks, api_key=api_key, chat_model=chat_model)
    evaluator = Evaluator(eval_tasks=tasks, align_func=chatgpt.scorer, save_all_tables=SAVE_ALL_TABLES)
    evaluator.result_save_name = f"nlg_eval_fact/baselines/ChatGPTShiqiChen2023-{chat_model}"
    evaluator.evaluate()

def run_benchmarks(args, argument_error):
    os.makedirs('exp_results/baselines', exist_ok=True)
    os.makedirs('exp_results/align_eval', exist_ok=True)

    if args.alignscore:
        if not all((args.alignscore_model, args.alignscore_ckpt, args.alignscore_eval_mode)):
            argument_error('--alignscore-model, --alignscore-ckpt, and --alignscore-eval-mode must be specified to run AlignScore')
        eval_align_nlg(
            nlg_eval_mode=args.alignscore_eval_mode,
            ckpt_path=args.alignscore_ckpt,
            base_model=args.alignscore_model,
            device=args.device, tasks=args.tasks,
            comment=args.alignscore_comment
        )

    if args.ctc:
        if not args.ctc_type:
            argument_error('--ctc-type must be specified to run CTC baseline')
        for type in args.ctc_type:
            eval_ctc(type, tasks=args.tasks)

    if args.simcse:
        if not args.simcse_ckpt:
            argument_error('--simcse-ckpt must be specified to run SimCSE baseline')
        for ckpt in args.simcse_ckpt:
            eval_simcse(ckpt, device=args.device, tasks=args.tasks)

    if args.bleurt:
        if not args.bleurt_ckpt:
            argument_error('--bleurt-ckpt must be specified to run BLEURT baseline')
        eval_bleurt(args.bleurt_ckpt, tasks=args.tasks)

    if args.bertscore:
        if not args.bertscore_ckpt or not args.bertscore_batch_size:
            argument_error('--bertscore-ckpt and --bertscore-batch-size must be specified to run BERTScore baseline')
        for ckpt in args.bertscore_ckpt:
            eval_bertscore(ckpt, device=args.device, tasks=args.tasks, batch_size=args.bertscore_batch_size)

    if args.bartscore:
        if not args.bartscore_ckpt:
            argument_error('--bartscore-ckpt must be specified to run BARTScore baseline')
        for ckpt in args.bartscore_ckpt:
            eval_bartscore(ckpt, device=args.device, tasks=args.tasks)

    if args.mnli:
        if not args.mnli_ckpt:
            argument_error('--mnli-ckpt must be specified to run MNLI baseline')
        for ckpt in args.mnli_ckpt:
            eval_mnli(model=ckpt, device=args.device, tasks=args.tasks)

    if args.ner:
        eval_ner(tasks=args.tasks)

    if args.unieval:
        eval_unieval(tasks=args.tasks, device=args.device)

    if args.feqa:
        eval_feqa(tasks=args.tasks)

    if args.questeval:
        eval_questeval(tasks=args.tasks)

    if args.qafacteval:
        eval_qafacteval(tasks=args.tasks)

    if args.bleu:
        if not args.bleu_ngram:
            argument_error('--bleu-ngram must be specified to run BLEU baseline')
        for n in args.bleu_ngram:
            eval_bleu(tasks=args.tasks, n_grams=n)

    if args.rouge:
        if not args.rouge_type:
            argument_error('--rouge-type must be specified to run ROUGE baseline')
        for type in args.rouge_type:
            eval_rouge(tasks=args.tasks, rouge_type=type)

    if args.dae:
        if not args.dae_ckpt:
            argument_error('--dae-ckpt must be specified to run DAE baseline')
        eval_dae(tasks=args.tasks, model_dir=os.path.abspath(args.dae_ckpt))

    if args.factcc:
        if not all((args.factcc_script, args.factcc_test_data, args.factcc_result_path)):
            argument_error('--factcc-script, --factcc-test-data, and --factcc-result-path must be specified to run FactCC baseline')
        eval_factcc(
            tasks=args.tasks,
            script_path=os.path.abspath(args.factcc_script),
            test_data_path=os.path.abspath(args.factcc_test_data),
            result_path=os.path.abspath(args.factcc_result_path)
        )

    if args.blanc:
        if not args.blanc_batch_size:
            argument_error('--blanc-batch-size must be specified to run BLANC baseline')
        eval_blanc(tasks=args.tasks, device=args.device, batch_size=args.blanc_batch_size)

    if args.summac:
        if not args.summac_type:
            argument_error('--summac-type must be specified to run SummaC baseline')
        for type in args.summac_type:
            eval_summac(tasks=args.tasks, device=args.device, summac_type=type)


if __name__ == "__main__":
    FACT_EVAL_TASKS = ['summac', 'true', 'xsumfaith', 'summeval', 'qags_xsum', 'qags_cnndm', 'newsroom', 'rank19', 'frank', 'samsum']

    parser = ArgumentParser()
    parser.add_argument('--tasks', nargs='+', type=str, default=FACT_EVAL_TASKS, choices=FACT_EVAL_TASKS)
    parser.add_argument('--device', type=str, default='cuda:0')
    parser.add_argument('--timer', action='store_true', help='Time all metric runs')

    alignscore_parser = parser.add_argument_group('AlignScore')
    alignscore_parser.add_argument('--alignscore', action='store_true', help='Run AlignScore benchmark')
    alignscore_parser.add_argument('--alignscore-model', type=str, choices=['roberta-base', 'roberta-large'])
    alignscore_parser.add_argument('--alignscore-ckpt', type=str)
    alignscore_parser.add_argument(
        '--alignscore-eval-mode',
        type=str,
        choices=['bin', 'bin_sp', 'nli', 'nli_sp', 'reg', 'reg_sp', 'smart-n', 'smart-l'],
        default='nli_sp'
    )
    alignscore_parser.add_argument('--alignscore-comment', type=str, default='')

    ctc_parser = parser.add_argument_group('Baseline - CTC')
    ctc_parser.add_argument('--ctc', action='store_true', help='Run CTC baseline')
    ctc_parser.add_argument(
        '--ctc-type',
        nargs='*',
        type=str,
        choices=['D-cnndm', 'E-roberta', 'R-cnndm'],
        default=['D-cnndm']
    )

    simcse_parser = parser.add_argument_group('Baseline - SimCSE')
    simcse_models = [
        'princeton-nlp/unsup-simcse-bert-base-uncased',
        'princeton-nlp/unsup-simcse-bert-large-uncased',
        'princeton-nlp/unsup-simcse-roberta-base',
        'princeton-nlp/unsup-simcse-roberta-large',
        'princeton-nlp/sup-simcse-bert-base-uncased',
        'princeton-nlp/sup-simcse-bert-large-uncased',
        'princeton-nlp/sup-simcse-roberta-base',
        'princeton-nlp/sup-simcse-roberta-large'
    ]
    simcse_parser.add_argument('--simcse', action='store_true', help='Run SimCSE baseline')
    simcse_parser.add_argument(
        '--simcse-ckpt',
        nargs='*',
        type=str,
        choices=simcse_models,
        default=['princeton-nlp/sup-simcse-roberta-large']
    )

    bleurt_parser = parser.add_argument_group('Baseline - BLEURT')
    bleurt_parser.add_argument('--bleurt', action='store_true', help='Run BLEURT baseline')
    bleurt_parser.add_argument('--bleurt-ckpt', type=str)

    bertscore_parser = parser.add_argument_group('Baseline - BERTScore')
    bertscore_parser.add_argument('--bertscore', action='store_true', help='Run BERTScore baseline')
    bertscore_parser.add_argument(
        '--bertscore-ckpt',
        nargs='*',
        type=str,
        default=['microsoft/deberta-xlarge-mnli']
    )
    bertscore_parser.add_argument('--bertscore-batch-size', type=int, default=16)

    bartscore_parser = parser.add_argument_group(
        'Baseline - BARTScore',
        description='Please clone https://github.com/neulab/BARTScore to baselines/BARTScore.'
    )
    bartscore_parser.add_argument('--bartscore', action='store_true', help='Run BARTScore baseline')
    bartscore_parser.add_argument(
        '--bartscore-ckpt',
        type=str,
        nargs='*',
        default=['facebook/bart-large-cnn']
    )

    mnli_parser = parser.add_argument_group('Baseline - MNLI')
    mnli_parser.add_argument('--mnli', action='store_true', help='Run MNLI baseline')
    mnli_parser.add_argument(
        '--mnli-ckpt',
        nargs='*',
        type=str,
        default=['roberta-large-mnli']
    )

    ner_parser = parser.add_argument_group(
        'Baseline - NER overlap',
        description='Please clone https://github.com/tingofurro/summac to baselines/summac.'
    )
    ner_parser.add_argument('--ner', action='store_true', help='Run NER overlap baseline')

    unieval_parser = parser.add_argument_group(
        'Baseline - UniEval',
        description='Please clone https://github.com/maszhongming/UniEval to baselines/UniEval.'
    )
    unieval_parser.add_argument('--unieval', action='store_true', help='Run UniEval baseline')

    feqa_parser = parser.add_argument_group(
        'Baseline - FEQA',
        description='Please clone https://github.com/esdurmus/feqa to baselines/feqa'
    )
    feqa_parser.add_argument('--feqa', action='store_true', help='Run FEQA baseline')

    questeval_parser = parser.add_argument_group(
        'Baseline - QuestEval',
        description='Please clone https://github.com/ThomasScialom/QuestEval to baselines/QuestEval.'
    )
    questeval_parser.add_argument('--questeval', action='store_true', help='Run QuestEval baseline')

    qafacteval_parser = parser.add_argument_group(
        'Baseline - QAFactEval',
        description='Please clone https://github.com/salesforce/QAFactEval to baselines/QAFactEval.'
    )
    qafacteval_parser.add_argument('--qafacteval', action='store_true', help='Run QAFactEval baseline')

    bleu_parser = parser.add_argument_group('Baseline - BLEU')
    bleu_parser.add_argument('--bleu', action='store_true', help='Run BLEU baseline')
    bleu_parser.add_argument(
        '--bleu-ngram',
        nargs='*',
        type=int,
        choices=[1, 2, 3, 4],
        default=[1, 2, 3, 4]
    )

    rouge_parser = parser.add_argument_group('Baseline - ROUGE')
    rouge_parser.add_argument('--rouge', action='store_true', help='Run ROUGE baseline')
    rouge_parser.add_argument(
        '--rouge-type',
        nargs='*',
        type=str,
        choices=['1', '2', 'l'],
        default=['1', '2', 'l']
    )

    dae_parser = parser.add_argument_group('Baseline - DAE')
    dae_parser.add_argument('--dae', action='store_true', help='Run DAE baseline')
    dae_parser.add_argument('--dae-ckpt', type=str)

    factcc_parser = parser.add_argument_group('Baseline - FactCC')
    factcc_parser.add_argument('--factcc', action='store_true', help='Run FactCC baseline')
    factcc_parser.add_argument('--factcc-script', type=str)
    factcc_parser.add_argument('--factcc-test-data', type=str)
    factcc_parser.add_argument('--factcc-result-path', type=str)

    blanc_parser = parser.add_argument_group('Baseline - BLANC')
    blanc_parser.add_argument('--blanc', action='store_true', help='Run BLANC baseline')
    blanc_parser.add_argument('--blanc-batch-size', type=int, default=64)

    summac_parser = parser.add_argument_group(
        'Baseline - SummaC',
        description='Please clone https://github.com/tingofurro/summac to baselines/summac.'
    )
    summac_parser.add_argument('--summac', action='store_true', help='Run SummaC baseline')
    summac_parser.add_argument('--summac-type', nargs='*', type=str, choices=['conv', 'zs'], default=['conv', 'zs'])

    args = parser.parse_args()
    if args.timer:
        SAVE_AND_PRINT_TIMER = True

    def argument_error(msg):
        parser.error(msg)

    run_benchmarks(args, argument_error)
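For reference, benchmark.py is meant to be run as a command-line driver (for example, something along the lines of python benchmark.py --alignscore --alignscore-model roberta-large --alignscore-ckpt path/to/checkpoint.ckpt --tasks qags_cnndm qags_xsum, per the parser above), but the eval_* helpers can also be called from Python. Below is a minimal sketch, not part of the upload, assuming a trained AlignScore checkpoint is available; the checkpoint path is a placeholder.

import os
from benchmark import eval_align_nlg

# run_benchmarks() normally creates the result directories; do it manually when calling the helper directly
os.makedirs('exp_results/align_eval', exist_ok=True)

eval_align_nlg(
    ckpt_path='checkpoints/AlignScore-large.ckpt',  # placeholder: path to a trained AlignScore checkpoint
    base_model='roberta-large',
    nlg_eval_mode='nli_sp',
    device='cuda:0',
    tasks=['qags_cnndm', 'qags_xsum'],
)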
alignscore/evaluate.py
ADDED
@@ -0,0 +1,1793 @@
from logging import warning
from datasets import load_dataset
from alignscore.inference import Inferencer
import numpy as np
from scipy.stats import pearsonr, kendalltau, spearmanr
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, balanced_accuracy_score, matthews_corrcoef
import pandas as pd
import torch
import json
import pickle
import os

HUGGINGFACE_DATASETS = {
    'stsb': ['glue', 'stsb', 'validation'],
    'mrpc': ['glue', 'mrpc', 'test'],
    'axb': ['super_glue', 'axb', 'test'],
    'axg': ['super_glue', 'axg', 'test'],
    'cb': ['super_glue', 'cb', 'validation'],
    'rte': ['super_glue', 'rte', 'validation'],
    'wnli': ['SetFit/wnli', 'validation'],
    'paws': ['paws', 'labeled_final', 'test'],
    'mnli_matched': ['multi_nli', 'validation_matched'],
    'mnli_mismatched': ['multi_nli', 'validation_mismatched'],
    'nli_fever': ['pietrolesci/nli_fever', 'dev'],
    'doc_nli': ['saattrupdan/doc-nli', 'test'],
    'sem_eval': ['sem_eval_2014_task_1', 'test'],
    'sick': ['sick', 'default', 'test'],
    'race_m': ['race', 'middle', 'test'],
    'race_h': ['race', 'high', 'test'],
    'boolq': ['boolq', 'validation'],
    'anli_1': ['anli', 'test_r1'],
    'anli_2': ['anli', 'test_r2'],
    'anli_3': ['anli', 'test_r3'],
    'snli': ['snli', 'test'],
    'vitaminc': ['tals/vitaminc', 'test'],
    'qqp': ['glue', 'qqp', 'validation'],
    # below are tasks from https://arxiv.org/pdf/2104.14690.pdf
    'sst2': ['SetFit/sst2', 'test'],
    # can't find MR
    'cr': ['SetFit/SentEval-CR', 'test'],
    # can't find MPQA
    'subj': ['SetFit/subj', 'test'],
    # can't find OS
    'imdb': ['SetFit/imdb', 'test'],  # note: I can't confirm if this is the same dataset used in that paper
    # The original dataset is no longer accessible
    'cola': ['glue', 'cola', 'validation'],
    'yelp_efl': ['SetFit/yelp_review_full', 'test'],
    'ag_news': ['SetFit/ag_news', 'test'],
    'trec': ['SetFit/TREC-QC', 'test'],
    'dream': ['dream', 'test'],
    'quartz': ['quartz', 'test'],
    'eraser_multi_rc': ['eraser_multi_rc', 'test'],
    'quail': ['quail', 'challenge'],
    'sciq': ['sciq', 'test'],
    'gap': ['gap', 'test'],
    'qnli': ['glue', 'qnli', 'validation']
}

PICKLE_DATASETS = [
    'newsroom',
    'rank19',
    'bagel',
    'sfhot',
    'sfres'
]

ALL_TASKS = {  # enumerate all possible tasks
    'stsb': 0,  ### which output to use: regression, binary, tri-label
    'sick': 0,
    'race_m': 1,
    'race_h': 1,
    'boolq': 1,
    'anli_1': 2,
    'anli_2': 2,
    'anli_3': 2,
    'snli': 2,
    'vitaminc': 2,
    'mrpc': 1,
    'paws': 1,
    'mnli_matched': 2,
    'mnli_mismatched': 2,
    'sem_eval': 1,
    'summeval': 1,
    'qags_xsum': 1,
    'qags_cnndm': 1,
    'frank': 1,
    'xsumfaith': 1,
    'samsum': 1,
    'yelp': 1,
    'persona_chat': 1,
    'topical_chat': 1,
    'paws_qqp': 1,
    'qqp': 1,
    'newsroom': 1,
    'rank19': 1,
    'bagel': 1,
    'sfhot': 1,
    'sfres': 1,
    'wmt17': 0,
    'wmt18': 0,
    'wmt19': 0,
    'sst2': 1,
    'cr': 1,
    'subj': 1,
    'imdb': 1,
    'cola': 1,
    'yelp_efl': 1,
    'ag_news': 1,
    'trec': 1,
    'axb': 1,
    'axg': 1,
    'cb': 2,
    'rte': 2,
    'wnli': 2,
    'dream': 1,
    'quartz': 1,
    'nli_fever': 2,
    'doc_nli': 1,
    'eraser_multi_rc': 1,
    'quail': 1,
    'sciq': 1,
    'gap': 1,
    'qnli': 1
}

FEW_SHOT_N = 8
FEW_SHOT_SEEDS = [30247, 38252, 29050, 1091, 35554, 25309, 79319, 35079, 35256, 46744]

class Evaluator():
    def __init__(self, eval_tasks, align_func, save_all_tables=False, clean_data=True) -> None:
        self.align_func = align_func
        self.eval_tasks = eval_tasks  # ['stsb', 'paws', ...]
        self.result_save_name = "Default_result_name"
        self.result_tables = []
        self.result_dicts = []
        self.clean_data = clean_data
        self.init_eval_dataset()

        self.should_save_all_tables = save_all_tables
        warning(f"Saving the result is: {self.should_save_all_tables}")

    def init_eval_dataset(self):
        self.dataset = dict()
        for eval_task in self.eval_tasks:
            if eval_task in HUGGINGFACE_DATASETS:
                if len(HUGGINGFACE_DATASETS[eval_task]) == 3:
                    self.dataset[eval_task] = load_dataset(HUGGINGFACE_DATASETS[eval_task][0], HUGGINGFACE_DATASETS[eval_task][1])[HUGGINGFACE_DATASETS[eval_task][2]]
                elif len(HUGGINGFACE_DATASETS[eval_task]) == 2:
                    if isinstance(HUGGINGFACE_DATASETS[eval_task][1], tuple):
                        dataset = load_dataset(HUGGINGFACE_DATASETS[eval_task][0])
                        self.dataset[eval_task] = {split: dataset[split] for split in HUGGINGFACE_DATASETS[eval_task][1]}
                    else:
                        self.dataset[eval_task] = load_dataset(HUGGINGFACE_DATASETS[eval_task][0])[HUGGINGFACE_DATASETS[eval_task][1]]

            elif eval_task == 'paws_qqp':
                self.dataset[eval_task] = pd.read_csv('data/paws_qqp/output/dev_and_test.tsv', sep='\t')
            elif eval_task == 'beir':
                print("beir load by itself")
                self.dataset[eval_task] = "BEIR Benchmark"
            elif eval_task in PICKLE_DATASETS:
                with open(f'data/eval/{eval_task}.pkl', 'rb') as f:
                    self.dataset[eval_task] = pickle.load(f)
            elif 'wmt' in eval_task:
                self.dataset[eval_task] = []
                with open(f'data/eval/{eval_task}_eval.jsonl', 'r', encoding='utf8') as f:
                    for example in f:
                        self.dataset[eval_task].append(json.loads(example))
            elif 'true' == eval_task:
                for each_true_sub in os.listdir('data/eval/true'):
                    if 'qags' in each_true_sub:
                        each_true_sub_name = 'true_' + '_'.join(each_true_sub.split('_')[:2])
                    else:
                        each_true_sub_name = 'true_' + '_'.join(each_true_sub.split('_')[:1])

                    self.dataset[each_true_sub_name] = pd.read_csv(os.path.join('data/eval/true', each_true_sub))
            elif 'summac' == eval_task:
                from summac.benchmark import SummaCBenchmark
                self.summac_validation_set = dict()
                summac_benchmark = SummaCBenchmark(benchmark_folder="./data/eval/summac/benchmark", cut='test')
                for each in summac_benchmark.datasets:
                    summac_dt_name = each['name']
                    self.dataset['summac_'+summac_dt_name] = each['dataset']

                summac_benchmark_valid = SummaCBenchmark(benchmark_folder="./data/eval/summac/benchmark", cut='val')
                for each in summac_benchmark_valid.datasets:
                    summac_dt_name = each['name']
                    self.summac_validation_set['summac_'+summac_dt_name] = each['dataset']
            else:
                f = open(f'data/eval/{eval_task}.json')
                self.dataset[eval_task] = json.load(f)
                f.close()

    def print_result_table(self, table):
        self.result_tables.append(pd.DataFrame(table).to_markdown())
        self.result_dicts.append(table)
        print(self.result_tables[-1])

    def print_all_tables(self):
        print("\n All Evaluation Results:")
        for each in self.result_tables:
            print(each)
            print('='*100)

    def save_all_tables(self):
        with open(f'exp_results/{self.result_save_name}.pkl', 'wb') as f:
            pickle.dump(self.result_dicts, f, protocol=pickle.HIGHEST_PROTOCOL)

    def evaluate(self):
        for each_task in self.dataset:
            eval(f'self.evaluate_{each_task}()')

        if self.should_save_all_tables:
            self.save_all_tables()

    def get_accuracy(self, true_score, pred_score):
        return [accuracy_score(true_score, [m>0.5 for m in pred_score])]

    def get_balanced_accuracy(self, true_score, pred_score, thres=0.5):
        return [balanced_accuracy_score(true_score, [m>thres for m in pred_score])]

    def get_f1(self, true_score, pred_score):
        return [f1_score(true_score, [m>0.5 for m in pred_score])]

    def get_3label_f1(self, true_score, pred_score):
        return [f1_score(true_score, pred_score, average='micro')]

    def get_pearson(self, true_score, pred_score):
        return pearsonr(pred_score, true_score)

    def get_kendalltau(self, true_score, pred_score):
        return kendalltau(pred_score, true_score)

    def get_spearman(self, true_score, pred_score):
        return spearmanr(pred_score, true_score)

    def get_matthews_corr(self, true_score, pred_score):
        return [matthews_corrcoef(true_score, [s>0.5 for s in pred_score])]


    def clean_text(self, context, claims):
        from nltk.tokenize import sent_tokenize

        if not self.clean_data:
            return claims

        word_cases = {token.lower(): token for token in context.strip().split()}

        def clean(text):
            text = ' '.join(word_cases.get(token.lower(), token) for token in text.strip().split())
            text = text.replace('“', '"').replace('”', '"').replace('’', '\'').replace('‘', '\'').replace('`', '\'').replace('-lrb-', '(').replace('-rrb-', ')')
            text = ' '.join(each.strip()[0].capitalize()+each.strip()[1:] for each in sent_tokenize(text))
            return text

        if isinstance(claims, str):
            return clean(claims)

        return [clean(text) for text in claims]


    def evaluate_newsroom(self):
        true_score = []
        true_score_rel = []
        true_score_binary = []
        sent1 = []
        sent2 = []

        for sample in self.dataset['newsroom'].values():
            summaries, informativeness, relevance = zip(*(
                (s['sys_summ'], s['scores']['informativeness'], s['scores']['relevance'])
                for s in sample['sys_summs'].values()
            ))
            cleaned_summaries = self.clean_text(sample['src'], summaries)
            for summary, inf_score, rel_score in zip(cleaned_summaries, informativeness, relevance):
                sent1.append(sample['src'])
                sent2.append(summary)
                true_score.append(inf_score)
                true_score_rel.append(rel_score)
                true_score_binary.append(int(inf_score >= 4))

        pred_score = self.align_func(sent1, sent2)[ALL_TASKS['newsroom']].tolist()

        self.print_result_table({
            'Dataset_name': 'newsroom',
            'Pearson': self.get_pearson(true_score, pred_score),
            'Spearman': self.get_spearman(true_score, pred_score),
            'Kendall': self.get_kendalltau(true_score, pred_score),
            'AUC': roc_auc_score(true_score_binary, pred_score),
            'Pearson_rel': self.get_pearson(true_score_rel, pred_score),
            'Spearman_rel': self.get_spearman(true_score_rel, pred_score),
            'Kendall_rel': self.get_kendalltau(true_score_rel, pred_score),
        })

    def evaluate_rank19(self):
        def chunks(lst, n):
            """Yield successive n-sized chunks from lst."""
            for i in range(0, len(lst), n):
                yield lst[i:i + n]
        true_score = []
        sent1 = []
        sent2 = []

        for example in self.dataset['rank19']:
            for example_summs in self.dataset['rank19'][example]['sys_summs']:
                sent1.append(self.dataset['rank19'][example]['src'])
                sent2.append(self.dataset['rank19'][example]['sys_summs'][example_summs]['sys_summ'])
                true_score.append(self.dataset['rank19'][example]['sys_summs'][example_summs]['scores']['fact'])

        pred_score = self.align_func(sent1, sent2)[ALL_TASKS['rank19']].tolist()
        pred_score_bin = []
        assert len(pred_score) % 2 == 0
        for i, pair in enumerate(chunks(pred_score, 2)):
            pred_score_bin.extend([0, 1] if pair[1] > pair[0] else [1, 0])

        self.print_result_table({
            'Dataset_name': 'rank19',
            'Pearson': self.get_pearson(true_score, pred_score),
            'Spearman': self.get_spearman(true_score, pred_score),
            'Kendall': self.get_kendalltau(true_score, pred_score),
            'Accuracy': self.get_accuracy(true_score, pred_score_bin)[0],
            'AUC': roc_auc_score(true_score, pred_score_bin)
        })

    def evaluate_bagel(self):
        true_score = []
        true_score_binary = []
        sent1 = []
        sent2 = []
        pred_score = []

        for example in self.dataset['bagel']:
            sent1.append(' '.join(self.dataset['bagel'][example]['ref_summs']))
            sent2.append(self.dataset['bagel'][example]['sys_summ'])
            true_score.append(self.dataset['bagel'][example]['scores']['informativeness'])

            if(self.dataset['bagel'][example]['scores']['informativeness'] >= 4.0):
                true_score_binary.append(1)
            else:
                true_score_binary.append(0)
        pred_score = self.align_func(sent1, sent2)[ALL_TASKS['bagel']].tolist()

        self.print_result_table({
            'Dataset_name': 'bagel',
            'Pearson': self.get_pearson(true_score, pred_score),
            'Spearman': self.get_spearman(true_score, pred_score),
            'Kendall': self.get_kendalltau(true_score, pred_score),
            'AUC': roc_auc_score(true_score_binary, pred_score)
        })

    def evaluate_sfhot(self):
        true_score = []
        sent1 = []
        sent2 = []
        pred_score = []

        for example in self.dataset['sfhot']:
            for ref in self.dataset['sfhot'][example]['ref_summs']:
                sent1.append(self.dataset['sfhot'][example]['sys_summ'])
                sent2.append(ref)
            pred_score.append(max(self.align_func(sent1, sent2)[ALL_TASKS['sfhot']].tolist()))
            sent1 = []
            sent2 = []
            if(self.dataset['sfhot'][example]['scores']['quality'] >= 4.0):
                true_score.append(1)
            else:
                true_score.append(0)

        self.print_result_table({
            'Dataset_name': 'sfhot',
            'Pearson': self.get_pearson(true_score, pred_score),
            'Spearman': self.get_spearman(true_score, pred_score),
            'Kendall': self.get_kendalltau(true_score, pred_score),
            'AUC': roc_auc_score(true_score, pred_score)
        })

    def evaluate_sfres(self):
        true_score = []
        sent1 = []
        sent2 = []
        pred_score = []

        for example in self.dataset['sfres']:
            for ref in self.dataset['sfres'][example]['ref_summs']:
                sent1.append(self.dataset['sfres'][example]['sys_summ'])
                sent2.append(ref)
            pred_score.append(max(self.align_func(sent1, sent2)[ALL_TASKS['sfres']].tolist()))
            sent1 = []
            sent2 = []
            if(self.dataset['sfres'][example]['scores']['quality'] >= 4.0):
                true_score.append(1)
            else:
                true_score.append(0)

        self.print_result_table({
            'Dataset_name': 'sfres',
            'Pearson': self.get_pearson(true_score, pred_score),
            'Spearman': self.get_spearman(true_score, pred_score),
            'Kendall': self.get_kendalltau(true_score, pred_score),
            'AUC': roc_auc_score(true_score, pred_score)
        })


    def evaluate_stsb(self):
        true_score = []
        sent1 = []
        sent2 = []
        for example in self.dataset['stsb']:
            sent1.append(example['sentence1'])
            sent2.append(example['sentence2'])
            true_score.append(example['label'])

        pred_score = self.align_func(sent1, sent2)[ALL_TASKS['stsb']].tolist()

        self.print_result_table({
            'Dataset_name': 'stsb',
            'Pearson': self.get_pearson(true_score, pred_score),
            'Spearman': self.get_spearman(true_score, pred_score),
            'Kendall': self.get_kendalltau(true_score, pred_score)
        })

    def evaluate_sick(self):
        true_score = []
        sent1 = []
        sent2 = []
        for example in self.dataset['sick']:
            sent1.append(example['sentence_A'])
            sent2.append(example['sentence_B'])
            true_score.append(example['relatedness_score'])

        pred_score = self.align_func(sent1, sent2)[ALL_TASKS['sick']].tolist()

        self.print_result_table({
            'Dataset_name': 'sick-r',
            'Pearson': self.get_pearson(true_score, pred_score),
            'Spearman': self.get_spearman(true_score, pred_score),
            'Kendall': self.get_kendalltau(true_score, pred_score)
        })

    def evaluate_race_m(self):
        true_score = []
        article = []
        qa = []

        for example in self.dataset['race_m']:
            for i, option in enumerate(example['options']):
                article.append(example['article'])
                qa.append(example['question']+" "+option+" " if "_" not in example['question'] else ' '.join(example['question'].replace("_", " "+option+" ").split()))
                if i == ord(example['answer'])-65:
                    true_score.append(i)  # 0,1,2,3

        pred_score = []
        pred_score_temp = self.align_func(article, qa)[ALL_TASKS['race_m']].tolist()
        for a, b, c, d in zip(*[iter(pred_score_temp)]*4):
            pred_score.append(np.argmax([a,b,c,d]))

        assert len(pred_score) == len(true_score)
        acc = [int(p==t) for p, t in zip(pred_score, true_score)]
        acc = sum(acc) / len(acc)

        self.print_result_table({
            'Dataset_name': 'race-m',
            'Accuracy': [acc],
        })

    def evaluate_race_h(self):
        true_score = []
        article = []
        qa = []

        for example in self.dataset['race_h']:
            for i, option in enumerate(example['options']):
                article.append(example['article'])
                qa.append(example['question']+" "+option+" " if "_" not in example['question'] else ' '.join(example['question'].replace("_", " "+option+" ").split()))
                if i == ord(example['answer'])-65:
                    true_score.append(i)  # 0,1,2,3

        pred_score = []
        pred_score_temp = self.align_func(article, qa)[ALL_TASKS['race_h']].tolist()
        for a, b, c, d in zip(*[iter(pred_score_temp)]*4):
            pred_score.append(np.argmax([a,b,c,d]))

        assert len(pred_score) == len(true_score)
        acc = [int(p==t) for p, t in zip(pred_score, true_score)]
        acc = sum(acc) / len(acc)

        self.print_result_table({
            'Dataset_name': 'race-h',
            'Accuracy': [acc]
        })

    # How to combine passage, question, and single answer for boolq
    def evaluate_boolq(self):
        true_score = []
        article = []
        qa = []
        for example in self.dataset['boolq']:
            for i in range(2):
                article.append(example['passage'])
                if i == 0:
                    qa.append(example['question']+" "+"No.")  # 0
                else:
                    qa.append(example['question']+" "+"Yes.")  # 1
            true_score.append(int(example['answer']))

        pred_score = []
        pred_score_temp = self.align_func(article, qa)[ALL_TASKS['boolq']].tolist()
        for a, b in zip(*[iter(pred_score_temp)]*2):
            pred_score.append(np.argmax([a,b]))

        assert len(pred_score) == len(true_score)
        acc = [int(p==t) for p, t in zip(pred_score, true_score)]
        acc = sum(acc) / len(acc)
        self.print_result_table({
            'Dataset_name': 'boolq',
            'Accuracy': [acc]
        })

    def evaluate_anli_1(self):
        true_score = []
        sent1 = []
        sent2 = []
        for example in self.dataset['anli_1']:
            sent1.append(example['premise'])
            sent2.append(example['hypothesis'])
            true_score.append(example['label'] if example['label']!=-1 else 1)

        pred_score = torch.argmax(self.align_func(sent1, sent2)[ALL_TASKS['anli_1']], dim=-1).tolist()

        self.print_result_table({
            'Dataset_name': 'anli-1',
            'Accuracy': [accuracy_score(true_score, pred_score)]
        })

    def evaluate_anli_2(self):
        true_score = []
        sent1 = []
        sent2 = []
        for example in self.dataset['anli_2']:
            sent1.append(example['premise'])
            sent2.append(example['hypothesis'])
            true_score.append(example['label'] if example['label']!=-1 else 1)

        pred_score = torch.argmax(self.align_func(sent1, sent2)[ALL_TASKS['anli_2']], dim=-1).tolist()

        self.print_result_table({
            'Dataset_name': 'anli-2',
            'Accuracy': [accuracy_score(true_score, pred_score)]
        })

    def evaluate_anli_3(self):
        true_score = []
        sent1 = []
        sent2 = []
        for example in self.dataset['anli_3']:
            sent1.append(example['premise'])
            sent2.append(example['hypothesis'])
            true_score.append(example['label'] if example['label']!=-1 else 1)

        pred_score = torch.argmax(self.align_func(sent1, sent2)[ALL_TASKS['anli_3']], dim=-1).tolist()

        self.print_result_table({
            'Dataset_name': 'anli-3',
            'Accuracy': [accuracy_score(true_score, pred_score)]
        })

    def evaluate_nli_fever(self):
        true_score = []
        sent1 = []
        sent2 = []
        for example in self.dataset['nli_fever']:
            sent1.append(example['hypothesis'])  # the original dataset flipped
            sent2.append(example['premise'])
            true_score.append(example['label'] if example['label']!=-1 else 1)

        pred_score = torch.argmax(self.align_func(sent1, sent2)[ALL_TASKS['nli_fever']], dim=-1).tolist()

        self.print_result_table({
            'Dataset_name': 'nli_fever',
            'Accuracy': [accuracy_score(true_score, pred_score)]
        })

    def evaluate_snli(self):
        true_score = []
        sent1 = []
        sent2 = []
        for example in self.dataset['snli']:
            sent1.append(example['premise'])
            sent2.append(example['hypothesis'])
            true_score.append(example['label'] if example['label']!=-1 else 1)

        pred_score = torch.argmax(self.align_func(sent1, sent2)[ALL_TASKS['snli']], dim=-1).tolist()

        self.print_result_table({
            'Dataset_name': 'snli',
            'Accuracy': [accuracy_score(true_score, pred_score)]
        })

    def evaluate_axb(self):
        true_score = []
        sent1 = []
        sent2 = []
        for example in self.dataset['axb']:
            sent1.append(example['sentence1'])
            sent2.append(example['sentence2'])

            true_score.append(1 if example['label']==0 else 0)

        pred_score = self.align_func(sent1, sent2)[ALL_TASKS['axb']].tolist()

        self.print_result_table({
            'Dataset_name': 'axb',
            'F1': self.get_f1(true_score, pred_score),
            'Accuracy': self.get_accuracy(true_score, pred_score),
            'AUC': [roc_auc_score(true_score, pred_score)],
            'Matthews': self.get_matthews_corr(true_score, pred_score)
        })

    def evaluate_axg(self):
        true_score = []
        sent1 = []
        sent2 = []
        for example in self.dataset['axg']:
            sent1.append(example['premise'])
            sent2.append(example['hypothesis'])

            true_score.append(1 if example['label']==0 else 0)

        pred_score = self.align_func(sent1, sent2)[2][:,0].tolist()

        self.print_result_table({
            'Dataset_name': 'axg',
|
632 |
+
'F1': self.get_f1(true_score, pred_score),
|
633 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
634 |
+
'AUC': [roc_auc_score(true_score, pred_score)],
|
635 |
+
})
|
636 |
+
|
637 |
+
def evaluate_cb(self):
|
638 |
+
true_score = []
|
639 |
+
sent1 = []
|
640 |
+
sent2 = []
|
641 |
+
|
642 |
+
for example in self.dataset['cb']:
|
643 |
+
sent1.append(example['premise'])
|
644 |
+
sent2.append(example['hypothesis'])
|
645 |
+
|
646 |
+
if example['label'] == 0:
|
647 |
+
label = 0
|
648 |
+
elif example['label'] == 1:
|
649 |
+
label = 2
|
650 |
+
elif example['label'] == 2:
|
651 |
+
label = 1
|
652 |
+
|
653 |
+
true_score.append(label)
|
654 |
+
|
655 |
+
pred_score = torch.argmax(self.align_func(sent1, sent2)[ALL_TASKS['cb']], dim=-1).tolist()
|
656 |
+
|
657 |
+
self.print_result_table({
|
658 |
+
'Dataset_name': 'cb',
|
659 |
+
'Accuracy': [accuracy_score(true_score, pred_score)],
|
660 |
+
})
|
661 |
+
|
662 |
+
def evaluate_rte(self):
|
663 |
+
true_score = []
|
664 |
+
sent1 = []
|
665 |
+
sent2 = []
|
666 |
+
for example in self.dataset['rte']:
|
667 |
+
sent1.append(example['premise'])
|
668 |
+
sent2.append(example['hypothesis'])
|
669 |
+
|
670 |
+
true_score.append(1 if example['label']==0 else 0)
|
671 |
+
|
672 |
+
pred_score = self.align_func(sent1, sent2)[2][:,0].tolist()
|
673 |
+
|
674 |
+
self.print_result_table({
|
675 |
+
'Dataset_name': 'rte',
|
676 |
+
'F1': self.get_f1(true_score, pred_score),
|
677 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
678 |
+
'AUC': [roc_auc_score(true_score, pred_score)],
|
679 |
+
})
|
680 |
+
|
681 |
+
def evaluate_wnli(self):
|
682 |
+
true_score = []
|
683 |
+
sent1 = []
|
684 |
+
sent2 = []
|
685 |
+
for example in self.dataset['wnli']:
|
686 |
+
sent1.append(example['text1'])
|
687 |
+
sent2.append(example['text2'])
|
688 |
+
|
689 |
+
true_score.append(example['label'])
|
690 |
+
|
691 |
+
pred_score = self.align_func(sent1, sent2)[2][:,0].tolist()
|
692 |
+
|
693 |
+
self.print_result_table({
|
694 |
+
'Dataset_name': 'wnli',
|
695 |
+
'F1': self.get_f1(true_score, pred_score),
|
696 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
697 |
+
'AUC': [roc_auc_score(true_score, pred_score)],
|
698 |
+
})
|
699 |
+
|
700 |
+
def evaluate_doc_nli(self):
|
701 |
+
true_score = []
|
702 |
+
sent1 = []
|
703 |
+
sent2 = []
|
704 |
+
for example in self.dataset['doc_nli']:
|
705 |
+
sent1.append(example['premise'])
|
706 |
+
sent2.append(example['hypothesis'])
|
707 |
+
|
708 |
+
true_score.append(1 if example['label'] == 'entailment' else 0)
|
709 |
+
|
710 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['doc_nli']].tolist()
|
711 |
+
|
712 |
+
self.print_result_table({
|
713 |
+
'Dataset_name': 'doc_nli',
|
714 |
+
'F1': self.get_f1(true_score, pred_score),
|
715 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
716 |
+
'AUC': [roc_auc_score(true_score, pred_score)],
|
717 |
+
})
|
718 |
+
|
719 |
+
def evaluate_qnli(self):
|
720 |
+
true_score = []
|
721 |
+
sent1 = []
|
722 |
+
sent2 = []
|
723 |
+
for example in self.dataset['qnli']:
|
724 |
+
sent1.append(example['sentence'])
|
725 |
+
sent2.append(example['question'])
|
726 |
+
|
727 |
+
true_score.append(1 if example['label'] == 0 else 0)
|
728 |
+
|
729 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['qnli']].tolist()
|
730 |
+
|
731 |
+
self.print_result_table({
|
732 |
+
'Dataset_name': 'qnli',
|
733 |
+
'F1': self.get_f1(true_score, pred_score),
|
734 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
735 |
+
'AUC': [roc_auc_score(true_score, pred_score)],
|
736 |
+
})
|
737 |
+
|
738 |
+
def evaluate_dream(self):
|
739 |
+
true_score = []
|
740 |
+
article = []
|
741 |
+
qa = []
|
742 |
+
|
743 |
+
for example in self.dataset['dream']:
|
744 |
+
for i, option in enumerate(example['choice']):
|
745 |
+
article.append(' '.join(example['dialogue']))
|
746 |
+
qa.append(example['question']+" "+option+" ")
|
747 |
+
if option == example['answer']:
|
748 |
+
true_score.append(i) # 0,1,2 (dream has three choices)
|
749 |
+
|
750 |
+
pred_score = []
|
751 |
+
pred_score_temp = self.align_func(article, qa)[ALL_TASKS['dream']].tolist()
|
752 |
+
for a, b, c in zip(*[iter(pred_score_temp)]*3):
|
753 |
+
|
754 |
+
pred_score.append(np.argmax([a,b,c]))
|
755 |
+
|
756 |
+
assert len(pred_score) == len(true_score)
|
757 |
+
acc = [int(p==t) for p, t in zip(pred_score, true_score)]
|
758 |
+
acc = sum(acc) / len(acc)
|
759 |
+
|
760 |
+
self.print_result_table({
|
761 |
+
'Dataset_name': 'dream',
|
762 |
+
'Accuracy': [acc],
|
763 |
+
})
|
764 |
+
|
765 |
+
def evaluate_quartz(self):
|
766 |
+
true_score = []
|
767 |
+
article = []
|
768 |
+
qa = []
|
769 |
+
|
770 |
+
for example in self.dataset['quartz']:
|
771 |
+
for i, option in enumerate(example['choices']['text']):
|
772 |
+
article.append(example['para'])
|
773 |
+
qa.append(example['question']+" "+option+" ")
|
774 |
+
if i == ord(example['answerKey'])-65:
|
775 |
+
true_score.append(i) # 0 or 1 (quartz is two-way)
|
776 |
+
|
777 |
+
pred_score = []
|
778 |
+
pred_score_temp = self.align_func(article, qa)[ALL_TASKS['quartz']].tolist()
|
779 |
+
for a, b in zip(*[iter(pred_score_temp)]*2):
|
780 |
+
|
781 |
+
pred_score.append(np.argmax([a,b]))
|
782 |
+
|
783 |
+
assert len(pred_score) == len(true_score)
|
784 |
+
acc = [int(p==t) for p, t in zip(pred_score, true_score)]
|
785 |
+
acc = sum(acc) / len(acc)
|
786 |
+
|
787 |
+
self.print_result_table({
|
788 |
+
'Dataset_name': 'quartz',
|
789 |
+
'Accuracy': [acc],
|
790 |
+
})
|
791 |
+
def evaluate_eraser_multi_rc(self):
|
792 |
+
true_score = []
|
793 |
+
sent1 = []
|
794 |
+
sent2 = []
|
795 |
+
for example in self.dataset['eraser_multi_rc']:
|
796 |
+
sent1.append(example['passage'])
|
797 |
+
sent2.append(example['query_and_answer'].replace("|", ""))
|
798 |
+
true_score.append(example['label'])
|
799 |
+
|
800 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['eraser_multi_rc']].tolist()
|
801 |
+
|
802 |
+
self.print_result_table({
|
803 |
+
'Dataset_name': 'eraser_multi_rc',
|
804 |
+
'F1': self.get_f1(true_score, pred_score),
|
805 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
806 |
+
'AUC': [roc_auc_score(true_score, pred_score)]
|
807 |
+
})
|
808 |
+
|
809 |
+
def evaluate_quail(self):
|
810 |
+
true_score = []
|
811 |
+
article = []
|
812 |
+
qa = []
|
813 |
+
|
814 |
+
for example in self.dataset['quail']:
|
815 |
+
for i, option in enumerate(example['answers']):
|
816 |
+
article.append(example['context'])
|
817 |
+
qa.append(example['question']+" "+option+" ")
|
818 |
+
if i == example['correct_answer_id']:
|
819 |
+
true_score.append(i) # 0,1,2,3
|
820 |
+
|
821 |
+
pred_score = []
|
822 |
+
pred_score_temp = self.align_func(article, qa)[ALL_TASKS['quail']].tolist()
|
823 |
+
for a, b, c, d in zip(*[iter(pred_score_temp)]*4):
|
824 |
+
|
825 |
+
pred_score.append(np.argmax([a,b,c,d]))
|
826 |
+
|
827 |
+
assert len(pred_score) == len(true_score)
|
828 |
+
acc = [int(p==t) for p, t in zip(pred_score, true_score)]
|
829 |
+
acc = sum(acc) / len(acc)
|
830 |
+
|
831 |
+
self.print_result_table({
|
832 |
+
'Dataset_name': 'quail',
|
833 |
+
'Accuracy': [acc],
|
834 |
+
})
|
835 |
+
|
836 |
+
def evaluate_sciq(self):
|
837 |
+
true_score = []
|
838 |
+
article = []
|
839 |
+
qa = []
|
840 |
+
|
841 |
+
for example in self.dataset['sciq']:
|
842 |
+
options = [example['correct_answer'], example['distractor1'], example['distractor2'], example['distractor3']]
|
843 |
+
for i, option in enumerate(options):
|
844 |
+
article.append(example['support'])
|
845 |
+
qa.append(example['question']+" "+option+" ")
|
846 |
+
if i == 0:
|
847 |
+
true_score.append(i) # 0,1,2,3, always 0
|
848 |
+
|
849 |
+
pred_score = []
|
850 |
+
pred_score_temp = self.align_func(article, qa)[ALL_TASKS['sciq']].tolist()
|
851 |
+
for a, b, c, d in zip(*[iter(pred_score_temp)]*4):
|
852 |
+
|
853 |
+
pred_score.append(np.argmax([a,b,c,d]))
|
854 |
+
|
855 |
+
assert len(pred_score) == len(true_score)
|
856 |
+
acc = [int(p==t) for p, t in zip(pred_score, true_score)]
|
857 |
+
acc = sum(acc) / len(acc)
|
858 |
+
|
859 |
+
self.print_result_table({
|
860 |
+
'Dataset_name': 'sciq',
|
861 |
+
'Accuracy': [acc],
|
862 |
+
})
|
863 |
+
|
864 |
+
def evaluate_gap(self):
|
865 |
+
true_score = []
|
866 |
+
article = []
|
867 |
+
qa = []
|
868 |
+
|
869 |
+
for example in self.dataset['gap']:
|
870 |
+
options = [example['Text'][:example['Pronoun-offset']]+example['A']+example['Text'][(example['Pronoun-offset']+len(example['Pronoun'])):],
|
871 |
+
example['Text'][:example['Pronoun-offset']]+example['B']+example['Text'][(example['Pronoun-offset']+len(example['Pronoun'])):]]
|
872 |
+
for i, option in enumerate(options):
|
873 |
+
article.append(example['Text'])
|
874 |
+
qa.append(option)
|
875 |
+
|
876 |
+
true_score.append(1 if example['B-coref'] else 0) # 1 if candidate B is the correct coreference, else 0
|
877 |
+
|
878 |
+
pred_score = []
|
879 |
+
pred_score_temp = self.align_func(article, qa)[ALL_TASKS['gap']].tolist()
|
880 |
+
for a, b in zip(*[iter(pred_score_temp)]*2):
|
881 |
+
pred_score.append(np.argmax([a,b]))
|
882 |
+
|
883 |
+
assert len(pred_score) == len(true_score)
|
884 |
+
acc = [int(p==t) for p, t in zip(pred_score, true_score)]
|
885 |
+
acc = sum(acc) / len(acc)
|
886 |
+
|
887 |
+
self.print_result_table({
|
888 |
+
'Dataset_name': 'gap',
|
889 |
+
'Accuracy': [acc],
|
890 |
+
})
|
891 |
+
|
892 |
+
# Fact checking (VitaminC): map SUPPORTS / NOT ENOUGH INFO / REFUTES to the 3-way labels 0 / 1 / 2
|
893 |
+
def evaluate_vitaminc(self):
|
894 |
+
true_score = []
|
895 |
+
sent1 = []
|
896 |
+
sent2 = []
|
897 |
+
for example in self.dataset['vitaminc']:
|
898 |
+
sent1.append(example['evidence'])
|
899 |
+
sent2.append(example['claim'])
|
900 |
+
if example['label'] == 'SUPPORTS':
|
901 |
+
true_score.append(0)
|
902 |
+
elif example['label'] == 'REFUTES':
|
903 |
+
true_score.append(2)
|
904 |
+
else:
|
905 |
+
true_score.append(1)
|
906 |
+
|
907 |
+
pred_score = torch.argmax(self.align_func(sent1, sent2)[ALL_TASKS['vitaminc']], dim=-1).tolist()
|
908 |
+
|
909 |
+
self.print_result_table({
|
910 |
+
'Dataset_name': 'vitaminc',
|
911 |
+
'F1': self.get_3label_f1(true_score, pred_score),
|
912 |
+
'Accuracy': [accuracy_score(true_score, pred_score)],
|
913 |
+
})
|
914 |
+
|
915 |
+
def evaluate_mrpc(self):
|
916 |
+
true_score = []
|
917 |
+
sent1 = []
|
918 |
+
sent2 = []
|
919 |
+
for example in self.dataset['mrpc']:
|
920 |
+
sent1.append(example['sentence1'])
|
921 |
+
sent2.append(example['sentence2'])
|
922 |
+
true_score.append(example['label'])
|
923 |
+
|
924 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['mrpc']].tolist()
|
925 |
+
|
926 |
+
self.print_result_table({
|
927 |
+
'Dataset_name': 'mrpc',
|
928 |
+
'F1': self.get_f1(true_score, pred_score),
|
929 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
930 |
+
'AUC': [roc_auc_score(true_score, pred_score)]
|
931 |
+
})
|
932 |
+
|
933 |
+
def evaluate_paws(self):
|
934 |
+
true_score = []
|
935 |
+
sent1 = []
|
936 |
+
sent2 = []
|
937 |
+
for example in self.dataset['paws']:
|
938 |
+
sent1.append(example['sentence1'])
|
939 |
+
sent2.append(example['sentence2'])
|
940 |
+
true_score.append(example['label'])
|
941 |
+
|
942 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['paws']].tolist()
|
943 |
+
|
944 |
+
self.print_result_table({
|
945 |
+
'Dataset_name': 'paws',
|
946 |
+
'F1': self.get_f1(true_score, pred_score),
|
947 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
948 |
+
'AUC': [roc_auc_score(true_score, pred_score)]
|
949 |
+
})
|
950 |
+
|
951 |
+
def evaluate_mnli_matched(self):
|
952 |
+
true_score = []
|
953 |
+
sent1 = []
|
954 |
+
sent2 = []
|
955 |
+
for example in self.dataset['mnli_matched']:
|
956 |
+
sent1.append(example['premise'])
|
957 |
+
sent2.append(example['hypothesis'])
|
958 |
+
true_score.append(example['label'] if example['label']!=-1 else 1)
|
959 |
+
|
960 |
+
pred_score = torch.argmax(self.align_func(sent1, sent2)[ALL_TASKS['mnli_matched']], dim=-1).tolist()
|
961 |
+
|
962 |
+
self.print_result_table({
|
963 |
+
'Dataset_name': 'mnli_matched',
|
964 |
+
'Accuracy': [accuracy_score(true_score, pred_score)]
|
965 |
+
})
|
966 |
+
|
967 |
+
def evaluate_mnli_mismatched(self):
|
968 |
+
true_score = []
|
969 |
+
sent1 = []
|
970 |
+
sent2 = []
|
971 |
+
for example in self.dataset['mnli_mismatched']:
|
972 |
+
sent1.append(example['premise'])
|
973 |
+
sent2.append(example['hypothesis'])
|
974 |
+
true_score.append(example['label'] if example['label']!=-1 else 1)
|
975 |
+
|
976 |
+
pred_score = torch.argmax(self.align_func(sent1, sent2)[ALL_TASKS['mnli_mismatched']], dim=-1).tolist()
|
977 |
+
|
978 |
+
self.print_result_table({
|
979 |
+
'Dataset_name': 'mnli_mismatched',
|
980 |
+
'Accuracy': [accuracy_score(true_score, pred_score)]
|
981 |
+
})
|
982 |
+
|
983 |
+
def evaluate_sem_eval(self):
|
984 |
+
|
985 |
+
true_score = []
|
986 |
+
sent1 = []
|
987 |
+
sent2 = []
|
988 |
+
for example in self.dataset['sem_eval']:
|
989 |
+
sent1.append(example['premise'])
|
990 |
+
sent2.append(example['hypothesis'])
|
991 |
+
if example['entailment_judgment'] == 1:
|
992 |
+
true_score.append(1)
|
993 |
+
else:
|
994 |
+
true_score.append(0)
|
995 |
+
|
996 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['sem_eval']].tolist()
|
997 |
+
|
998 |
+
self.print_result_table({
|
999 |
+
'Dataset_name': 'sem_eval',
|
1000 |
+
'Accuracy': self.get_accuracy(true_score, pred_score)
|
1001 |
+
})
|
1002 |
+
|
1003 |
+
def evaluate_summeval(self):
|
1004 |
+
true_score = []
|
1005 |
+
true_score_rel = []
|
1006 |
+
true_score_binary = []
|
1007 |
+
pred_score = []
|
1008 |
+
sent1 = []
|
1009 |
+
sent2 = []
|
1010 |
+
for example in self.dataset['summeval']:
|
1011 |
+
cleaned_summary = self.clean_text(example['document'], example['summary'])
|
1012 |
+
sent1.append(example['document'])
|
1013 |
+
sent2.append(cleaned_summary)
|
1014 |
+
true_score.append(example['consistency'])
|
1015 |
+
true_score_rel.append(example['relevance'])
|
1016 |
+
true_score_binary.append(1 if example['consistency'] == 5.0 else 0)
|
1017 |
+
|
1018 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['summeval']].tolist()
|
1019 |
+
|
1020 |
+
self.print_result_table({
|
1021 |
+
'Dataset_name': 'summeval',
|
1022 |
+
'Pearson': self.get_pearson(true_score, pred_score),
|
1023 |
+
'Spearman': self.get_spearman(true_score, pred_score),
|
1024 |
+
'Kendall': self.get_kendalltau(true_score, pred_score),
|
1025 |
+
'AUC': roc_auc_score(true_score_binary, pred_score),
|
1026 |
+
'Pearson_rel': self.get_pearson(true_score_rel, pred_score),
|
1027 |
+
'Spearman_rel': self.get_spearman(true_score_rel, pred_score),
|
1028 |
+
'Kendall_rel': self.get_kendalltau(true_score_rel, pred_score),
|
1029 |
+
})
|
1030 |
+
|
1031 |
+
def evaluate_qags_xsum(self):
|
1032 |
+
true_score = []
|
1033 |
+
pred_score = []
|
1034 |
+
sent1 = []
|
1035 |
+
sent2 = []
|
1036 |
+
for example in self.dataset['qags_xsum']:
|
1037 |
+
sent1.append(example['document'])
|
1038 |
+
sent2.append(example['summary'])
|
1039 |
+
true_score.append(example['consistency'])
|
1040 |
+
|
1041 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['qags_xsum']].tolist()
|
1042 |
+
|
1043 |
+
self.print_result_table({
|
1044 |
+
'Dataset_name': 'qags_xsum',
|
1045 |
+
'Pearson': self.get_pearson(true_score, pred_score),
|
1046 |
+
'Spearman': self.get_spearman(true_score, pred_score),
|
1047 |
+
'Kendall': self.get_kendalltau(true_score, pred_score),
|
1048 |
+
'AUC': roc_auc_score(true_score, pred_score)
|
1049 |
+
})
|
1050 |
+
|
1051 |
+
def evaluate_qags_cnndm(self):
|
1052 |
+
true_score = []
|
1053 |
+
pred_score = []
|
1054 |
+
sent1 = []
|
1055 |
+
sent2 = []
|
1056 |
+
true_score_binary = []
|
1057 |
+
for example in self.dataset['qags_cnndm']:
|
1058 |
+
sent1.append(example['document'])
|
1059 |
+
sent2.append(example['summary'])
|
1060 |
+
true_score.append(example['consistency'])
|
1061 |
+
true_score_binary.append(1 if example['consistency'] == 1.0 else 0)
|
1062 |
+
|
1063 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['qags_cnndm']].tolist()
|
1064 |
+
|
1065 |
+
self.print_result_table({
|
1066 |
+
'Dataset_name': 'qags_cnndm',
|
1067 |
+
'Pearson': self.get_pearson(true_score, pred_score),
|
1068 |
+
'Spearman': self.get_spearman(true_score, pred_score),
|
1069 |
+
'Kendall': self.get_kendalltau(true_score, pred_score),
|
1070 |
+
'AUC': roc_auc_score(true_score_binary, pred_score)
|
1071 |
+
})
|
1072 |
+
|
1073 |
+
def evaluate_frank(self):
|
1074 |
+
from spacy.lang.en import English
|
1075 |
+
nlp = English()
|
1076 |
+
nlp.add_pipe("sentencizer")
|
1077 |
+
for d in self.dataset['frank']:
|
1078 |
+
if d['dataset'] == 'cnndm':
|
1079 |
+
continue
|
1080 |
+
d['document'] = ' '.join([each.text for each in nlp(d['document']).sents])
|
1081 |
+
|
1082 |
+
true_score_xsum = []
|
1083 |
+
true_score_cnndm = []
|
1084 |
+
pred_score_xsum = []
|
1085 |
+
pred_score_cnndm = []
|
1086 |
+
sent1_xsum = []
|
1087 |
+
sent1_cnndm = []
|
1088 |
+
sent2_xsum = []
|
1089 |
+
sent2_cnndm = []
|
1090 |
+
true_score_binary_cnndm = []
|
1091 |
+
true_score_binary_xsum = []
|
1092 |
+
for example in self.dataset['frank']:
|
1093 |
+
if example['dataset'] == 'cnndm':
|
1094 |
+
sent1_cnndm.append(example['document'])
|
1095 |
+
sent2_cnndm.append(self.clean_text(example['document'], example['summary']))
|
1096 |
+
true_score_cnndm.append(example['score'])
|
1097 |
+
true_score_binary_cnndm.append(1 if example['score'] == 1.0 else 0)
|
1098 |
+
elif example['dataset'] == 'xsum':
|
1099 |
+
sent1_xsum.append(example['document'])
|
1100 |
+
sent2_xsum.append(self.clean_text(example['document'], example['summary']))
|
1101 |
+
true_score_xsum.append(example['score'])
|
1102 |
+
true_score_binary_xsum.append(1 if example['score'] == 1.0 else 0)
|
1103 |
+
|
1104 |
+
pred_score_xsum = self.align_func(sent1_xsum, sent2_xsum)[ALL_TASKS['frank']].tolist() #
|
1105 |
+
pred_score_cnndm = self.align_func(sent1_cnndm, sent2_cnndm)[ALL_TASKS['frank']].tolist() #
|
1106 |
+
|
1107 |
+
self.print_result_table({
|
1108 |
+
'Dataset_name': 'frank-xsum',
|
1109 |
+
'Pearson': self.get_pearson(true_score_xsum, pred_score_xsum),
|
1110 |
+
'Spearman': self.get_spearman(true_score_xsum, pred_score_xsum),
|
1111 |
+
'Kendall': self.get_kendalltau(true_score_xsum, pred_score_xsum),
|
1112 |
+
'AUC': roc_auc_score(true_score_binary_xsum, pred_score_xsum)
|
1113 |
+
})
|
1114 |
+
|
1115 |
+
self.print_result_table({
|
1116 |
+
'Dataset_name': 'frank-cnndm',
|
1117 |
+
'Pearson': self.get_pearson(true_score_cnndm, pred_score_cnndm),
|
1118 |
+
'Spearman': self.get_spearman(true_score_cnndm, pred_score_cnndm),
|
1119 |
+
'Kendall': self.get_kendalltau(true_score_cnndm, pred_score_cnndm),
|
1120 |
+
'AUC': roc_auc_score(true_score_binary_cnndm, pred_score_cnndm)
|
1121 |
+
})
|
1122 |
+
|
1123 |
+
self.print_result_table({
|
1124 |
+
'Dataset_name': 'frank-all',
|
1125 |
+
'Pearson': self.get_pearson(true_score_xsum+true_score_cnndm, pred_score_xsum+pred_score_cnndm),
|
1126 |
+
'Spearman': self.get_spearman(true_score_xsum+true_score_cnndm, pred_score_xsum+pred_score_cnndm),
|
1127 |
+
'Kendall': self.get_kendalltau(true_score_xsum+true_score_cnndm, pred_score_xsum+pred_score_cnndm),
|
1128 |
+
'AUC': roc_auc_score(true_score_binary_xsum+true_score_binary_cnndm, pred_score_xsum+pred_score_cnndm)
|
1129 |
+
})
|
1130 |
+
|
1131 |
+
def evaluate_xsumfaith(self):
|
1132 |
+
dataset_name = 'xsumfaith'
|
1133 |
+
|
1134 |
+
true_score = []
|
1135 |
+
pred_score = []
|
1136 |
+
sent1 = []
|
1137 |
+
sent2 = []
|
1138 |
+
for example in self.dataset[dataset_name]:
|
1139 |
+
sent1.append(example['document'])
|
1140 |
+
sent2.append(self.clean_text(example['document'], example['claim']))
|
1141 |
+
true_score.append(example['label'])
|
1142 |
+
|
1143 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS[dataset_name]].tolist()
|
1144 |
+
|
1145 |
+
self.print_result_table({
|
1146 |
+
'Dataset_name': dataset_name,
|
1147 |
+
'Pearson': self.get_pearson(true_score, pred_score),
|
1148 |
+
'Spearman': self.get_spearman(true_score, pred_score),
|
1149 |
+
'Kendall': self.get_kendalltau(true_score, pred_score),
|
1150 |
+
})
|
1151 |
+
|
1152 |
+
def evaluate_samsum(self):
|
1153 |
+
dataset_name = 'samsum'
|
1154 |
+
|
1155 |
+
label_mapping = {
|
1156 |
+
'factual': 1,
|
1157 |
+
'factually incorrect': 0,
|
1158 |
+
'too incoherent': 0
|
1159 |
+
}
|
1160 |
+
import string
|
1161 |
+
printable = set(string.printable)
|
1162 |
+
|
1163 |
+
|
1164 |
+
true_score = []
|
1165 |
+
pred_score = []
|
1166 |
+
sent1 = []
|
1167 |
+
sent2 = []
|
1168 |
+
for example in self.dataset[dataset_name]:
|
1169 |
+
cleaned_doc = ''.join(filter(lambda x: x in printable, example['article']))
|
1170 |
+
sent1.append(cleaned_doc)
|
1171 |
+
sent2.append(example['summary'])
|
1172 |
+
true_score.append(label_mapping[example['label']])
|
1173 |
+
|
1174 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS[dataset_name]].tolist()
|
1175 |
+
|
1176 |
+
self.print_result_table({
|
1177 |
+
'Dataset_name': dataset_name,
|
1178 |
+
'Pearson': self.get_pearson(true_score, pred_score),
|
1179 |
+
'Spearman': self.get_spearman(true_score, pred_score),
|
1180 |
+
'Kendall': self.get_kendalltau(true_score, pred_score),
|
1181 |
+
'AUC': roc_auc_score(true_score, pred_score)
|
1182 |
+
})
|
1183 |
+
def evaluate_yelp(self):
|
1184 |
+
true_score = []
|
1185 |
+
sent1 = []
|
1186 |
+
sent2 = []
|
1187 |
+
for example in self.dataset['yelp']:
|
1188 |
+
sent1.append(example['input_sent'])
|
1189 |
+
sent2.append(example['output_sent'])
|
1190 |
+
true_score.append(example['preservation'])
|
1191 |
+
|
1192 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['yelp']].tolist()
|
1193 |
+
|
1194 |
+
self.print_result_table({
|
1195 |
+
'Dataset_name': 'yelp',
|
1196 |
+
'Pearson': self.get_pearson(true_score, pred_score),
|
1197 |
+
'Spearman': self.get_spearman(true_score, pred_score),
|
1198 |
+
'Kendall': self.get_kendalltau(true_score, pred_score)
|
1199 |
+
})
|
1200 |
+
|
1201 |
+
def evaluate_persona_chat(self):
|
1202 |
+
true_score = []
|
1203 |
+
pred_score = []
|
1204 |
+
premise = []
|
1205 |
+
hypothesis = []
|
1206 |
+
for example in self.dataset['persona_chat']:
|
1207 |
+
premise.append(example['dialog_history']+example['fact'])
|
1208 |
+
hypothesis.append(example['response'])
|
1209 |
+
true_score.append(example['engaging'])
|
1210 |
+
pred_score = self.align_func(premise, hypothesis)[ALL_TASKS['persona_chat']].tolist()
|
1211 |
+
|
1212 |
+
self.print_result_table({
|
1213 |
+
'Dataset_name': 'persona_chat_eng',
|
1214 |
+
'Pearson': self.get_pearson(true_score, pred_score),
|
1215 |
+
'Spearman': self.get_spearman(true_score, pred_score),
|
1216 |
+
'Kendall': self.get_kendalltau(true_score, pred_score)
|
1217 |
+
})
|
1218 |
+
|
1219 |
+
true_score = []
|
1220 |
+
pred_score = []
|
1221 |
+
premise = []
|
1222 |
+
hypothesis = []
|
1223 |
+
for example in self.dataset['persona_chat']:
|
1224 |
+
premise.append(example['fact'])
|
1225 |
+
hypothesis.append(example['response'])
|
1226 |
+
true_score.append(example['uses_knowledge'])
|
1227 |
+
pred_score = self.align_func(premise, hypothesis)[ALL_TASKS['persona_chat']].tolist()
|
1228 |
+
|
1229 |
+
self.print_result_table({
|
1230 |
+
'Dataset_name': 'persona_chat_grd',
|
1231 |
+
'Pearson': self.get_pearson(true_score, pred_score),
|
1232 |
+
'Spearman': self.get_spearman(true_score, pred_score),
|
1233 |
+
'Kendall': self.get_kendalltau(true_score, pred_score)
|
1234 |
+
})
|
1235 |
+
|
1236 |
+
def evaluate_topical_chat(self):
|
1237 |
+
true_score = []
|
1238 |
+
pred_score = []
|
1239 |
+
premise = []
|
1240 |
+
hypothesis = []
|
1241 |
+
for example in self.dataset['topical_chat']:
|
1242 |
+
premise.append(example['dialog_history']+example['fact'])
|
1243 |
+
hypothesis.append(example['response'])
|
1244 |
+
true_score.append(example['engaging'])
|
1245 |
+
pred_score = self.align_func(premise, hypothesis)[ALL_TASKS['topical_chat']].tolist()
|
1246 |
+
|
1247 |
+
self.print_result_table({
|
1248 |
+
'Dataset_name': 'topical_chat_eng',
|
1249 |
+
'Pearson': self.get_pearson(true_score, pred_score),
|
1250 |
+
'Spearman': self.get_spearman(true_score, pred_score),
|
1251 |
+
'Kendall': self.get_kendalltau(true_score, pred_score)
|
1252 |
+
})
|
1253 |
+
|
1254 |
+
true_score = []
|
1255 |
+
pred_score = []
|
1256 |
+
premise = []
|
1257 |
+
hypothesis = []
|
1258 |
+
for example in self.dataset['topical_chat']:
|
1259 |
+
premise.append(example['fact'])
|
1260 |
+
hypothesis.append(example['response'])
|
1261 |
+
true_score.append(example['uses_knowledge'])
|
1262 |
+
pred_score = self.align_func(premise, hypothesis)[ALL_TASKS['topical_chat']].tolist()
|
1263 |
+
|
1264 |
+
self.print_result_table({
|
1265 |
+
'Dataset_name': 'topical_chat_grd',
|
1266 |
+
'Pearson': self.get_pearson(true_score, pred_score),
|
1267 |
+
'Spearman': self.get_spearman(true_score, pred_score),
|
1268 |
+
'Kendall': self.get_kendalltau(true_score, pred_score)
|
1269 |
+
})
|
1270 |
+
|
1271 |
+
def evaluate_paws_qqp(self):
|
1272 |
+
sent1 = []
|
1273 |
+
sent2 = []
|
1274 |
+
true_score = []
|
1275 |
+
for i in range(self.dataset['paws_qqp']['label'].size):
|
1276 |
+
sent1.append(self.dataset['paws_qqp']['sentence1'][i][2:-1])
|
1277 |
+
sent2.append(self.dataset['paws_qqp']['sentence2'][i][2:-1])
|
1278 |
+
true_score.append(self.dataset['paws_qqp']['label'][i])
|
1279 |
+
|
1280 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['paws_qqp']].tolist()
|
1281 |
+
roc_auc = roc_auc_score(true_score, pred_score)
|
1282 |
+
|
1283 |
+
self.print_result_table({
|
1284 |
+
'Dataset_name': 'paws_qqp',
|
1285 |
+
'F1': self.get_f1(true_score, pred_score),
|
1286 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
1287 |
+
'AUC': [roc_auc]
|
1288 |
+
})
|
1289 |
+
|
1290 |
+
def evaluate_qqp(self):
|
1291 |
+
true_score = []
|
1292 |
+
sent1 = []
|
1293 |
+
sent2 = []
|
1294 |
+
for example in self.dataset['qqp']:
|
1295 |
+
sent1.append(example['question1'])
|
1296 |
+
sent2.append(example['question2'])
|
1297 |
+
true_score.append(example['label'])
|
1298 |
+
|
1299 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['qqp']].tolist()
|
1300 |
+
|
1301 |
+
self.print_result_table({
|
1302 |
+
'Dataset_name': 'qqp',
|
1303 |
+
'F1': self.get_f1(true_score, pred_score),
|
1304 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
1305 |
+
'AUC': [roc_auc_score(true_score, pred_score)]
|
1306 |
+
})
|
1307 |
+
|
1308 |
+
def evaluate_wmt17(self):
|
1309 |
+
lang_pair = list(set([each['lang'] for each in self.dataset['wmt17']]))
|
1310 |
+
|
1311 |
+
for each_lang_pair in lang_pair:
|
1312 |
+
true_score = []
|
1313 |
+
premise = []
|
1314 |
+
hypothesis = []
|
1315 |
+
for example in self.dataset['wmt17']:
|
1316 |
+
if example['lang'] != each_lang_pair:
|
1317 |
+
continue
|
1318 |
+
premise.append(example['reference'])
|
1319 |
+
hypothesis.append(example['candidate'])
|
1320 |
+
true_score.append(example['score'])
|
1321 |
+
pred_score = self.align_func(premise, hypothesis)[ALL_TASKS['wmt17']].tolist()
|
1322 |
+
|
1323 |
+
self.print_result_table({
|
1324 |
+
'Dataset_name': f'wmt17-{each_lang_pair}',
|
1325 |
+
'Pearson': self.get_pearson(true_score, pred_score),
|
1326 |
+
'Spearman': self.get_spearman(true_score, pred_score),
|
1327 |
+
'Kendall': self.get_kendalltau(true_score, pred_score)
|
1328 |
+
})
|
1329 |
+
|
1330 |
+
def evaluate_wmt18(self):
|
1331 |
+
lang_pair = list(set([each['lang'] for each in self.dataset['wmt18']]))
|
1332 |
+
|
1333 |
+
for each_lang_pair in lang_pair:
|
1334 |
+
true_score = []
|
1335 |
+
premise = []
|
1336 |
+
hypothesis = []
|
1337 |
+
for example in self.dataset['wmt18']:
|
1338 |
+
if example['lang'] != each_lang_pair:
|
1339 |
+
continue
|
1340 |
+
premise.append(example['reference'])
|
1341 |
+
hypothesis.append(example['candidate'])
|
1342 |
+
true_score.append(example['score'])
|
1343 |
+
pred_score = self.align_func(premise, hypothesis)[ALL_TASKS['wmt18']].tolist()
|
1344 |
+
|
1345 |
+
self.print_result_table({
|
1346 |
+
'Dataset_name': f'wmt18-{each_lang_pair}',
|
1347 |
+
'Pearson': self.get_pearson(true_score, pred_score),
|
1348 |
+
'Spearman': self.get_spearman(true_score, pred_score),
|
1349 |
+
'Kendall': self.get_kendalltau(true_score, pred_score)
|
1350 |
+
})
|
1351 |
+
def evaluate_wmt19(self):
|
1352 |
+
lang_pair = list(set([each['lang'] for each in self.dataset['wmt19']]))
|
1353 |
+
|
1354 |
+
for each_lang_pair in lang_pair:
|
1355 |
+
true_score = []
|
1356 |
+
premise = []
|
1357 |
+
hypothesis = []
|
1358 |
+
for example in self.dataset['wmt19']:
|
1359 |
+
if example['lang'] != each_lang_pair:
|
1360 |
+
continue
|
1361 |
+
premise.append(example['reference'])
|
1362 |
+
hypothesis.append(example['candidate'])
|
1363 |
+
true_score.append(example['score'])
|
1364 |
+
pred_score = self.align_func(premise, hypothesis)[ALL_TASKS['wmt19']].tolist()
|
1365 |
+
|
1366 |
+
self.print_result_table({
|
1367 |
+
'Dataset_name': f'wmt19-{each_lang_pair}',
|
1368 |
+
'Pearson': self.get_pearson(true_score, pred_score),
|
1369 |
+
'Spearman': self.get_spearman(true_score, pred_score),
|
1370 |
+
'Kendall': self.get_kendalltau(true_score, pred_score)
|
1371 |
+
})
|
1372 |
+
|
1373 |
+
def evaluate_sst2(self):
|
1374 |
+
true_score = []
|
1375 |
+
sent1 = []
|
1376 |
+
sent2 = []
|
1377 |
+
for example in self.dataset['sst2']:
|
1378 |
+
sent1.append(example['text'])
|
1379 |
+
sent2.append('It was great.')
|
1380 |
+
true_score.append(int(example['label_text'] == 'positive'))
|
1381 |
+
|
1382 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['sst2']].tolist()
|
1383 |
+
|
1384 |
+
self.print_result_table({
|
1385 |
+
'Dataset_name': 'sst2',
|
1386 |
+
'F1': self.get_f1(true_score, pred_score),
|
1387 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
1388 |
+
'AUC': roc_auc_score(true_score, pred_score)
|
1389 |
+
})
|
1390 |
+
|
1391 |
+
def evaluate_cr(self):
|
1392 |
+
true_score = []
|
1393 |
+
sent1 = []
|
1394 |
+
sent2 = []
|
1395 |
+
for example in self.dataset['cr']:
|
1396 |
+
sent1.append(example['text'])
|
1397 |
+
sent2.append('It was great.')
|
1398 |
+
true_score.append(int(example['label_text'] == 'positive'))
|
1399 |
+
|
1400 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['cr']].tolist()
|
1401 |
+
|
1402 |
+
self.print_result_table({
|
1403 |
+
'Dataset_name': 'cr',
|
1404 |
+
'F1': self.get_f1(true_score, pred_score),
|
1405 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
1406 |
+
'AUC': roc_auc_score(true_score, pred_score)
|
1407 |
+
})
|
1408 |
+
|
1409 |
+
def evaluate_subj(self):
|
1410 |
+
true_score = []
|
1411 |
+
sent1 = []
|
1412 |
+
sent2 = []
|
1413 |
+
for example in self.dataset['subj']:
|
1414 |
+
sent1.append(example['text'])
|
1415 |
+
sent2.append('It was objective.')
|
1416 |
+
true_score.append(int(example['label_text'] == 'objective'))
|
1417 |
+
|
1418 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['subj']].tolist()
|
1419 |
+
|
1420 |
+
self.print_result_table({
|
1421 |
+
'Dataset_name': 'subj',
|
1422 |
+
'F1': self.get_f1(true_score, pred_score),
|
1423 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
1424 |
+
'AUC': roc_auc_score(true_score, pred_score)
|
1425 |
+
})
|
1426 |
+
|
1427 |
+
def evaluate_imdb(self):
|
1428 |
+
true_score = []
|
1429 |
+
sent1 = []
|
1430 |
+
sent2 = []
|
1431 |
+
for example in self.dataset['imdb']:
|
1432 |
+
sent1.append(example['text'])
|
1433 |
+
sent2.append('It was great.')
|
1434 |
+
true_score.append(int(example['label_text'] == 'positive'))
|
1435 |
+
|
1436 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['imdb']].tolist()
|
1437 |
+
|
1438 |
+
self.print_result_table({
|
1439 |
+
'Dataset_name': 'imdb',
|
1440 |
+
'F1': self.get_f1(true_score, pred_score),
|
1441 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
1442 |
+
'AUC': roc_auc_score(true_score, pred_score)
|
1443 |
+
})
|
1444 |
+
|
1445 |
+
def evaluate_imdb_knn(self):
|
1446 |
+
true_score = []
|
1447 |
+
sent1 = []
|
1448 |
+
sent2 = []
|
1449 |
+
for example in self.dataset['imdb']:
|
1450 |
+
sent1.append(example['text'])
|
1451 |
+
sent2.append('It was great.')
|
1452 |
+
true_score.append(int(example['label_text'] == 'positive'))
|
1453 |
+
|
1454 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['imdb']].tolist()
|
1455 |
+
|
1456 |
+
self.print_result_table({
|
1457 |
+
'Dataset_name': 'imdb_knn',
|
1458 |
+
'F1': self.get_f1(true_score, pred_score),
|
1459 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
1460 |
+
'AUC': roc_auc_score(true_score, pred_score)
|
1461 |
+
})
|
1462 |
+
|
1463 |
+
def evaluate_cola(self):
|
1464 |
+
true_score = []
|
1465 |
+
sent1 = []
|
1466 |
+
sent2 = []
|
1467 |
+
for example in self.dataset['cola']:
|
1468 |
+
sent1.append(example['sentence'])
|
1469 |
+
sent2.append('It was correct.')
|
1470 |
+
true_score.append(example['label'])
|
1471 |
+
|
1472 |
+
pred_score = self.align_func(sent1, sent2)[ALL_TASKS['cola']].tolist()
|
1473 |
+
|
1474 |
+
self.print_result_table({
|
1475 |
+
'Dataset_name': 'cola',
|
1476 |
+
'F1': self.get_f1(true_score, pred_score),
|
1477 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
1478 |
+
'AUC': roc_auc_score(true_score, pred_score)
|
1479 |
+
})
|
1480 |
+
|
1481 |
+
def evaluate_yelp_efl(self):
|
1482 |
+
sent = []
|
1483 |
+
label = []
|
1484 |
+
for example in self.dataset['yelp_efl']:
|
1485 |
+
sent.append(example['text'])
|
1486 |
+
label.append(example['label'])
|
1487 |
+
templates = [
|
1488 |
+
'It was terrible.',
|
1489 |
+
'It was bad.',
|
1490 |
+
'It was ok.',
|
1491 |
+
'It was good.',
|
1492 |
+
'It was great.',
|
1493 |
+
]
|
1494 |
+
template_lists = [[template] * len(sent) for template in templates]
|
1495 |
+
predictions = [
|
1496 |
+
self.align_func(sent, template_list)[ALL_TASKS['yelp_efl']]
|
1497 |
+
for template_list in template_lists
|
1498 |
+
]
|
1499 |
+
|
1500 |
+
pred_label = torch.argmax(torch.stack(predictions), dim=0).tolist()
|
1501 |
+
|
1502 |
+
self.print_result_table({
|
1503 |
+
'Dataset_name': 'yelp_efl',
|
1504 |
+
'Accuracy': [accuracy_score(label, pred_label)]
|
1505 |
+
})
|
1506 |
+
|
1507 |
+
def evaluate_ag_news(self):
|
1508 |
+
sent = []
|
1509 |
+
label = []
|
1510 |
+
for example in self.dataset['ag_news']:
|
1511 |
+
sent.append(example['text'])
|
1512 |
+
label.append(example['label'])
|
1513 |
+
templates = [
|
1514 |
+
'It is world news.',
|
1515 |
+
'It is sports news.',
|
1516 |
+
'It is business news.',
|
1517 |
+
'It is science news.',
|
1518 |
+
]
|
1519 |
+
template_lists = [[template] * len(sent) for template in templates]
|
1520 |
+
predictions = [
|
1521 |
+
self.align_func(sent, template_list)[ALL_TASKS['ag_news']]
|
1522 |
+
for template_list in template_lists
|
1523 |
+
]
|
1524 |
+
|
1525 |
+
pred_label = torch.argmax(torch.stack(predictions), dim=0).tolist()
|
1526 |
+
|
1527 |
+
self.print_result_table({
|
1528 |
+
'Dataset_name': 'ag_news',
|
1529 |
+
'Accuracy': [accuracy_score(label, pred_label)]
|
1530 |
+
})
|
1531 |
+
|
1532 |
+
def evaluate_trec(self):
|
1533 |
+
sent = []
|
1534 |
+
label = []
|
1535 |
+
for example in self.dataset['trec']:
|
1536 |
+
sent.append(example['text'])
|
1537 |
+
label.append(example['label_coarse'])
|
1538 |
+
templates = [
|
1539 |
+
'It is description.',
|
1540 |
+
'It is entity.',
|
1541 |
+
'It is expression.',
|
1542 |
+
'It is human.',
|
1543 |
+
'It is number.',
|
1544 |
+
'It is location.',
|
1545 |
+
]
|
1546 |
+
template_lists = [[template] * len(sent) for template in templates]
|
1547 |
+
predictions = [
|
1548 |
+
self.align_func(sent, template_list)[ALL_TASKS['trec']]
|
1549 |
+
for template_list in template_lists
|
1550 |
+
]
|
1551 |
+
|
1552 |
+
pred_label = torch.argmax(torch.stack(predictions), dim=0).tolist()
|
1553 |
+
|
1554 |
+
self.print_result_table({
|
1555 |
+
'Dataset_name': 'trec',
|
1556 |
+
'Accuracy': [accuracy_score(label, pred_label)]
|
1557 |
+
})
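# evaluate_yelp_efl, evaluate_ag_news and evaluate_trec above share one
# zero-shot recipe: score every input against one hypothesis template per
# class and predict the class whose template aligns best. A hedged,
# self-contained sketch of that pattern (align_fn, templates and head_index
# are placeholders, not names from this file):
#
#   import torch
#
#   def zero_shot_classify(align_fn, texts, templates, head_index):
#       # One batch per template: score every text against that template.
#       per_class = [align_fn(texts, [tpl] * len(texts))[head_index]
#                    for tpl in templates]
#       # Stack to (num_templates, num_texts); best template per text wins.
#       return torch.argmax(torch.stack(per_class), dim=0).tolist()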
|
1558 |
+
|
1559 |
+
def true_task_helper(self, dataset_name):
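# Shared routine for the TRUE benchmark subsets: score each
# (grounding, generated_text) pair with the second output head (used
# throughout this file as the binary alignment score) and report
# F1 / accuracy / ROC-AUC against the 0-1 labels.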
|
1560 |
+
sent1 = []
|
1561 |
+
sent2 = []
|
1562 |
+
true_score = []
|
1563 |
+
for i in range(len(self.dataset[dataset_name])):
|
1564 |
+
context = self.dataset[dataset_name].iloc[i]['grounding']
|
1565 |
+
claim = self.dataset[dataset_name].iloc[i]['generated_text']
|
1566 |
+
sent1.append(context)
|
1567 |
+
sent2.append(self.clean_text(context, claim))
|
1568 |
+
true_score.append(self.dataset[dataset_name].iloc[i]['label'])
|
1569 |
+
|
1570 |
+
pred_score = self.align_func(sent1, sent2)[1].tolist()
|
1571 |
+
roc_auc = roc_auc_score(true_score, pred_score)
|
1572 |
+
|
1573 |
+
self.print_result_table({
|
1574 |
+
'Dataset_name': dataset_name,
|
1575 |
+
'F1': self.get_f1(true_score, pred_score),
|
1576 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
1577 |
+
'AUC': [roc_auc]
|
1578 |
+
})
|
1579 |
+
|
1580 |
+
def evaluate_true_begin(self):
|
1581 |
+
dataset_name = 'true_begin'
|
1582 |
+
self.true_task_helper(dataset_name)
|
1583 |
+
|
1584 |
+
|
1585 |
+
def evaluate_true_dialfact(self):
|
1586 |
+
dataset_name = 'true_dialfact'
|
1587 |
+
self.true_task_helper(dataset_name)
|
1588 |
+
|
1589 |
+
def evaluate_true_fever(self):
|
1590 |
+
dataset_name = 'true_fever'
|
1591 |
+
self.true_task_helper(dataset_name)
|
1592 |
+
|
1593 |
+
def evaluate_true_frank(self):
|
1594 |
+
dataset_name = 'true_frank'
|
1595 |
+
self.true_task_helper(dataset_name)
|
1596 |
+
|
1597 |
+
def evaluate_true_mnbm(self):
|
1598 |
+
dataset_name = 'true_mnbm'
|
1599 |
+
self.true_task_helper(dataset_name)
|
1600 |
+
|
1601 |
+
def evaluate_true_paws(self):
|
1602 |
+
dataset_name = 'true_paws'
|
1603 |
+
self.true_task_helper(dataset_name)
|
1604 |
+
|
1605 |
+
def evaluate_true_q2(self):
|
1606 |
+
dataset_name = 'true_q2'
|
1607 |
+
self.true_task_helper(dataset_name)
|
1608 |
+
|
1609 |
+
def evaluate_true_qags_cnndm(self):
|
1610 |
+
dataset_name = 'true_qags_cnndm'
|
1611 |
+
self.true_task_helper(dataset_name)
|
1612 |
+
|
1613 |
+
def evaluate_true_qags_xsum(self):
|
1614 |
+
dataset_name = 'true_qags_xsum'
|
1615 |
+
self.true_task_helper(dataset_name)
|
1616 |
+
|
1617 |
+
def evaluate_true_summeval(self):
|
1618 |
+
dataset_name = 'true_summeval'
|
1619 |
+
self.true_task_helper(dataset_name)
|
1620 |
+
|
1621 |
+
def evaluate_true_vitc(self):
|
1622 |
+
dataset_name = 'true_vitc'
|
1623 |
+
self.true_task_helper(dataset_name)
|
1624 |
+
|
1625 |
+
def get_summac_thres(self, dataset_name):
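# Sweep thresholds from 0.000 to 1.000 in steps of 0.001 on the SummaC
# validation split and return the one that maximizes balanced accuracy;
# summac_task_helper then reports BalancedAcc at this tuned threshold.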
|
1626 |
+
sent1 = []
|
1627 |
+
sent2 = []
|
1628 |
+
true_score = []
|
1629 |
+
for example in self.summac_validation_set[dataset_name]:
|
1630 |
+
sent1.append(example['document'])
|
1631 |
+
sent2.append(self.clean_text(example['document'], example['claim'])) #
|
1632 |
+
true_score.append(example['label'])
|
1633 |
+
|
1634 |
+
pred_score = self.align_func(sent1, sent2)[1].tolist()
|
1635 |
+
|
1636 |
+
thres_result = []
|
1637 |
+
for i in range(1001):
|
1638 |
+
thres = i / 1000
|
1639 |
+
thres_result.append((thres, balanced_accuracy_score(true_score, [p>thres for p in pred_score])))
|
1640 |
+
|
1641 |
+
best_thres = sorted(thres_result, key=lambda x: x[1], reverse=True)[0]
|
1642 |
+
print(f"best thres for {dataset_name} is {best_thres[0]} @ {best_thres[1]}")
|
1643 |
+
|
1644 |
+
return best_thres[0]
|
1645 |
+
|
1646 |
+
def summac_task_helper(self, dataset_name):
|
1647 |
+
sent1 = []
|
1648 |
+
sent2 = []
|
1649 |
+
true_score = []
|
1650 |
+
for example in self.dataset[dataset_name]:
|
1651 |
+
sent1.append(example['document'])
|
1652 |
+
sent2.append(self.clean_text(example['document'], example['claim']))
|
1653 |
+
true_score.append(example['label'])
|
1654 |
+
|
1655 |
+
pred_score = self.align_func(sent1, sent2)[1].tolist()
|
1656 |
+
roc_auc = roc_auc_score(true_score, pred_score)
|
1657 |
+
|
1658 |
+
balanced_acc_thres = self.get_summac_thres(dataset_name)
|
1659 |
+
|
1660 |
+
self.print_result_table({
|
1661 |
+
'Dataset_name': dataset_name,
|
1662 |
+
'F1': self.get_f1(true_score, pred_score),
|
1663 |
+
'Accuracy': self.get_accuracy(true_score, pred_score),
|
1664 |
+
'BalancedAcc': self.get_balanced_accuracy(true_score, pred_score, thres=balanced_acc_thres),
|
1665 |
+
'threshold': balanced_acc_thres,
|
1666 |
+
'AUC': [roc_auc]
|
1667 |
+
})
|
1668 |
+
|
1669 |
+
def evaluate_summac_cogensumm(self):
|
1670 |
+
dataset_name = 'summac_cogensumm'
|
1671 |
+
self.summac_task_helper(dataset_name)
|
1672 |
+
|
1673 |
+
def evaluate_summac_xsumfaith(self):
|
1674 |
+
dataset_name = 'summac_xsumfaith'
|
1675 |
+
self.summac_task_helper(dataset_name)
|
1676 |
+
|
1677 |
+
def evaluate_summac_polytope(self):
|
1678 |
+
dataset_name = 'summac_polytope'
|
1679 |
+
self.summac_task_helper(dataset_name)
|
1680 |
+
|
1681 |
+
def evaluate_summac_factcc(self):
|
1682 |
+
dataset_name = 'summac_factcc'
|
1683 |
+
self.summac_task_helper(dataset_name)
|
1684 |
+
|
1685 |
+
def evaluate_summac_summeval(self):
|
1686 |
+
dataset_name = 'summac_summeval'
|
1687 |
+
self.summac_task_helper(dataset_name)
|
1688 |
+
|
1689 |
+
def evaluate_summac_frank(self):
|
1690 |
+
dataset_name = 'summac_frank'
|
1691 |
+
self.summac_task_helper(dataset_name)
|
1692 |
+
|
1693 |
+
def evaluate_beir(self):
|
1694 |
+
from beir import util, LoggingHandler
|
1695 |
+
from beir.datasets.data_loader import GenericDataLoader
|
1696 |
+
from beir.retrieval.evaluation import EvaluateRetrieval
|
1697 |
+
from beir.retrieval.search.lexical import BM25Search as BM25
|
1698 |
+
from beir.reranking.models import CrossEncoder
|
1699 |
+
from beir.reranking import Rerank
|
1700 |
+
|
1701 |
+
import pathlib, os
|
1702 |
+
import logging
|
1703 |
+
import random
|
1704 |
+
|
1705 |
+
#### Just some code to print debug information to stdout
|
1706 |
+
logging.basicConfig(format='%(asctime)s - %(message)s',
|
1707 |
+
datefmt='%Y-%m-%d %H:%M:%S',
|
1708 |
+
level=logging.INFO,
|
1709 |
+
handlers=[LoggingHandler()])
|
1710 |
+
#### /print debug information to stdout
|
1711 |
+
|
1712 |
+
#### Download trec-covid.zip dataset and unzip the dataset
|
1713 |
+
for beir_dataset_name in ['msmarco', 'trec-covid', 'nfcorpus', 'nq', 'hotpotqa', 'fiqa',
|
1714 |
+
'arguana', 'webis-touche2020', 'cqadupstack', 'quora',
|
1715 |
+
'dbpedia-entity', 'scidocs', 'fever', 'climate-fever', 'scifact']:
|
1716 |
+
# for beir_dataset_name in ['fever']:
|
1717 |
+
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(beir_dataset_name)
|
1718 |
+
# out_dir = os.path.join(pathlib.Path(__file__).parent.absolute(), "datasets")
|
1719 |
+
out_dir = f"./data/eval/beir/{beir_dataset_name}/"
|
1720 |
+
data_path = util.download_and_unzip(url, out_dir)
|
1721 |
+
|
1722 |
+
#### Provide the data path where trec-covid has been downloaded and unzipped to the data loader
|
1723 |
+
# data folder would contain these files:
|
1724 |
+
# (1) trec-covid/corpus.jsonl (format: jsonlines)
|
1725 |
+
# (2) trec-covid/queries.jsonl (format: jsonlines)
|
1726 |
+
# (3) trec-covid/qrels/test.tsv (format: tsv ("\t"))
|
1727 |
+
|
1728 |
+
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")
|
1729 |
+
|
1730 |
+
#########################################
|
1731 |
+
#### (1) RETRIEVE Top-100 docs using BM25
|
1732 |
+
#########################################
|
1733 |
+
|
1734 |
+
#### Provide parameters for Elasticsearch
|
1735 |
+
# print(corpus)
|
1736 |
+
hostname = "localhost" #localhost
|
1737 |
+
index_name = beir_dataset_name # trec-covid
|
1738 |
+
initialize = True # False
|
1739 |
+
|
1740 |
+
model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)
|
1741 |
+
retriever = EvaluateRetrieval(model, k_values=[1,3,5,10,100,1000])
|
1742 |
+
|
1743 |
+
#### Retrieve dense results (format of results is identical to qrels)
|
1744 |
+
results = retriever.retrieve(corpus, queries)
|
1745 |
+
|
1746 |
+
# Rerank top-100 results using the reranker provided
|
1747 |
+
reranker = Rerank(self.align_func)
|
1748 |
+
rerank_results = reranker.rerank(corpus, queries, results, top_k=100)
|
1749 |
+
|
1750 |
+
#### Evaluate your retrieval using NDCG@k, MAP@K ...
|
1751 |
+
ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, rerank_results, retriever.k_values)
|
1752 |
+
|
1753 |
+
self.print_result_table({
|
1754 |
+
'Dataset_name': beir_dataset_name,
|
1755 |
+
'ndcg': ndcg,
|
1756 |
+
'map': _map,
|
1757 |
+
'recall': recall,
|
1758 |
+
'precision': precision
|
1759 |
+
})
|
1760 |
+
def evaluate_xxx(self):
|
1761 |
+
pass
|
1762 |
+
|
1763 |
+
class evaluateMultiCheckpoints:
|
1764 |
+
def __init__(self, config, device='cuda:0') -> None:
|
1765 |
+
sample_checkpoint = {
|
1766 |
+
'backbone': 'roberta-base',
|
1767 |
+
'task_name': 'align-wo-finetune | align-finetune | roberta-finetune-baseline | nli-wo-finetune | nli-finetune',
|
1768 |
+
'path': 'some path',
|
1769 |
+
'result_save_path': 'some path'
|
1770 |
+
}
|
1771 |
+
self.config = config ## a dictionary
|
1772 |
+
self.device = device
|
1773 |
+
|
1774 |
+
self.tasks = [
|
1775 |
+
'summeval', 'qags_xsum', 'qags_cnndm', 'persona_chat', 'topical_chat',
|
1776 |
+
'mnli_mismatched', 'mnli_matched',
|
1777 |
+
'sick', 'yelp', 'stsb',
|
1778 |
+
'anli_1','anli_2', 'anli_3', 'snli', 'vitaminc',
|
1779 |
+
'mrpc', 'paws', 'sem_eval', 'paws_qqp', 'qqp',
|
1780 |
+
'newsroom', 'rank19', 'bagel', 'race_m', 'race_h'
|
1781 |
+
]
|
1782 |
+
|
1783 |
+
def experimentForSlide1216(self):
|
1784 |
+
for ckpt in self.config:
|
1785 |
+
self.evaluateOneCheckpoint(ckpt)
|
1786 |
+
def evaluateOneCheckpoint(self, ckpt):
|
1787 |
+
model_name = ckpt['path'].split('/')[-1].split('.ckpt')[0]
|
1788 |
+
infer = Inferencer(ckpt_path=ckpt['path'],
|
1789 |
+
model=ckpt['backbone'], batch_size=32, device=self.device)
|
1790 |
+
evaluator = Evaluator(eval_tasks=self.tasks, align_func=infer.inference, save_all_tables=True)
|
1791 |
+
|
1792 |
+
evaluator.result_save_name = f"{ckpt['result_save_path']}{model_name}"
|
1793 |
+
evaluator.evaluate()
|
alignscore/generate_training_data.py
ADDED
@@ -0,0 +1,1519 @@
|
1 |
+
from logging import error
|
2 |
+
from datasets import load_dataset
|
3 |
+
import transformers
|
4 |
+
from random import sample
|
5 |
+
import random
|
6 |
+
import torch
|
7 |
+
import json
|
8 |
+
from tqdm import tqdm
|
9 |
+
from nltk.translate.bleu_score import sentence_bleu
|
10 |
+
import pandas as pd
|
11 |
+
import re
|
12 |
+
|
13 |
+
|
14 |
+
'''
|
15 |
+
data format
|
16 |
+
{text_a, text_b, label: None or 0/1}
|
17 |
+
'''
|
18 |
+
DATASET_HUGGINGFACE = {
|
19 |
+
'cnndm': ['cnn_dailymail', '3.0.0', 'train'],
|
20 |
+
'mnli': ['multi_nli', 'default', 'train'],
|
21 |
+
'squad': ['squad', 'plain_text', 'train'],
|
22 |
+
'squad_v2': ['squad_v2', 'squad_v2', 'train'],
|
23 |
+
'paws': ['paws', 'labeled_final', 'train'],
|
24 |
+
'vitaminc': ['tals/vitaminc', 'v1.0', 'train'],
|
25 |
+
'xsum': ['xsum', 'default', 'train'],
|
26 |
+
'stsb': ['glue', 'stsb', 'train'],
|
27 |
+
'sick': ['sick', 'default', 'train'],
|
28 |
+
'race': ['race', 'all', 'train'],
|
29 |
+
'race_val': ['race', 'all', 'validation'],
|
30 |
+
'anli_r1': ['anli', 'plain_text', 'train_r1'],
|
31 |
+
'anli_r2': ['anli', 'plain_text', 'train_r2'],
|
32 |
+
'anli_r3': ['anli', 'plain_text', 'train_r3'],
|
33 |
+
'snli': ['snli', 'plain_text', 'train'],
|
34 |
+
'wikihow': ['wikihow', 'all', 'train'],
|
35 |
+
'mrpc': ['glue', 'mrpc', 'train'],
|
36 |
+
'msmarco': ['ms_marco', 'v2.1', 'train'],
|
37 |
+
'mrpc_val': ['glue', 'mrpc', 'validation'],
|
38 |
+
'paws_val': ['paws', 'labeled_final', 'validation'],
|
39 |
+
'paws_unlabeled': ['paws', 'unlabeled_final', 'train'],
|
40 |
+
'qqp': ['glue', 'qqp', 'train'],
|
41 |
+
'qqp_val': ['glue', 'qqp', 'validation'],
|
42 |
+
'squad_v2_new': ['squad_v2', 'squad_v2', 'train'],
|
43 |
+
'adversarial_qa': ['adversarial_qa', 'adversarialQA', 'train'],
|
44 |
+
'drop': ['drop', 'train'],
|
45 |
+
'duorc_self': ['duorc', 'SelfRC', 'train'],
|
46 |
+
'duorc_paraphrase': ['duorc', 'ParaphraseRC', 'train'],
|
47 |
+
'quoref': ['quoref', 'train'],
|
48 |
+
'hotpot_qa_distractor': ['hotpot_qa', 'distractor', 'train'],
|
49 |
+
'hotpot_qa_fullwiki': ['hotpot_qa', 'fullwiki', 'train'],
|
50 |
+
'ropes': ['ropes', 'train'],
|
51 |
+
'boolq': ['boolq', 'train'],
|
52 |
+
'eraser_multi_rc': ['eraser_multi_rc', 'train'],
|
53 |
+
'quail': ['quail', 'train'],
|
54 |
+
'sciq': ['sciq', 'train'],
|
55 |
+
'strategy_qa': ['metaeval/strategy-qa', 'train'],
|
56 |
+
'gap': ['gap', 'train'],
|
57 |
+
}
|
58 |
+
|
59 |
+
DATASET_CONFIG = {
|
60 |
+
'cnndm': {'task': 'summarization', 'text_a': 'article', 'text_b': 'highlights', 'label': None, 'huggingface': True},
|
61 |
+
'mnli': {'task': 'nli', 'text_a': 'premise', 'text_b': 'hypothesis', 'label': 'label', 'huggingface': True},
|
62 |
+
'nli_fever': {'task': 'fact_checking', 'text_a': 'context', 'text_b': 'query', 'label': 'label','huggingface': False, 'using_hf_api': False, 'using_pandas': False, 'using_json':True, 'data_path':'data/nli_fever/train_fitems.jsonl' },
|
63 |
+
'doc_nli': {'task': 'bin_nli', 'text_a': 'premise', 'text_b': 'hypothesis', 'label': 'label','huggingface': False, 'using_hf_api': False, 'using_pandas': False, 'using_json':True, 'data_path':'data/DocNLI_dataset/train.json' },
|
64 |
+
'squad': {'task': 'extractive_qa', 'text_a': 'context', 'text_b': ['question', 'answers'], 'label': None, 'huggingface': True},
|
65 |
+
'squad_v2': {'task': 'qa', 'text_a': 'context', 'text_b': ['question', 'answers'], 'label': None, 'huggingface': True},
|
66 |
+
'paws': {'task': 'paraphrase', 'text_a': 'sentence1', 'text_b': 'sentence2', 'label': 'label', 'huggingface': True},
|
67 |
+
'vitaminc': {'task': 'fact_checking', 'text_a': 'evidence', 'text_b': 'claim', 'label': 'label', 'huggingface': True},
|
68 |
+
'xsum': {'task': 'summarization', 'text_a': 'document', 'text_b': 'summary', 'label': None, 'huggingface': True, 'cliff_path': 'data/model_generated_data/cliff_summ/xsum_train.jsonl'},
|
69 |
+
'stsb': {'task': 'sts', 'text_a': 'sentence1', 'text_b': 'sentence2', 'label': 'label', 'huggingface': True},
|
70 |
+
'sick': {'task': 'sts', 'text_a': 'sentence_A', 'text_b': 'sentence_B', 'label': 'relatedness_score', 'huggingface': True},
|
71 |
+
'race': {'task': 'qa', 'text_a': 'article', 'text_b': ['question', 'options'], 'label': 'answer', 'huggingface': True},
|
72 |
+
'race_val': {'task': 'qa', 'text_a': 'article', 'text_b': ['question', 'options'], 'label': 'answer', 'huggingface': True},
|
73 |
+
'anli_r1': {'task': 'nli', 'text_a': 'premise', 'text_b': 'hypothesis', 'label': 'label', 'huggingface': True},
|
74 |
+
'anli_r2': {'task': 'nli', 'text_a': 'premise', 'text_b': 'hypothesis', 'label': 'label', 'huggingface': True},
|
75 |
+
'anli_r3': {'task': 'nli', 'text_a': 'premise', 'text_b': 'hypothesis', 'label': 'label', 'huggingface': True},
|
76 |
+
'snli': {'task': 'nli', 'text_a': 'premise', 'text_b': 'hypothesis', 'label': 'label', 'huggingface': True},
|
77 |
+
'wikihow': {'task': 'summarization', 'text_a': 'text', 'text_b': 'headline', 'label': None, 'huggingface': False, 'using_hf_api': True, 'data_dir': 'data/wikihow_raw'},
|
78 |
+
'mrpc': {'task': 'paraphrase', 'text_a': 'sentence1', 'text_b': 'sentence2', 'label': 'label','huggingface': True},
|
79 |
+
'mrpc_val': {'task': 'paraphrase', 'text_a': 'sentence1', 'text_b': 'sentence2', 'label': 'label','huggingface': True},
|
80 |
+
'paws_val': {'task': 'paraphrase', 'text_a': 'sentence1', 'text_b': 'sentence2', 'label': 'label', 'huggingface': True},
|
81 |
+
'paws_unlabeled': {'task': 'paraphrase', 'text_a': 'sentence1', 'text_b': 'sentence2', 'label': 'label', 'huggingface': True},
|
82 |
+
'msmarco': {'task': 'ir', 'text_a': 'passages', 'text_b': ['query', 'answers'], 'label': None,'huggingface': True},
|
83 |
+
'paws_qqp': {'task': 'paraphrase', 'text_a': 'sentence1', 'text_b': 'sentence2', 'label': None,'huggingface': False, 'using_hf_api': False, 'using_pandas': True, 'data_path':'paws_qqp/output/train.tsv' },
|
84 |
+
'wiki103': {'task': 'paraphrase', 'text_a': 'original_sent', 'text_b': 'paraphrase', 'label': None,'huggingface': False, 'using_hf_api': False, 'using_pandas': False, 'using_json': True, 'data_path':'data/model_generated_data/backtranslation/wiki103_single_sent_backtranslation.json'},
|
85 |
+
'qqp': {'task': 'paraphrase', 'text_a':'question1', 'text_b':'question2', 'label': 'label', 'huggingface': True},
|
86 |
+
'qqp_val': {'task': 'paraphrase', 'text_a':'question1', 'text_b':'question2', 'label': 'label', 'huggingface': True},
|
87 |
+
'wmt17xxx': {'task': 'wmt', 'text_a': 'ref', 'text_b': 'mt', 'label': 'score','huggingface': False, 'using_hf_api': False, 'using_pandas': True, 'data_path':'data/wmt/wmt17/2017-da.csv' },
|
88 |
+
'wmt15': {'task': 'wmt', 'text_a': 'ref', 'text_b': 'mt', 'label': 'score','huggingface': False, 'using_hf_api': False, 'using_pandas': False, 'using_json':True, 'data_path':'data/eval/wmt15_eval.jsonl' },
|
89 |
+
'wmt16': {'task': 'wmt', 'text_a': 'ref', 'text_b': 'mt', 'label': 'score','huggingface': False, 'using_hf_api': False, 'using_pandas': False, 'using_json':True, 'data_path':'data/eval/wmt16_eval.jsonl' },
|
90 |
+
'wmt17': {'task': 'wmt', 'text_a': 'ref', 'text_b': 'mt', 'label': 'score','huggingface': False, 'using_hf_api': False, 'using_pandas': False, 'using_json':True, 'data_path':'data/eval/wmt17_eval.jsonl' },
|
91 |
+
'wmt18': {'task': 'wmt', 'text_a': 'ref', 'text_b': 'mt', 'label': 'score','huggingface': False, 'using_hf_api': False, 'using_pandas': False, 'using_json':True, 'data_path':'data/eval/wmt18_eval.jsonl' },
|
92 |
+
'wmt19': {'task': 'wmt', 'text_a': 'ref', 'text_b': 'mt', 'label': 'score','huggingface': False, 'using_hf_api': False, 'using_pandas': False, 'using_json':True, 'data_path':'data/eval/wmt19_eval.jsonl' },
|
93 |
+
'squad_v2_new': {'task': 'qa', 'huggingface': True},
|
94 |
+
'adversarial_qa': {'task': 'qa', 'huggingface': True},
|
95 |
+
'drop': {'task': 'qa', 'huggingface': True},
|
96 |
+
'duorc_self': {'task': 'qa', 'huggingface': True},
|
97 |
+
'duorc_paraphrase': {'task': 'qa', 'huggingface': True},
|
98 |
+
'quoref': {'task': 'qa', 'huggingface': True},
|
99 |
+
'hotpot_qa_distractor': {'task': 'qa', 'huggingface': True},
|
100 |
+
'hotpot_qa_fullwiki': {'task': 'qa', 'huggingface': True},
|
101 |
+
'newsqa': {'task': 'qa', 'using_json': True, 'raw_json': True, 'data_path': 'data/newsqa_raw/combined-newsqa-data-v1.json'},
|
102 |
+
'ropes': {'task': 'qa', 'huggingface': True},
|
103 |
+
'boolq': {'task': 'qa', 'huggingface': True},
|
104 |
+
'eraser_multi_rc': {'task': 'qa', 'huggingface': True},
|
105 |
+
'quail': {'task': 'qa', 'huggingface': True},
|
106 |
+
'sciq': {'task': 'qa', 'huggingface': True},
|
107 |
+
'strategy_qa': {'task': 'qa', 'huggingface': True},
|
108 |
+
'gap': {'task': 'coreference', 'huggingface': True},
|
109 |
+
}
|
110 |
+
|
111 |
+
|
112 |
+
class QA2D():
|
113 |
+
def __init__(self, batch_size=32, device='cuda', verbose=True) -> None:
|
114 |
+
from transformers import BartTokenizer, BartForConditionalGeneration
|
115 |
+
self.tokenizer = BartTokenizer.from_pretrained("MarkS/bart-base-qa2d")
|
116 |
+
self.model = BartForConditionalGeneration.from_pretrained("MarkS/bart-base-qa2d").to(device)
|
117 |
+
self.batch_size = batch_size
|
118 |
+
self.device=device
|
119 |
+
self.verbose = verbose
|
120 |
+
|
121 |
+
def generate(self, questions: list, answers: list):
|
122 |
+
assert len(questions) == len(answers)
|
123 |
+
qa_list = []
|
124 |
+
for q, a in zip(questions, answers):
|
125 |
+
qa_list.append(f"question: {q} answer: {a}")
|
126 |
+
output = []
|
127 |
+
for qa_pairs in tqdm(
|
128 |
+
self.chunks(qa_list, self.batch_size),
|
129 |
+
desc="QA to Declarative",
|
130 |
+
total=int(len(qa_list)/self.batch_size),
|
131 |
+
disable=(not self.verbose)
|
132 |
+
):
|
133 |
+
input_text = qa_pairs
|
134 |
+
input_token = self.tokenizer(
|
135 |
+
input_text, return_tensors='pt', padding=True, truncation=True).to(self.device)
|
136 |
+
dec_sents = self.model.generate(
|
137 |
+
input_token.input_ids, max_length=512)
|
138 |
+
result = self.tokenizer.batch_decode(
|
139 |
+
dec_sents, skip_special_tokens=True)
|
140 |
+
output.extend(result)
|
141 |
+
|
142 |
+
return output
|
143 |
+
|
144 |
+
def chunks(self, lst, n):
|
145 |
+
"""Yield successive n-sized chunks from lst."""
|
146 |
+
for i in range(0, len(lst), n):
|
147 |
+
yield lst[i:i + n]
|
148 |
+
|
149 |
+
|
150 |
+
class QAnswering():
|
151 |
+
"""
|
152 |
+
To answer not-answerable questions
|
153 |
+
"""
|
154 |
+
|
155 |
+
def __init__(self, batch_size=32, device='cuda') -> None:
|
156 |
+
from transformers import T5Tokenizer, T5ForConditionalGeneration
|
157 |
+
self.tokenizer = T5Tokenizer.from_pretrained(
|
158 |
+
"valhalla/t5-base-qa-qg-hl")
|
159 |
+
self.model = T5ForConditionalGeneration.from_pretrained(
|
160 |
+
"valhalla/t5-base-qa-qg-hl").to(device)
|
161 |
+
self.batch_size = batch_size
|
162 |
+
self.device = device
|
163 |
+
|
164 |
+
def generate(self, questions: list, contexts: list):
|
165 |
+
assert len(questions) == len(contexts)
|
166 |
+
answers = []
|
167 |
+
for qs, cs in tqdm(zip(self.chunks(questions, self.batch_size), self.chunks(contexts, self.batch_size)), desc="Generating Answers for not answerable", total=int(len(questions)/self.batch_size)):
|
168 |
+
qc_pairs = []
|
169 |
+
assert len(qs) == len(cs)
|
170 |
+
for one_q, one_c in zip(qs, cs):
|
171 |
+
qc_pairs.append(f"""question: {one_q} context: {one_c}""")
|
172 |
+
input_ids = self.tokenizer(
|
173 |
+
qc_pairs, padding=True, truncation=True, return_tensors='pt').to(self.device).input_ids
|
174 |
+
outputs = self.model.generate(input_ids, max_length=512)
|
175 |
+
answers.extend(self.tokenizer.batch_decode(
|
176 |
+
outputs, skip_special_tokens=True))
|
177 |
+
|
178 |
+
return answers
|
179 |
+
|
180 |
+
def chunks(self, lst, n):
|
181 |
+
"""Yield successive n-sized chunks from lst."""
|
182 |
+
for i in range(0, len(lst), n):
|
183 |
+
yield lst[i:i + n]
|
184 |
+
|
185 |
+
|
186 |
+
class MLMGeneratorWithPairedData():
|
187 |
+
def __init__(self, corpra: list, device='cuda', batch_size=8, mask_percent=0.25) -> None:
|
188 |
+
self.device = device
|
189 |
+
self.tokenizer = transformers.DistilBertTokenizer.from_pretrained(
|
190 |
+
"distilbert-base-uncased")
|
191 |
+
self.model = transformers.DistilBertForMaskedLM.from_pretrained(
|
192 |
+
"distilbert-base-uncased").to(self.device)
|
193 |
+
self.mask_percent = mask_percent
|
194 |
+
self.batch_size = batch_size
|
195 |
+
|
196 |
+
self.dataset = corpra # text needs to be noised
|
197 |
+
|
198 |
+
def chunks(self, lst, n):
|
199 |
+
"""Yield successive n-sized chunks from lst."""
|
200 |
+
for i in range(0, len(lst), n):
|
201 |
+
yield lst[i:i + n]
|
202 |
+
|
203 |
+
def generate(self):
|
204 |
+
sents_output = []
|
205 |
+
for examples in tqdm(self.chunks(self.dataset, self.batch_size), total=int(len(self.dataset)/self.batch_size), desc="MLM Generating"):
|
206 |
+
sents_to_be_noised = [each for each in examples]
|
207 |
+
sents_noised = self.mlm_infiller(sents_to_be_noised)
|
208 |
+
|
209 |
+
sents_output.extend(sents_noised)
|
210 |
+
|
211 |
+
return sents_output
|
212 |
+
|
213 |
+
def mlm_infiller(self, batch):
|
214 |
+
"""
|
215 |
+
input a batch of sentences, list
|
216 |
+
"""
|
217 |
+
masked_batch = []
|
218 |
+
masked_batch_ids = []
|
219 |
+
for each_sent in batch:
|
220 |
+
sent_tokens = self.tokenizer.tokenize(each_sent)
|
221 |
+
sent_token_ids = self.tokenizer(each_sent)['input_ids']
|
222 |
+
mask_list = sample(list(range(len(sent_tokens))), int(
|
223 |
+
self.mask_percent * len(sent_tokens)))
|
224 |
+
sent_tokens = [
|
225 |
+
each if i not in mask_list else self.tokenizer.mask_token for i, each in enumerate(sent_tokens)]
|
226 |
+
masked_batch_ids.append(
|
227 |
+
[each if i-1 not in mask_list else self.tokenizer.mask_token_id for i, each in enumerate(sent_token_ids)])
|
228 |
+
masked_batch.append(' '.join(sent_tokens))
|
229 |
+
|
230 |
+
inputs = self.tokenizer(
|
231 |
+
masked_batch, padding=True, truncation=True, return_tensors="pt").to(self.device)
|
232 |
+
with torch.no_grad():
|
233 |
+
logits = self.model(**inputs).logits
|
234 |
+
|
235 |
+
infill_tokens = []
|
236 |
+
for i in range(len(masked_batch)):
|
237 |
+
mask_token_index = (inputs.input_ids == self.tokenizer.mask_token_id)[
|
238 |
+
i].nonzero(as_tuple=True)[0]
|
239 |
+
|
240 |
+
predicted_token_id = logits[i, mask_token_index].argmax(axis=-1)
|
241 |
+
infill_tokens.append(predicted_token_id)
|
242 |
+
|
243 |
+
infilled_sent = []
|
244 |
+
for masked_sent_ids, infill_token in zip(masked_batch_ids, infill_tokens):
|
245 |
+
for infill_one_token in infill_token:
|
246 |
+
for i, each_id in enumerate(masked_sent_ids):
|
247 |
+
if each_id == self.tokenizer.mask_token_id:
|
248 |
+
masked_sent_ids[i] = infill_one_token
|
249 |
+
break
|
250 |
+
infilled_sent.append(self.tokenizer.decode(
|
251 |
+
masked_sent_ids, skip_special_tokens=True))
|
252 |
+
|
253 |
+
return infilled_sent
|
254 |
+
|
255 |
+
|
256 |
+
class ExtractiveSummarizationGenerator():
|
257 |
+
def __init__(self) -> None:
|
258 |
+
pass
|
259 |
+
|
260 |
+
def generate(self, texts):
|
261 |
+
'''
|
262 |
+
texts: list of string
|
263 |
+
'''
|
264 |
+
from summa.summarizer import summarize
|
265 |
+
|
266 |
+
summaries = []
|
267 |
+
for text in tqdm(texts, desc="Extracting Summary"):
|
268 |
+
for prop in range(1, 20):
|
269 |
+
summ = summarize(text, ratio=prop/20.)
|
270 |
+
if len(summ) > 0:
|
271 |
+
break
|
272 |
+
summaries.append(summ)
|
273 |
+
|
274 |
+
return summaries
|
275 |
+
|
276 |
+
|
277 |
+
class DataGenerator():
|
278 |
+
def __init__(self, dataset_names) -> None:
|
279 |
+
self.dataset_names = dataset_names
|
280 |
+
self.datasets = dict()
|
281 |
+
self.t5_qa = None
|
282 |
+
self.t5_tokenizer = None
|
283 |
+
|
284 |
+
self.load_dataset_from_huggingface()
|
285 |
+
|
286 |
+
def load_dataset_from_huggingface(self):
|
287 |
+
for each_dataset in self.dataset_names:
|
288 |
+
if DATASET_CONFIG[each_dataset].get('huggingface'):
|
289 |
+
self.datasets[each_dataset] = load_dataset(
|
290 |
+
*DATASET_HUGGINGFACE[each_dataset][:-1])[DATASET_HUGGINGFACE[each_dataset][-1]]
|
291 |
+
elif DATASET_CONFIG[each_dataset].get('using_hf_api'):
|
292 |
+
self.datasets[each_dataset] = load_dataset(
|
293 |
+
*DATASET_HUGGINGFACE[each_dataset][:-1], data_dir=DATASET_CONFIG[each_dataset]['data_dir'])[DATASET_HUGGINGFACE[each_dataset][-1]]
|
294 |
+
elif DATASET_CONFIG[each_dataset].get('using_pandas'):
|
295 |
+
if DATASET_CONFIG[each_dataset]['data_path'].split('.')[-1] == 'tsv':
|
296 |
+
self.datasets[each_dataset] = pd.read_csv(
|
297 |
+
DATASET_CONFIG[each_dataset]['data_path'], sep='\t')
|
298 |
+
elif DATASET_CONFIG[each_dataset]['data_path'].split('.')[-1] == 'csv':
|
299 |
+
self.datasets[each_dataset] = pd.read_csv(
|
300 |
+
DATASET_CONFIG[each_dataset]['data_path'])
|
301 |
+
elif DATASET_CONFIG[each_dataset].get('using_json'):
|
302 |
+
self.datasets[each_dataset] = []
|
303 |
+
if DATASET_CONFIG[each_dataset].get('raw_json'):
|
304 |
+
with open(DATASET_CONFIG[each_dataset]['data_path'], 'r', encoding='utf8') as f:
|
305 |
+
self.datasets[each_dataset] = json.load(f)
|
306 |
+
else:
|
307 |
+
try:
|
308 |
+
json_file = json.load(
|
309 |
+
open(DATASET_CONFIG[each_dataset]['data_path'], 'r', encoding='utf8'))
|
310 |
+
for example in json_file:
|
311 |
+
self.datasets[each_dataset].append(example)
|
312 |
+
except:
|
313 |
+
with open(DATASET_CONFIG[each_dataset]['data_path'], 'r', encoding='utf8') as f:
|
314 |
+
for example in f:
|
315 |
+
self.datasets[each_dataset].append(
|
316 |
+
json.loads(example))
|
317 |
+
else:
|
318 |
+
error('unable to locate raw dataset...')
|
319 |
+
|
320 |
+
def process_squad(self):
|
321 |
+
from rake_nltk import Rake
|
322 |
+
r = Rake()
|
323 |
+
topk = 5
|
324 |
+
threshold = 0.6
|
325 |
+
|
326 |
+
output = []
|
327 |
+
label = -1
|
328 |
+
for example in tqdm(self.datasets['squad'], desc=f'Constructing squad'):
|
329 |
+
text_a = example[DATASET_CONFIG['squad']['text_a']]
|
330 |
+
question = example[DATASET_CONFIG['squad']['text_b'][0]]
|
331 |
+
answer = example[DATASET_CONFIG['squad']
|
332 |
+
['text_b'][1]]['text'] # a list
|
333 |
+
text_b = [question+' '+answer_ele for answer_ele in answer]
|
334 |
+
text_c = []
|
335 |
+
|
336 |
+
r.extract_keywords_from_text(text_a)
|
337 |
+
keywords_in_context = r.get_ranked_phrases()[:topk]
|
338 |
+
for each_keyword in keywords_in_context:
|
339 |
+
# then it is an incorrect answer
|
340 |
+
if sentence_bleu([answer_ele.lower().split() for answer_ele in answer], each_keyword.split(), weights=(0.33, 0.33, 0.33)) < threshold:
|
341 |
+
text_c.append(question+' '+each_keyword)
|
342 |
+
|
343 |
+
output.append({
|
344 |
+
'text_a': text_a,
|
345 |
+
'text_b': text_b,
|
346 |
+
'text_c': text_c,
|
347 |
+
'label': label
|
348 |
+
})
|
349 |
+
|
350 |
+
return output
|
351 |
+
|
352 |
+
def process_squad_v2(self):
|
353 |
+
# first collect answerable items
|
354 |
+
not_answerable_contexts = []
|
355 |
+
not_answerable_questions = []
|
356 |
+
not_answerable_answers = []
|
357 |
+
|
358 |
+
answerable_contexts = []
|
359 |
+
answerable_questions = []
|
360 |
+
answerable_answers = []
|
361 |
+
|
362 |
+
qa_generator = QAnswering(batch_size=32, device='cuda')
|
363 |
+
qa2d_generator = QA2D(batch_size=32, device='cuda')
|
364 |
+
|
365 |
+
for example in tqdm(self.datasets['squad_v2'], desc=f'Collecting (not)answerable examples'):
|
366 |
+
if len(example['answers']['text']) == 0:
|
367 |
+
not_answerable_contexts.append(example['context'])
|
368 |
+
not_answerable_questions.append(example['question'])
|
369 |
+
else:
|
370 |
+
answerable_contexts.append(example['context'])
|
371 |
+
answerable_questions.append(example['question'])
|
372 |
+
answerable_answers.append(example['answers']['text'][0])
|
373 |
+
|
374 |
+
not_answerable_answers = qa_generator.generate(
|
375 |
+
not_answerable_questions, not_answerable_contexts)
|
376 |
+
answerable_declarative_sents = qa2d_generator.generate(
|
377 |
+
answerable_questions, answerable_answers)
|
378 |
+
not_answerable_declarative_sents = qa2d_generator.generate(
|
379 |
+
not_answerable_questions, not_answerable_answers)
|
380 |
+
|
381 |
+
output = []
|
382 |
+
for i, dec_sent in enumerate(answerable_declarative_sents):
|
383 |
+
output.append({
|
384 |
+
'text_a': answerable_contexts[i],
|
385 |
+
'text_b': [dec_sent],
|
386 |
+
'text_c': [],
|
387 |
+
'label': 1
|
388 |
+
})
|
389 |
+
|
390 |
+
for i, dec_sent in enumerate(not_answerable_declarative_sents):
|
391 |
+
output.append({
|
392 |
+
'text_a': not_answerable_contexts[i],
|
393 |
+
'text_b': [dec_sent],
|
394 |
+
'text_c': [],
|
395 |
+
'label': 0
|
396 |
+
})
|
397 |
+
|
398 |
+
return output
|
399 |
+
|
400 |
+
def process_race(self):
|
401 |
+
qa2d_generator = QA2D(batch_size=32, device='cuda')
|
402 |
+
option_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
|
403 |
+
output = []
|
404 |
+
|
405 |
+
correct_context = []
|
406 |
+
correct_question = []
|
407 |
+
correct_answer = []
|
408 |
+
|
409 |
+
wrong_context = []
|
410 |
+
wrong_question = []
|
411 |
+
wrong_answer = []
|
412 |
+
|
413 |
+
for example in tqdm(self.datasets['race'], desc=f'Constructing race'):
|
414 |
+
text_a = example[DATASET_CONFIG['race']['text_a']]
|
415 |
+
label = -1
|
416 |
+
question = example[DATASET_CONFIG['race']['text_b'][0]]
|
417 |
+
if "_" in question:
|
418 |
+
answer_id = option_dict[example[DATASET_CONFIG['race']['label']]]
|
419 |
+
for i, options in enumerate(example[DATASET_CONFIG['race']['text_b'][1]]):
|
420 |
+
if i == answer_id:
|
421 |
+
output.append({
|
422 |
+
'text_a': text_a,
|
423 |
+
'text_b': [' '.join(question.replace("_", " "+options+" ").split())],
|
424 |
+
'text_c': [],
|
425 |
+
'label': 1
|
426 |
+
})
|
427 |
+
else:
|
428 |
+
output.append({
|
429 |
+
'text_a': text_a,
|
430 |
+
'text_b': [' '.join(question.replace("_", " "+options+" ").split())],
|
431 |
+
'text_c': [],
|
432 |
+
'label': 0
|
433 |
+
})
|
434 |
+
else:
|
435 |
+
answer_id = option_dict[example[DATASET_CONFIG['race']['label']]]
|
436 |
+
for i, options in enumerate(example[DATASET_CONFIG['race']['text_b'][1]]):
|
437 |
+
if i == answer_id:
|
438 |
+
output.append({
|
439 |
+
'text_a': text_a,
|
440 |
+
'text_b': [question],
|
441 |
+
'text_c': [options],
|
442 |
+
'label': 1
|
443 |
+
})
|
444 |
+
else:
|
445 |
+
output.append({
|
446 |
+
'text_a': text_a,
|
447 |
+
'text_b': [question],
|
448 |
+
'text_c': [options],
|
449 |
+
'label': 0
|
450 |
+
})
|
451 |
+
|
452 |
+
return output
|
453 |
+
|
454 |
+
def process_race_val(self):
|
455 |
+
qa2d_generator = QA2D(batch_size=32, device='cuda')
|
456 |
+
option_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
|
457 |
+
output = []
|
458 |
+
|
459 |
+
correct_context = []
|
460 |
+
correct_question = []
|
461 |
+
correct_answer = []
|
462 |
+
|
463 |
+
wrong_context = []
|
464 |
+
wrong_question = []
|
465 |
+
wrong_answer = []
|
466 |
+
|
467 |
+
for example in tqdm(self.datasets['race_val'], desc=f'Constructing race_val'):
|
468 |
+
text_a = example[DATASET_CONFIG['race_val']['text_a']]
|
469 |
+
label = -1
|
470 |
+
question = example[DATASET_CONFIG['race_val']['text_b'][0]]
|
471 |
+
if "_" in question:
|
472 |
+
answer_id = option_dict[example[DATASET_CONFIG['race_val']['label']]]
|
473 |
+
for i, options in enumerate(example[DATASET_CONFIG['race_val']['text_b'][1]]):
|
474 |
+
if i == answer_id:
|
475 |
+
output.append({
|
476 |
+
'text_a': text_a,
|
477 |
+
'text_b': [' '.join(question.replace("_", " "+options+" ").split())],
|
478 |
+
'text_c': [],
|
479 |
+
'label': 1
|
480 |
+
})
|
481 |
+
else:
|
482 |
+
output.append({
|
483 |
+
'text_a': text_a,
|
484 |
+
'text_b': [' '.join(question.replace("_", " "+options+" ").split())],
|
485 |
+
'text_c': [],
|
486 |
+
'label': 0
|
487 |
+
})
|
488 |
+
else:
|
489 |
+
answer_id = option_dict[example[DATASET_CONFIG['race_val']['label']]]
|
490 |
+
for i, options in enumerate(example[DATASET_CONFIG['race_val']['text_b'][1]]):
|
491 |
+
if i == answer_id:
|
492 |
+
correct_context.append(text_a)
|
493 |
+
correct_question.append(question)
|
494 |
+
correct_answer.append(options)
|
495 |
+
else:
|
496 |
+
wrong_context.append(text_a)
|
497 |
+
wrong_question.append(question)
|
498 |
+
wrong_answer.append(options)
|
499 |
+
|
500 |
+
correct_declarative = qa2d_generator.generate(
|
501 |
+
correct_question, correct_answer)
|
502 |
+
wrong_declarative = qa2d_generator.generate(
|
503 |
+
wrong_question, wrong_answer)
|
504 |
+
assert len(correct_context) == len(correct_declarative)
|
505 |
+
assert len(wrong_context) == len(wrong_declarative)
|
506 |
+
for context, dec in zip(correct_context, correct_declarative):
|
507 |
+
output.append({
|
508 |
+
'text_a': context,
|
509 |
+
'text_b': [dec],
|
510 |
+
'text_c': [],
|
511 |
+
'label': 1
|
512 |
+
})
|
513 |
+
|
514 |
+
for context, dec in zip(wrong_context, wrong_declarative):
|
515 |
+
output.append({
|
516 |
+
'text_a': context,
|
517 |
+
'text_b': [dec],
|
518 |
+
'text_c': [],
|
519 |
+
'label': 0
|
520 |
+
})
|
521 |
+
|
522 |
+
return output
|
523 |
+
|
524 |
+
def process_race_test(self):
|
525 |
+
option_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
|
526 |
+
output = []
|
527 |
+
for example in tqdm(self.datasets['race_test'], desc=f'Constructing race_test'):
|
528 |
+
text_a = example[DATASET_CONFIG['race_test']['text_a']]
|
529 |
+
text_b = [] # pos
|
530 |
+
text_c = [] # neg
|
531 |
+
label = -1
|
532 |
+
question = example[DATASET_CONFIG['race_test']['text_b'][0]]
|
533 |
+
if "_" in question:
|
534 |
+
answer_id = option_dict[example[DATASET_CONFIG['race_test']['label']]]
|
535 |
+
for i, options in enumerate(example[DATASET_CONFIG['race_test']['text_b'][1]]):
|
536 |
+
if i == answer_id:
|
537 |
+
text_b.append(' '.join(question.replace(
|
538 |
+
"_", " "+options+" ").split()))
|
539 |
+
else:
|
540 |
+
text_c.append(' '.join(question.replace(
|
541 |
+
"_", " "+options+" ").split()))
|
542 |
+
else:
|
543 |
+
answer_id = option_dict[example[DATASET_CONFIG['race_test']['label']]]
|
544 |
+
for i, options in enumerate(example[DATASET_CONFIG['race_test']['text_b'][1]]):
|
545 |
+
if i == answer_id:
|
546 |
+
text_b.append(question+" "+options+" ")
|
547 |
+
else:
|
548 |
+
text_c.append(question+" "+options+" ")
|
549 |
+
|
550 |
+
output.append({
|
551 |
+
'text_a': text_a,
|
552 |
+
'text_b': text_b,
|
553 |
+
'text_c': text_c,
|
554 |
+
'label': label
|
555 |
+
})
|
556 |
+
|
557 |
+
return output
|
558 |
+
|
559 |
+
def process_xsum(self):
|
560 |
+
'''
|
561 |
+
text_a: raw_text
|
562 |
+
text_b: raw_summary + ***extractive summ*** removed
|
563 |
+
text_c: cliff xsum + DistillBERT from raw_text_b + ***DistillBERT from extractive summ text_b***
|
564 |
+
'''
|
565 |
+
output = []
|
566 |
+
|
567 |
+
gold_summary = [example[DATASET_CONFIG['xsum']['text_b']]
|
568 |
+
for example in self.datasets['xsum']]
|
569 |
+
ext_summarizer = ExtractiveSummarizationGenerator()
|
570 |
+
extracted_summ = ext_summarizer.generate(
|
571 |
+
[example[DATASET_CONFIG['xsum']['text_a']] for example in self.datasets['xsum']])
|
572 |
+
|
573 |
+
mlm_hallucinator = MLMGeneratorWithPairedData(
|
574 |
+
corpra=gold_summary, device='cuda:0', batch_size=64, mask_percent=0.25)
|
575 |
+
gold_summary_hallucinated = mlm_hallucinator.generate()
|
576 |
+
|
577 |
+
mlm_hallucinator = MLMGeneratorWithPairedData(
|
578 |
+
corpra=extracted_summ, device='cuda:0', batch_size=64, mask_percent=0.25)
|
579 |
+
extracted_summ_hallucinated = mlm_hallucinator.generate()
|
580 |
+
|
581 |
+
assert len(self.datasets['xsum']) == len(gold_summary_hallucinated) and len(
|
582 |
+
self.datasets['xsum']) == len(extracted_summ_hallucinated)
|
583 |
+
|
584 |
+
for i, example in tqdm(enumerate(self.datasets['xsum']), desc="Constructing xsum", total=len(self.datasets['xsum'])):
|
585 |
+
text_a = example[DATASET_CONFIG['xsum']['text_a']]
|
586 |
+
text_b = [gold_summary[i], extracted_summ[i]]
|
587 |
+
text_c = [gold_summary_hallucinated[i],
|
588 |
+
extracted_summ_hallucinated[i]]
|
589 |
+
label = -1
|
590 |
+
|
591 |
+
output.append({
|
592 |
+
'text_a': text_a,
|
593 |
+
'text_b': text_b,
|
594 |
+
'text_c': text_c,
|
595 |
+
'label': label
|
596 |
+
})
|
597 |
+
|
598 |
+
return output
|
599 |
+
|
600 |
+
def process_cnndm(self):
|
601 |
+
'''
|
602 |
+
text_a: raw_text
|
603 |
+
text_b: raw_summary + ***extractive summ*** removed
|
604 |
+
text_c: DistillBERT from raw_text_b + ***DistillBERT from extractive summ text_b***
|
605 |
+
'''
|
606 |
+
# interpretation of fairseq-generate output: https://github.com/facebookresearch/fairseq/issues/3000
|
607 |
+
output = []
|
608 |
+
|
609 |
+
gold_summary = [example[DATASET_CONFIG['cnndm']['text_b']]
|
610 |
+
for example in self.datasets['cnndm']]
|
611 |
+
ext_summarizer = ExtractiveSummarizationGenerator()
|
612 |
+
extracted_summ = ext_summarizer.generate(
|
613 |
+
[example[DATASET_CONFIG['cnndm']['text_a']] for example in self.datasets['cnndm']])
|
614 |
+
|
615 |
+
mlm_hallucinator = MLMGeneratorWithPairedData(
|
616 |
+
corpra=gold_summary, device='cuda:0', batch_size=64, mask_percent=0.25)
|
617 |
+
gold_summary_hallucinated = mlm_hallucinator.generate()
|
618 |
+
|
619 |
+
mlm_hallucinator = MLMGeneratorWithPairedData(
|
620 |
+
corpra=extracted_summ, device='cuda:0', batch_size=64, mask_percent=0.25)
|
621 |
+
extracted_summ_hallucinated = mlm_hallucinator.generate()
|
622 |
+
|
623 |
+
assert len(self.datasets['cnndm']) == len(gold_summary_hallucinated) and len(
|
624 |
+
self.datasets['cnndm']) == len(extracted_summ_hallucinated)
|
625 |
+
|
626 |
+
for i, example in tqdm(enumerate(self.datasets['cnndm']), desc="Constructing cnndm", total=len(self.datasets['cnndm'])):
|
627 |
+
text_a = example[DATASET_CONFIG['cnndm']['text_a']]
|
628 |
+
text_b = [gold_summary[i], extracted_summ[i]]
|
629 |
+
text_c = [gold_summary_hallucinated[i],
|
630 |
+
extracted_summ_hallucinated[i]]
|
631 |
+
label = -1
|
632 |
+
|
633 |
+
output.append({
|
634 |
+
'text_a': text_a,
|
635 |
+
'text_b': text_b,
|
636 |
+
'text_c': text_c,
|
637 |
+
'label': label
|
638 |
+
})
|
639 |
+
|
640 |
+
return output
|
641 |
+
|
642 |
+
def process_wikihow(self):
|
643 |
+
'''
|
644 |
+
text_a: raw_text
|
645 |
+
text_b: raw_summary + ***extractive summ*** removed
|
646 |
+
text_c: DistillBERT from raw_text_b + ***DistillBERT from extractive summ text_b***
|
647 |
+
'''
|
648 |
+
# interpretation of fairseq-generate output: https://github.com/facebookresearch/fairseq/issues/3000
|
649 |
+
output = []
|
650 |
+
|
651 |
+
gold_summary = [example[DATASET_CONFIG['wikihow']['text_b']]
|
652 |
+
for example in self.datasets['wikihow']]
|
653 |
+
ext_summarizer = ExtractiveSummarizationGenerator()
|
654 |
+
extracted_summ = ext_summarizer.generate(
|
655 |
+
[example[DATASET_CONFIG['wikihow']['text_a']] for example in self.datasets['wikihow']])
|
656 |
+
|
657 |
+
mlm_hallucinator = MLMGeneratorWithPairedData(
|
658 |
+
corpra=gold_summary, device='cuda:0', batch_size=64, mask_percent=0.25)
|
659 |
+
gold_summary_hallucinated = mlm_hallucinator.generate()
|
660 |
+
|
661 |
+
mlm_hallucinator = MLMGeneratorWithPairedData(
|
662 |
+
corpra=extracted_summ, device='cuda:0', batch_size=64, mask_percent=0.25)
|
663 |
+
extracted_summ_hallucinated = mlm_hallucinator.generate()
|
664 |
+
|
665 |
+
assert len(self.datasets['wikihow']) == len(gold_summary_hallucinated) and len(
|
666 |
+
self.datasets['wikihow']) == len(extracted_summ_hallucinated)
|
667 |
+
|
668 |
+
for i, example in tqdm(enumerate(self.datasets['wikihow']), desc="Constructing wikihow", total=len(self.datasets['wikihow'])):
|
669 |
+
text_a = example[DATASET_CONFIG['wikihow']['text_a']]
|
670 |
+
text_b = [gold_summary[i], extracted_summ[i]]
|
671 |
+
text_c = [gold_summary_hallucinated[i],
|
672 |
+
extracted_summ_hallucinated[i]]
|
673 |
+
label = -1
|
674 |
+
|
675 |
+
output.append({
|
676 |
+
'text_a': text_a,
|
677 |
+
'text_b': text_b,
|
678 |
+
'text_c': text_c,
|
679 |
+
'label': label
|
680 |
+
})
|
681 |
+
|
682 |
+
return output
|
683 |
+
|
684 |
+
def process_wiki103(self):
|
685 |
+
output = []
|
686 |
+
|
687 |
+
paraphrases = [example[DATASET_CONFIG['wiki103']['text_b']]
|
688 |
+
for example in self.datasets['wiki103']]
|
689 |
+
mlm_hallucinator = MLMGeneratorWithPairedData(
|
690 |
+
corpra=paraphrases, device='cuda:3', batch_size=64, mask_percent=0.25)
|
691 |
+
paraphrase_hallucinated = mlm_hallucinator.generate()
|
692 |
+
|
693 |
+
assert len(self.datasets['wiki103']) == len(paraphrase_hallucinated)
|
694 |
+
|
695 |
+
for i, example in tqdm(enumerate(self.datasets['wiki103']), desc=f'Constructing wiki103'):
|
696 |
+
output.append({
|
697 |
+
'text_a': example[DATASET_CONFIG['wiki103']['text_a']],
|
698 |
+
'text_b': [example[DATASET_CONFIG['wiki103']['text_b']]],
|
699 |
+
'text_c': [],
|
700 |
+
'label': 1
|
701 |
+
})
|
702 |
+
output.append({
|
703 |
+
'text_a': example[DATASET_CONFIG['wiki103']['text_a']],
|
704 |
+
'text_b': [paraphrase_hallucinated[i]],
|
705 |
+
'text_c': [],
|
706 |
+
'label': 0
|
707 |
+
})
|
708 |
+
|
709 |
+
return output
|
710 |
+
|
711 |
+
def process_mnli(self):
|
712 |
+
output = []
|
713 |
+
for example in tqdm(self.datasets['mnli'], desc=f'Constructing mnli'):
|
714 |
+
text_a = example[DATASET_CONFIG['mnli']['text_a']]
|
715 |
+
text_b = [example[DATASET_CONFIG['mnli']['text_b']]]
|
716 |
+
text_c = []
|
717 |
+
label = example[DATASET_CONFIG['mnli']['label']]
|
718 |
+
|
719 |
+
output.append({
|
720 |
+
'text_a': text_a,
|
721 |
+
'text_b': text_b,
|
722 |
+
'text_c': text_c,
|
723 |
+
'label': label
|
724 |
+
})
|
725 |
+
|
726 |
+
return output
|
727 |
+
|
728 |
+
def process_nli_fever(self):
|
729 |
+
output = []
|
730 |
+
for example in tqdm(self.datasets['nli_fever'], desc=f'Constructing nli_fever'):
|
731 |
+
text_a = example[DATASET_CONFIG['nli_fever']['text_a']]
|
732 |
+
text_b = [example[DATASET_CONFIG['nli_fever']['text_b']]]
|
733 |
+
text_c = []
|
734 |
+
raw_label = example[DATASET_CONFIG['nli_fever']['label']]
|
735 |
+
if raw_label == 'SUPPORTS': # convert to nli style label
|
736 |
+
label = 0
|
737 |
+
elif raw_label == 'REFUTES':
|
738 |
+
label = 2
|
739 |
+
else:
|
740 |
+
label = 1
|
741 |
+
|
742 |
+
output.append({
|
743 |
+
'text_a': text_a,
|
744 |
+
'text_b': text_b,
|
745 |
+
'text_c': text_c,
|
746 |
+
'label': label
|
747 |
+
})
|
748 |
+
|
749 |
+
return output
|
750 |
+
|
751 |
+
def process_doc_nli(self):
|
752 |
+
output = []
|
753 |
+
for example in tqdm(self.datasets['doc_nli'], desc=f'Constructing doc_nli'):
|
754 |
+
text_a = example[DATASET_CONFIG['doc_nli']['text_a']]
|
755 |
+
text_b = [example[DATASET_CONFIG['doc_nli']['text_b']]]
|
756 |
+
text_c = []
|
757 |
+
raw_label = example[DATASET_CONFIG['doc_nli']['label']]
|
758 |
+
if raw_label == 'entailment': # convert to paraphrase style label
|
759 |
+
label = 1
|
760 |
+
else:
|
761 |
+
label = 0
|
762 |
+
|
763 |
+
output.append({
|
764 |
+
'text_a': text_a,
|
765 |
+
'text_b': text_b,
|
766 |
+
'text_c': text_c,
|
767 |
+
'label': label
|
768 |
+
})
|
769 |
+
|
770 |
+
return output
|
771 |
+
|
772 |
+
def process_anli_r1(self):
|
773 |
+
output = []
|
774 |
+
for example in tqdm(self.datasets['anli_r1'], desc=f'Constructing anli_r1'):
|
775 |
+
text_a = example[DATASET_CONFIG['anli_r1']['text_a']]
|
776 |
+
text_b = [example[DATASET_CONFIG['anli_r1']['text_b']]]
|
777 |
+
text_c = []
|
778 |
+
label = example[DATASET_CONFIG['anli_r1']['label']]
|
779 |
+
|
780 |
+
output.append({
|
781 |
+
'text_a': text_a,
|
782 |
+
'text_b': text_b,
|
783 |
+
'text_c': text_c,
|
784 |
+
'label': label
|
785 |
+
})
|
786 |
+
|
787 |
+
return output
|
788 |
+
|
789 |
+
def process_anli_r2(self):
|
790 |
+
output = []
|
791 |
+
for example in tqdm(self.datasets['anli_r2'], desc=f'Constructing anli_r2'):
|
792 |
+
text_a = example[DATASET_CONFIG['anli_r2']['text_a']]
|
793 |
+
text_b = [example[DATASET_CONFIG['anli_r2']['text_b']]]
|
794 |
+
text_c = []
|
795 |
+
label = example[DATASET_CONFIG['anli_r2']['label']]
|
796 |
+
|
797 |
+
output.append({
|
798 |
+
'text_a': text_a,
|
799 |
+
'text_b': text_b,
|
800 |
+
'text_c': text_c,
|
801 |
+
'label': label
|
802 |
+
})
|
803 |
+
|
804 |
+
return output
|
805 |
+
|
806 |
+
def process_anli_r3(self):
|
807 |
+
output = []
|
808 |
+
for example in tqdm(self.datasets['anli_r3'], desc=f'Constructing anli_r3'):
|
809 |
+
text_a = example[DATASET_CONFIG['anli_r3']['text_a']]
|
810 |
+
text_b = [example[DATASET_CONFIG['anli_r3']['text_b']]]
|
811 |
+
text_c = []
|
812 |
+
label = example[DATASET_CONFIG['anli_r3']['label']]
|
813 |
+
|
814 |
+
output.append({
|
815 |
+
'text_a': text_a,
|
816 |
+
'text_b': text_b,
|
817 |
+
'text_c': text_c,
|
818 |
+
'label': label
|
819 |
+
})
|
820 |
+
|
821 |
+
return output
|
822 |
+
|
823 |
+
def process_snli(self):
|
824 |
+
output = []
|
825 |
+
for example in tqdm(self.datasets['snli'], desc=f'Constructing snli'):
|
826 |
+
text_a = example[DATASET_CONFIG['snli']['text_a']]
|
827 |
+
text_b = [example[DATASET_CONFIG['snli']['text_b']]]
|
828 |
+
text_c = []
|
829 |
+
label = example[DATASET_CONFIG['snli']['label']]
|
830 |
+
|
831 |
+
output.append({
|
832 |
+
'text_a': text_a,
|
833 |
+
'text_b': text_b,
|
834 |
+
'text_c': text_c,
|
835 |
+
'label': label
|
836 |
+
})
|
837 |
+
|
838 |
+
return output
|
839 |
+
|
840 |
+
def process_paws(self):
|
841 |
+
output = []
|
842 |
+
for example in tqdm(self.datasets['paws'], desc=f'Constructing paws'):
|
843 |
+
text_a = example[DATASET_CONFIG['paws']['text_a']]
|
844 |
+
text_b = [example[DATASET_CONFIG['paws']['text_b']]]
|
845 |
+
text_c = []
|
846 |
+
label = example[DATASET_CONFIG['paws']['label']]
|
847 |
+
|
848 |
+
output.append({
|
849 |
+
'text_a': text_a,
|
850 |
+
'text_b': text_b,
|
851 |
+
'text_c': text_c,
|
852 |
+
'label': label
|
853 |
+
})
|
854 |
+
|
855 |
+
return output
|
856 |
+
|
857 |
+
def process_vitaminc(self):
|
858 |
+
output = []
|
859 |
+
for example in tqdm(self.datasets['vitaminc'], desc=f'Constructing vitaminc'):
|
860 |
+
text_a = example[DATASET_CONFIG['vitaminc']['text_a']]
|
861 |
+
text_b = [example[DATASET_CONFIG['vitaminc']['text_b']]]
|
862 |
+
text_c = []
|
863 |
+
raw_label = example[DATASET_CONFIG['vitaminc']['label']]
|
864 |
+
if raw_label == 'SUPPORTS': # convert to nli style label
|
865 |
+
label = 0
|
866 |
+
elif raw_label == 'REFUTES':
|
867 |
+
label = 2
|
868 |
+
else:
|
869 |
+
label = 1
|
870 |
+
|
871 |
+
output.append({
|
872 |
+
'text_a': text_a,
|
873 |
+
'text_b': text_b,
|
874 |
+
'text_c': text_c,
|
875 |
+
'label': label
|
876 |
+
})
|
877 |
+
|
878 |
+
return output
|
879 |
+
|
880 |
+
def process_stsb(self):
|
881 |
+
output = []
|
882 |
+
for example in tqdm(self.datasets['stsb'], desc=f'Constructing stsb'):
|
883 |
+
text_a = example[DATASET_CONFIG['stsb']['text_a']]
|
884 |
+
text_b = [example[DATASET_CONFIG['stsb']['text_b']]]
|
885 |
+
text_c = []
|
886 |
+
label = example[DATASET_CONFIG['stsb']['label']] / 5.0
|
887 |
+
|
888 |
+
output.append({
|
889 |
+
'text_a': text_a,
|
890 |
+
'text_b': text_b,
|
891 |
+
'text_c': text_c,
|
892 |
+
'label': label
|
893 |
+
})
|
894 |
+
|
895 |
+
return output
|
896 |
+
|
897 |
+
def process_sick(self):
|
898 |
+
output = []
|
899 |
+
for example in tqdm(self.datasets['sick'], desc=f'Constructing sick'):
|
900 |
+
text_a = example[DATASET_CONFIG['sick']['text_a']]
|
901 |
+
text_b = [example[DATASET_CONFIG['sick']['text_b']]]
|
902 |
+
text_c = []
|
903 |
+
label = example[DATASET_CONFIG['sick']['label']] / 5.0
|
904 |
+
|
905 |
+
output.append({
|
906 |
+
'text_a': text_a,
|
907 |
+
'text_b': text_b,
|
908 |
+
'text_c': text_c,
|
909 |
+
'label': label
|
910 |
+
})
|
911 |
+
|
912 |
+
return output
|
913 |
+
|
914 |
+
def process_mrpc(self):
|
915 |
+
output = []
|
916 |
+
for example in tqdm(self.datasets['mrpc'], desc=f'Constructing mrpc'):
|
917 |
+
text_a = example[DATASET_CONFIG['mrpc']['text_a']]
|
918 |
+
text_b = [example[DATASET_CONFIG['mrpc']['text_b']]]
|
919 |
+
text_c = []
|
920 |
+
label = example[DATASET_CONFIG['mrpc']['label']]
|
921 |
+
|
922 |
+
output.append({
|
923 |
+
'text_a': text_a,
|
924 |
+
'text_b': text_b,
|
925 |
+
'text_c': text_c,
|
926 |
+
'label': label
|
927 |
+
})
|
928 |
+
|
929 |
+
return output
|
930 |
+
|
931 |
+
def process_mrpc_val(self):
|
932 |
+
output = []
|
933 |
+
for example in tqdm(self.datasets['mrpc_val'], desc=f'Constructing mrpc_val'):
|
934 |
+
text_a = example[DATASET_CONFIG['mrpc_val']['text_a']]
|
935 |
+
text_b = [example[DATASET_CONFIG['mrpc_val']['text_b']]]
|
936 |
+
text_c = []
|
937 |
+
label = example[DATASET_CONFIG['mrpc_val']['label']]
|
938 |
+
|
939 |
+
output.append({
|
940 |
+
'text_a': text_a,
|
941 |
+
'text_b': text_b,
|
942 |
+
'text_c': text_c,
|
943 |
+
'label': label
|
944 |
+
})
|
945 |
+
|
946 |
+
return output
|
947 |
+
|
948 |
+
def process_paws_val(self):
|
949 |
+
output = []
|
950 |
+
for example in tqdm(self.datasets['paws_val'], desc=f'Constructing paws_val'):
|
951 |
+
text_a = example[DATASET_CONFIG['paws_val']['text_a']]
|
952 |
+
text_b = [example[DATASET_CONFIG['paws_val']['text_b']]]
|
953 |
+
text_c = []
|
954 |
+
label = example[DATASET_CONFIG['paws_val']['label']]
|
955 |
+
|
956 |
+
output.append({
|
957 |
+
'text_a': text_a,
|
958 |
+
'text_b': text_b,
|
959 |
+
'text_c': text_c,
|
960 |
+
'label': label
|
961 |
+
})
|
962 |
+
|
963 |
+
return output
|
964 |
+
|
965 |
+
def process_paws_unlabeled(self):
|
966 |
+
output = []
|
967 |
+
for example in tqdm(self.datasets['paws_unlabeled'], desc=f'Constructing paws_unlabeled'):
|
968 |
+
text_a = example[DATASET_CONFIG['paws_unlabeled']['text_a']]
|
969 |
+
text_b = [example[DATASET_CONFIG['paws_unlabeled']['text_b']]]
|
970 |
+
text_c = []
|
971 |
+
label = example[DATASET_CONFIG['paws_unlabeled']['label']]
|
972 |
+
|
973 |
+
output.append({
|
974 |
+
'text_a': text_a,
|
975 |
+
'text_b': text_b,
|
976 |
+
'text_c': text_c,
|
977 |
+
'label': label
|
978 |
+
})
|
979 |
+
|
980 |
+
return output
|
981 |
+
|
982 |
+
def process_qqp(self):
|
983 |
+
output = []
|
984 |
+
for example in tqdm(self.datasets['qqp'], desc=f'Constructing qqp'):
|
985 |
+
text_a = example[DATASET_CONFIG['qqp']['text_a']]
|
986 |
+
text_b = [example[DATASET_CONFIG['qqp']['text_b']]]
|
987 |
+
text_c = []
|
988 |
+
label = example[DATASET_CONFIG['qqp']['label']]
|
989 |
+
|
990 |
+
output.append({
|
991 |
+
'text_a': text_a,
|
992 |
+
'text_b': text_b,
|
993 |
+
'text_c': text_c,
|
994 |
+
'label': label
|
995 |
+
})
|
996 |
+
|
997 |
+
return output
|
998 |
+
|
999 |
+
def process_qqp_val(self):
|
1000 |
+
output = []
|
1001 |
+
for example in tqdm(self.datasets['qqp_val'], desc=f'Constructing qqp_val'):
|
1002 |
+
text_a = example[DATASET_CONFIG['qqp_val']['text_a']]
|
1003 |
+
text_b = [example[DATASET_CONFIG['qqp_val']['text_b']]]
|
1004 |
+
text_c = []
|
1005 |
+
label = example[DATASET_CONFIG['qqp_val']['label']]
|
1006 |
+
|
1007 |
+
output.append({
|
1008 |
+
'text_a': text_a,
|
1009 |
+
'text_b': text_b,
|
1010 |
+
'text_c': text_c,
|
1011 |
+
'label': label
|
1012 |
+
})
|
1013 |
+
|
1014 |
+
return output
|
1015 |
+
|
1016 |
+
def process_msmarco(self):
|
1017 |
+
qa2d_generator = QA2D(batch_size=32, device='cuda')
|
1018 |
+
output = []
|
1019 |
+
correct_contexts = []
|
1020 |
+
correct_questions = []
|
1021 |
+
correct_answers = []
|
1022 |
+
|
1023 |
+
wrong_contexts = []
|
1024 |
+
wrong_questions = []
|
1025 |
+
wrong_answers = []
|
1026 |
+
|
1027 |
+
filtered_examples = []
|
1028 |
+
questions = []
|
1029 |
+
answers = []
|
1030 |
+
declaratives = []
|
1031 |
+
|
1032 |
+
for example in tqdm(self.datasets['msmarco'], desc=f'Collecting msmarco'):
|
1033 |
+
if sum(example['passages']['is_selected']) > 0: # has answer
|
1034 |
+
questions.append(example['query'])
|
1035 |
+
answers.append(example['answers'][0] if len(
|
1036 |
+
example['wellFormedAnswers']) == 0 else example['wellFormedAnswers'][0])
|
1037 |
+
filtered_examples.append(example)
|
1038 |
+
|
1039 |
+
for example in filtered_examples:
|
1040 |
+
for i, is_selected in enumerate(example['passages']['is_selected']):
|
1041 |
+
if is_selected == 1:
|
1042 |
+
output.append({
|
1043 |
+
'text_a': example['passages']['passage_text'][i],
|
1044 |
+
'text_b': [example['query']],
|
1045 |
+
'text_c': [],
|
1046 |
+
'label': 1
|
1047 |
+
}
|
1048 |
+
)
|
1049 |
+
else:
|
1050 |
+
output.append({
|
1051 |
+
'text_a': example['passages']['passage_text'][i],
|
1052 |
+
'text_b': [example['query']],
|
1053 |
+
'text_c': [],
|
1054 |
+
'label': 0
|
1055 |
+
}
|
1056 |
+
)
|
1057 |
+
return output
|
1058 |
+
|
1059 |
+
def process_paws_qqp(self):
|
1060 |
+
output = []
|
1061 |
+
|
1062 |
+
for i in range(len(self.datasets['paws_qqp'])):
|
1063 |
+
text_a = self.datasets['paws_qqp'].iloc[i]['sentence1'][2:-1]
|
1064 |
+
text_b = [self.datasets['paws_qqp'].iloc[i]['sentence2'][2:-1]]
|
1065 |
+
text_c = []
|
1066 |
+
label = self.datasets['paws_qqp'].iloc[i]['label']
|
1067 |
+
|
1068 |
+
output.append({
|
1069 |
+
'text_a': text_a,
|
1070 |
+
'text_b': text_b,
|
1071 |
+
'text_c': text_c,
|
1072 |
+
'label': int(label)
|
1073 |
+
})
|
1074 |
+
|
1075 |
+
return output
|
1076 |
+
|
1077 |
+
def process_wmt15(self):
|
1078 |
+
output = []
|
1079 |
+
|
1080 |
+
for example in self.datasets['wmt15']:
|
1081 |
+
text_a = example['reference']
|
1082 |
+
text_b = [example['candidate']]
|
1083 |
+
text_c = []
|
1084 |
+
label = example['score']
|
1085 |
+
|
1086 |
+
output.append({
|
1087 |
+
'text_a': text_a,
|
1088 |
+
'text_b': text_b,
|
1089 |
+
'text_c': text_c,
|
1090 |
+
'label': label
|
1091 |
+
})
|
1092 |
+
|
1093 |
+
return output
|
1094 |
+
|
1095 |
+
def process_wmt16(self):
|
1096 |
+
output = []
|
1097 |
+
|
1098 |
+
for example in self.datasets['wmt16']:
|
1099 |
+
text_a = example['reference']
|
1100 |
+
text_b = [example['candidate']]
|
1101 |
+
text_c = []
|
1102 |
+
label = example['score']
|
1103 |
+
|
1104 |
+
output.append({
|
1105 |
+
'text_a': text_a,
|
1106 |
+
'text_b': text_b,
|
1107 |
+
'text_c': text_c,
|
1108 |
+
'label': label
|
1109 |
+
})
|
1110 |
+
|
1111 |
+
return output
|
1112 |
+
|
1113 |
+
def process_wmt17(self):
|
1114 |
+
|
1115 |
+
output = []
|
1116 |
+
|
1117 |
+
for example in self.datasets['wmt17']:
|
1118 |
+
text_a = example['reference']
|
1119 |
+
text_b = [example['candidate']]
|
1120 |
+
text_c = []
|
1121 |
+
label = example['score']
|
1122 |
+
|
1123 |
+
output.append({
|
1124 |
+
'text_a': text_a,
|
1125 |
+
'text_b': text_b,
|
1126 |
+
'text_c': text_c,
|
1127 |
+
'label': label
|
1128 |
+
})
|
1129 |
+
|
1130 |
+
return output
|
1131 |
+
|
1132 |
+
def process_wmt18(self):
|
1133 |
+
output = []
|
1134 |
+
|
1135 |
+
for example in self.datasets['wmt18']:
|
1136 |
+
text_a = example['reference']
|
1137 |
+
text_b = [example['candidate']]
|
1138 |
+
text_c = []
|
1139 |
+
label = example['score']
|
1140 |
+
|
1141 |
+
output.append({
|
1142 |
+
'text_a': text_a,
|
1143 |
+
'text_b': text_b,
|
1144 |
+
'text_c': text_c,
|
1145 |
+
'label': label
|
1146 |
+
})
|
1147 |
+
|
1148 |
+
return output
|
1149 |
+
|
1150 |
+
def process_wmt19(self):
|
1151 |
+
output = []
|
1152 |
+
|
1153 |
+
for example in self.datasets['wmt19']:
|
1154 |
+
text_a = example['reference']
|
1155 |
+
text_b = [example['candidate']]
|
1156 |
+
text_c = []
|
1157 |
+
label = example['score']
|
1158 |
+
|
1159 |
+
output.append({
|
1160 |
+
'text_a': text_a,
|
1161 |
+
'text_b': text_b,
|
1162 |
+
'text_c': text_c,
|
1163 |
+
'label': label
|
1164 |
+
})
|
1165 |
+
|
1166 |
+
return output
|
1167 |
+
|
1168 |
+
def process_boolq(self):
|
1169 |
+
output = []
|
1170 |
+
|
1171 |
+
for example in self.datasets['boolq']:
|
1172 |
+
text_a = example['passage']
|
1173 |
+
text_b = [example['question']]
|
1174 |
+
text_c = ["Yes." if example['answer'] else "No."]
|
1175 |
+
label = 1
|
1176 |
+
|
1177 |
+
output.append({
|
1178 |
+
'text_a': text_a,
|
1179 |
+
'text_b': text_b,
|
1180 |
+
'text_c': text_c,
|
1181 |
+
'label': label
|
1182 |
+
})
|
1183 |
+
|
1184 |
+
text_a = example['passage']
|
1185 |
+
text_b = [example['question']]
|
1186 |
+
text_c = ["Yes." if not example['answer'] else "No."]
|
1187 |
+
label = 0
|
1188 |
+
|
1189 |
+
output.append({
|
1190 |
+
'text_a': text_a,
|
1191 |
+
'text_b': text_b,
|
1192 |
+
'text_c': text_c,
|
1193 |
+
'label': label
|
1194 |
+
})
|
1195 |
+
|
1196 |
+
return output
|
1197 |
+
|
1198 |
+
def process_eraser_multi_rc(self):
|
1199 |
+
output = []
|
1200 |
+
|
1201 |
+
for example in self.datasets['eraser_multi_rc']:
|
1202 |
+
text_a = example['passage']
|
1203 |
+
text_b = [example['query_and_answer'].replace("|", "")]
|
1204 |
+
text_c = []
|
1205 |
+
label = int(example['label'])
|
1206 |
+
|
1207 |
+
output.append({
|
1208 |
+
'text_a': text_a,
|
1209 |
+
'text_b': text_b,
|
1210 |
+
'text_c': text_c,
|
1211 |
+
'label': label
|
1212 |
+
})
|
1213 |
+
|
1214 |
+
return output
|
1215 |
+
|
1216 |
+
def process_quail(self):
|
1217 |
+
output = []
|
1218 |
+
|
1219 |
+
for example in self.datasets['quail']:
|
1220 |
+
for i, ans in enumerate(example['answers']):
|
1221 |
+
text_a = example['context']
|
1222 |
+
text_b = [example['question']]
|
1223 |
+
text_c = [ans]
|
1224 |
+
label = 1 if i == example['correct_answer_id'] else 0
|
1225 |
+
|
1226 |
+
output.append({
|
1227 |
+
'text_a': text_a,
|
1228 |
+
'text_b': text_b,
|
1229 |
+
'text_c': text_c,
|
1230 |
+
'label': label
|
1231 |
+
})
|
1232 |
+
|
1233 |
+
return output
|
1234 |
+
|
1235 |
+
def process_sciq(self):
|
1236 |
+
output = []
|
1237 |
+
|
1238 |
+
for example in self.datasets['sciq']:
|
1239 |
+
text_a = example['support']
|
1240 |
+
|
1241 |
+
output.append({
|
1242 |
+
'text_a': text_a,
|
1243 |
+
'text_b': [example['question']],
|
1244 |
+
'text_c': [example['distractor1']],
|
1245 |
+
'label': 0
|
1246 |
+
})
|
1247 |
+
output.append({
|
1248 |
+
'text_a': text_a,
|
1249 |
+
'text_b': [example['question']],
|
1250 |
+
'text_c': [example['distractor2']],
|
1251 |
+
'label': 0
|
1252 |
+
})
|
1253 |
+
output.append({
|
1254 |
+
'text_a': text_a,
|
1255 |
+
'text_b': [example['question']],
|
1256 |
+
'text_c': [example['distractor3']],
|
1257 |
+
'label': 0
|
1258 |
+
})
|
1259 |
+
output.append({
|
1260 |
+
'text_a': text_a,
|
1261 |
+
'text_b': [example['question']],
|
1262 |
+
'text_c': [example['correct_answer']],
|
1263 |
+
'label': 1
|
1264 |
+
})
|
1265 |
+
|
1266 |
+
return output
|
1267 |
+
|
1268 |
+
def process_strategy_qa(self):
|
1269 |
+
output = []
|
1270 |
+
|
1271 |
+
for example in self.datasets['strategy_qa']:
|
1272 |
+
text_a = ' '.join(example['facts'])
|
1273 |
+
text_b = [example['question']]
|
1274 |
+
text_c = ["Yes." if example['answer'] else "No."]
|
1275 |
+
label = 1
|
1276 |
+
|
1277 |
+
output.append({
|
1278 |
+
'text_a': text_a,
|
1279 |
+
'text_b': text_b,
|
1280 |
+
'text_c': text_c,
|
1281 |
+
'label': label
|
1282 |
+
})
|
1283 |
+
|
1284 |
+
text_a = ' '.join(example['facts'])
|
1285 |
+
text_b = [example['question']]
|
1286 |
+
text_c = ["Yes." if not example['answer'] else "No."]
|
1287 |
+
label = 0
|
1288 |
+
|
1289 |
+
output.append({
|
1290 |
+
'text_a': text_a,
|
1291 |
+
'text_b': text_b,
|
1292 |
+
'text_c': text_c,
|
1293 |
+
'label': label
|
1294 |
+
})
|
1295 |
+
|
1296 |
+
return output
|
1297 |
+
|
1298 |
+
def process_gap(self):
|
1299 |
+
output = []
|
1300 |
+
|
1301 |
+
for example in self.datasets['gap']:
|
1302 |
+
text_a = example['Text']
|
1303 |
+
text_b = [example['Text'][:example['Pronoun-offset']]+example['A']+example['Text'][(example['Pronoun-offset']+len(example['Pronoun'])):]]
|
1304 |
+
text_c = []
|
1305 |
+
label = 1 if example['A-coref'] else 0
|
1306 |
+
|
1307 |
+
output.append({
|
1308 |
+
'text_a': text_a,
|
1309 |
+
'text_b': text_b,
|
1310 |
+
'text_c': text_c,
|
1311 |
+
'label': label
|
1312 |
+
})
|
1313 |
+
|
1314 |
+
text_a = example['Text']
|
1315 |
+
text_b = [example['Text'][:example['Pronoun-offset']]+example['B']+example['Text'][(example['Pronoun-offset']+len(example['Pronoun'])):]]
|
1316 |
+
text_c = []
|
1317 |
+
label = 1 if example['B-coref'] else 0
|
1318 |
+
|
1319 |
+
output.append({
|
1320 |
+
'text_a': text_a,
|
1321 |
+
'text_b': text_b,
|
1322 |
+
'text_c': text_c,
|
1323 |
+
'label': label
|
1324 |
+
})
|
1325 |
+
|
1326 |
+
return output
|
1327 |
+
|
1328 |
+
def init_qa_t5(self):
|
1329 |
+
from transformers import T5Tokenizer, T5ForConditionalGeneration
|
1330 |
+
if self.t5_qa is None:
|
1331 |
+
self.t5_tokenizer = T5Tokenizer.from_pretrained(
|
1332 |
+
"t5-base", model_max_length=800)
|
1333 |
+
self.t5_qa = T5ForConditionalGeneration.from_pretrained("t5-base")
|
1334 |
+
self.t5_qa.to('cuda:1')
|
1335 |
+
self.t5_qa.eval()
|
1336 |
+
|
1337 |
+
@staticmethod
|
1338 |
+
def mask_answer(context, answers):
|
1339 |
+
answers = sorted(answers, key=len, reverse=True)
|
1340 |
+
for answer in answers:
|
1341 |
+
pattern = f'(?<![\w\\-\u2013]){re.escape(answer)}(?![\w\\-\u2013])'
|
1342 |
+
context = re.sub(pattern, '', context, flags=re.IGNORECASE)
|
1343 |
+
return context
|
1344 |
+
|
1345 |
+
def generate_fake_answer(self, context, question, answers):
|
1346 |
+
self.init_qa_t5()
|
1347 |
+
|
1348 |
+
context_no_answer = self.mask_answer(context, answers)
|
1349 |
+
|
1350 |
+
input_ids = self.t5_tokenizer(
|
1351 |
+
f'question: {question} context: {context_no_answer}',
|
1352 |
+
return_tensors="pt",
|
1353 |
+
truncation='only_first'
|
1354 |
+
).input_ids.to(self.t5_qa.device)
|
1355 |
+
|
1356 |
+
outputs = self.t5_qa.generate(
|
1357 |
+
input_ids,
|
1358 |
+
max_new_tokens=40,
|
1359 |
+
remove_invalid_values=True
|
1360 |
+
)
|
1361 |
+
|
1362 |
+
return self.t5_tokenizer.decode(outputs[0], skip_special_tokens=True)
|
1363 |
+
|
1364 |
+
def negative_sample_qa(self, samples, negative_sample_no_ans_only=True):
|
1365 |
+
outputs = []
|
1366 |
+
for context, question, answers in samples:
|
1367 |
+
if answers:
|
1368 |
+
outputs.append({
|
1369 |
+
'text_a': context,
|
1370 |
+
'text_b': [question],
|
1371 |
+
'text_c': answers,
|
1372 |
+
'label': 1
|
1373 |
+
})
|
1374 |
+
if not answers or not negative_sample_no_ans_only:
|
1375 |
+
                fake_answer = self.generate_fake_answer(context, question, answers)
                outputs.append({
                    'text_a': context,
                    'text_b': [question],
                    'text_c': [fake_answer],
                    'label': 0
                })

        return outputs

    def process_squad_v2_new(self):
        samples = (
            (sample['context'], sample['question'], sample['answers']['text'])
            for sample in tqdm(self.datasets['squad_v2_new'], desc='squad_v2_new')
        )
        return self.negative_sample_qa(samples)

    def process_adversarial_qa(self):
        samples = (
            (sample['context'], sample['question'], sample['answers']['text'])
            for sample in tqdm(self.datasets['adversarial_qa'], desc='adversarial_qa')
        )
        return self.negative_sample_qa(samples, negative_sample_no_ans_only=False)

    def process_drop(self):
        samples = (
            (sample['passage'], sample['question'], sample['answers_spans']['spans'])
            for sample in tqdm(self.datasets['drop'], desc='drop')
        )
        return self.negative_sample_qa(samples, negative_sample_no_ans_only=False)

    def process_duorc_self(self):
        samples = (
            (sample['plot'], sample['question'], sample['answers'])
            for sample in tqdm(self.datasets['duorc_self'], desc='duorc_self')
        )
        return self.negative_sample_qa(samples, negative_sample_no_ans_only=False)

    def process_duorc_paraphrase(self):
        samples = (
            (sample['plot'], sample['question'], sample['answers'])
            for sample in tqdm(self.datasets['duorc_paraphrase'], desc='duorc_paraphrase')
        )
        return self.negative_sample_qa(samples, negative_sample_no_ans_only=False)

    def process_quoref(self):
        samples = (
            (sample['context'], sample['question'], sample['answers']['text'])
            for sample in tqdm(self.datasets['quoref'], desc='quoref')
        )
        return self.negative_sample_qa(samples, negative_sample_no_ans_only=False)

    @staticmethod
    def prepare_hotpot_qa_samples(dateset):
        for sample in dateset:
            question = sample['question']
            answer = sample['answer']
            supporting_docs = set(sample['supporting_facts']['title'])
            irrelevant_docs = []
            context_paragraphs = []
            for title, setences in zip(sample['context']['title'], sample['context']['sentences']):
                doc = ''.join(setences)
                if title in supporting_docs:
                    context_paragraphs.append(doc)
                else:
                    irrelevant_docs.append(doc)
            # Add some irrelevant documents
            if irrelevant_docs and len(context_paragraphs) < 4:
                context_paragraphs.append(random.choice(irrelevant_docs))
            random.shuffle(context_paragraphs)
            yield '\n'.join(context_paragraphs), question, [answer]

    def process_hotpot_qa_distractor(self):
        samples = self.prepare_hotpot_qa_samples(
            tqdm(self.datasets['hotpot_qa_distractor'], desc='hotpot_qa_distractor')
        )
        return self.negative_sample_qa(samples, negative_sample_no_ans_only=False)

    def process_hotpot_qa_fullwiki(self):
        samples = self.prepare_hotpot_qa_samples(
            tqdm(self.datasets['hotpot_qa_fullwiki'], desc='hotpot_qa_fullwiki')
        )
        return self.negative_sample_qa(samples, negative_sample_no_ans_only=False)

    def process_newsqa(self):
        def get_samples(dataset):
            for story in tqdm(dataset['data'], desc='newsqa'):
                if story['type'] != 'train':
                    continue
                context = story['text']
                for question in story['questions']:
                    if question.get('isQuestionBad', 0.) > 0.2:
                        continue
                    answers = []
                    if 's' in question['consensus']:
                        start = question['consensus']['s']
                        end = question['consensus']['e']
                        answers.append(context[start:end].strip())
                    yield context, question['q'], answers
        samples = get_samples(self.datasets['newsqa'])
        return self.negative_sample_qa(samples, negative_sample_no_ans_only=False)

    def process_ropes(self):
        samples = (
            (
                sample['situation'] + ' ' + sample['background'],
                sample['question'], sample['answers']['text']
            )
            for sample in tqdm(self.datasets['ropes'], desc='ropes')
        )
        return self.negative_sample_qa(samples, negative_sample_no_ans_only=False)

    def generate(self):
        for each_dataset in self.datasets:
            with open(f'./data/training/{each_dataset}.json', 'w', encoding='utf8') as outfile:
                outfile.write("")
        for each_dataset in self.datasets:
            outputs = eval(f'self.process_{each_dataset}()')

            for each_output in outputs:
                dict_write_to_file = {
                    'task': DATASET_CONFIG[each_dataset]['task'],
                    'text_a': each_output['text_a'],  # string
                    'text_b': each_output['text_b'],  # list of positive examples
                    'text_c': each_output['text_c'],  # list of negative examples
                    'orig_label': each_output['label']  # original label; if -1, only has positive pairs and negative pairs
                }
                with open(f'./data/training/{each_dataset}.json', 'a', encoding='utf8') as outfile:
                    json.dump(dict_write_to_file, outfile, ensure_ascii=False)
                    outfile.write('\n')


if __name__ == "__main__":
    random.seed(42)
    gen = DataGenerator(list(DATASET_CONFIG.keys()))
    gen.generate()
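For reference, each line written by `generate()` above is a self-contained JSON object with `task`, `text_a`, `text_b`, `text_c`, and `orig_label` fields. A minimal sketch of reading one of these files back, assuming the default `./data/training/` output directory and that `squad_v2_new.json` has already been generated:

import json

# Sketch only: path and file name follow the defaults used in generate() above.
with open('./data/training/squad_v2_new.json', 'r', encoding='utf8') as f:
    for line in f:
        record = json.loads(line)
        # 'text_b' holds positive pairings for 'text_a'; 'text_c' holds negatives.
        print(record['task'], record['orig_label'], len(record['text_b']), len(record['text_c']))
        break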
alignscore/pyproject.toml
ADDED
@@ -0,0 +1,41 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "alignscore"
version = "0.1.3"
authors = [
    { name = "Yuheng Zha", email = "yzha@ucsd.edu" },
    { name = "Yichi Yang", email = "yiy067@ucsd.edu" },
    { name = "Ruichen Li", email = "rul014@ucsd.edu" },
    { name = "Zhiting Hu", email = "zhh019@ucsd.edu" },
]
description = "An automatic factual consistency evaluation metric based on a unified alignment function"
readme = "README.md"
requires-python = ">=3.8"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
]
dependencies = [
    "spacy>=3.4.0,<4",
    "nltk>=3.7,<4",
    "torch>=1.12.1,<2",
    "transformers>=4.20.1,<5",
    "tqdm>=4.64.0,<5",
    "jsonlines>=2.0.0,<3",
    "numpy>=1.23.1,<2",
    "datasets>=2.3.2,<3",
    "scikit-learn>=1.1.2,<2",
    "pytorch_lightning>=1.7.7,<2",
    "scipy>=1.8.1,<2",
    "tensorboard>=2.12.0,<3",
    "protobuf<=3.20"
]

[project.urls]
"Homepage" = "https://github.com/yuh-zha/AlignScore"
"Bug Tracker" = "https://github.com/yuh-zha/AlignScore/issues"
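Given this hatchling build configuration, the package can be installed from the repository root with `pip install .` (or `pip install -e .` for development); this is the standard PEP 517 flow rather than anything specific to this project.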
alignscore/requirements.txt
ADDED
@@ -0,0 +1,9 @@
alignscore>=0.1
ctc_score==0.1.3
BLEURT @ git+https://github.com/google-research/bleurt@cebe7e6f996b40910cfaa520a63db47807e3bf5c
bert_score==0.3.11
rake_nltk==1.0.6
summa==1.2.0
benepar==0.2.0
summac==0.0.3
tabulate>=0.9.0,<1
alignscore/src/alignscore/__init__.py
ADDED
@@ -0,0 +1 @@
from .alignscore import AlignScore
alignscore/src/alignscore/alignscore.py
ADDED
@@ -0,0 +1,16 @@
from .inference import Inferencer
from typing import List

class AlignScore:
    def __init__(self, model: str, batch_size: int, device: int, ckpt_path: str, evaluation_mode='nli_sp', verbose=True) -> None:
        self.model = Inferencer(
            ckpt_path=ckpt_path,
            model=model,
            batch_size=batch_size,
            device=device,
            verbose=verbose
        )
        self.model.nlg_eval_mode = evaluation_mode

    def score(self, contexts: List[str], claims: List[str]) -> List[float]:
        return self.model.nlg_eval(contexts, claims)[1].tolist()
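The `AlignScore` wrapper above is the public entry point: it builds an `Inferencer`, sets the evaluation mode, and exposes `score`. A minimal usage sketch, assuming a trained checkpoint on disk (the path below is a placeholder) and a CUDA device:

from alignscore import AlignScore

# Sketch only: 'ckpt_path' is a placeholder for a downloaded AlignScore checkpoint.
scorer = AlignScore(model='roberta-base', batch_size=32, device='cuda:0',
                    ckpt_path='/path/to/AlignScore-base.ckpt', evaluation_mode='nli_sp')
scores = scorer.score(
    contexts=['Children smiling and waving at camera.'],
    claims=['The kids are smiling at the camera.'])
print(scores)  # one float per (context, claim) pair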
alignscore/src/alignscore/dataloader.py
ADDED
@@ -0,0 +1,610 @@
import json
import logging
import random
from typing import Optional, Sized
import numpy as np

import torch
from pytorch_lightning import LightningDataModule
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import (
    AutoConfig,
    AutoTokenizer,
)
from torch.utils.data import Dataset, Sampler
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

class DSTDataSet(Dataset):
    def __init__(self, dataset, model_name='bert-base-uncased', need_mlm=True, tokenizer_max_length=512) -> None:
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.tokenizer_max_length = tokenizer_max_length
        self.config = AutoConfig.from_pretrained(model_name)
        self.dataset_type_dict = dict()

        self.dataset = dataset

        self.need_mlm = need_mlm

        self.dataset_type_dict_init()

    def dataset_type_dict_init(self):
        for i, example in enumerate(self.dataset):
            try:
                self.dataset_type_dict[example['task']].append(i)
            except:
                self.dataset_type_dict[example['task']] = [i]

    def random_word(self, tokens):
        """
        Masking some random tokens for Language Model task with probabilities as in the original BERT paper.
        :param tokens: list of str, tokenized sentence.
        :param tokenizer: Tokenizer, object used for tokenization (we need its vocab here)
        :return: (list of str, list of int), masked tokens and related labels for LM prediction
        """
        if not self.need_mlm:  # disable masked language modeling
            return tokens, [-100] * len(tokens)

        output_label = []

        for i, token in enumerate(tokens):
            if token == self.tokenizer.pad_token_id:
                output_label.append(-100)  # ignore PAD tokens
                continue
            prob = random.random()
            # mask token with 15% probability
            if prob < 0.15:
                prob /= 0.15

                # 80% randomly change token to mask token
                if prob < 0.8:
                    tokens[i] = self.tokenizer.mask_token_id

                # 10% randomly change token to random token
                elif prob < 0.9:
                    tokens[i] = random.choice(list(range(self.tokenizer.vocab_size)))

                # -> rest 10% randomly keep current token

                # append current token to output (we will predict these later)
                output_label.append(token)
            else:
                # no masking token (will be ignored by loss function later)
                output_label.append(-100)

        return tokens, output_label

    def process_nli(self, index):
        text_a = self.dataset[index]['text_a']
        text_b = self.dataset[index]['text_b'][0]
        tri_label = self.dataset[index]['orig_label'] if self.dataset[index]['orig_label'] != -1 else 1

        rand_self_align = random.random()
        if rand_self_align > 0.95:  ### random self alignment
            text_b = self.dataset[index]['text_a']
            tri_label = 0
        elif self.dataset[index]['orig_label'] == 2 and random.random() > 0.95:
            text_a = self.dataset[index]['text_b'][0]
            text_b = self.dataset[index]['text_a']

        try:
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation='only_first')
        except:
            logging.warning('text_b too long...')
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation=True)
        input_ids, mlm_labels = self.random_word(tokenized_pair['input_ids'])
        return (
            torch.tensor(input_ids),
            torch.tensor(tokenized_pair['attention_mask']),
            torch.tensor(tokenized_pair['token_type_ids']) if 'token_type_ids' in tokenized_pair.keys() else None,
            torch.tensor(-100),  # align label, 2 class
            torch.tensor(mlm_labels),  # mlm label
            torch.tensor(tri_label),  # tri label, 3 class
            torch.tensor(-100.0)  # reg label, float
        )

    def process_paraphrase(self, index):
        text_a = self.dataset[index]['text_a']
        text_b = self.dataset[index]['text_b'][0]
        label = self.dataset[index]['orig_label']

        rand_self_align = random.random()
        if rand_self_align > 0.95:  ### random self alignment
            text_b = self.dataset[index]['text_a']
            label = 1
        elif random.random() > 0.95:
            text_a = self.dataset[index]['text_b'][0]
            text_b = self.dataset[index]['text_a']

        try:
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation='only_first')
        except:
            logging.warning('text_b too long...')
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation=True)
        input_ids, mlm_labels = self.random_word(tokenized_pair['input_ids'])
        return (
            torch.tensor(input_ids),
            torch.tensor(tokenized_pair['attention_mask']),
            torch.tensor(tokenized_pair['token_type_ids']) if 'token_type_ids' in tokenized_pair.keys() else None,
            torch.tensor(label),  # align label, 2 class
            torch.tensor(mlm_labels),  # mlm label
            torch.tensor(-100),  # tri label, 3 class
            torch.tensor(-100.0)  # reg label, float
        )

    def process_qa(self, index):
        text_a = self.dataset[index]['text_a']
        if len(self.dataset[index]['text_c']) > 0:
            text_b = self.dataset[index]['text_b'][0] + ' ' + self.dataset[index]['text_c'][0]
        else:
            text_b = self.dataset[index]['text_b'][0]
        label = self.dataset[index]['orig_label']

        try:
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation='only_first')
        except:
            logging.warning('text_b too long...')
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation=True)
        input_ids, mlm_labels = self.random_word(tokenized_pair['input_ids'])
        return (
            torch.tensor(input_ids),
            torch.tensor(tokenized_pair['attention_mask']),
            torch.tensor(tokenized_pair['token_type_ids']) if 'token_type_ids' in tokenized_pair.keys() else None,
            torch.tensor(label),  # align label, 2 class
            torch.tensor(mlm_labels),  # mlm label
            torch.tensor(-100),  # tri label, 3 class
            torch.tensor(-100.0)  # reg label, float
        )

    def process_coreference(self, index):
        text_a = self.dataset[index]['text_a']
        if len(self.dataset[index]['text_c']) > 0:
            text_b = self.dataset[index]['text_b'][0] + ' ' + self.dataset[index]['text_c'][0]
        else:
            text_b = self.dataset[index]['text_b'][0]
        label = self.dataset[index]['orig_label']

        try:
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation='only_first')
        except:
            logging.warning('text_b too long...')
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation=True)
        input_ids, mlm_labels = self.random_word(tokenized_pair['input_ids'])
        return (
            torch.tensor(input_ids),
            torch.tensor(tokenized_pair['attention_mask']),
            torch.tensor(tokenized_pair['token_type_ids']) if 'token_type_ids' in tokenized_pair.keys() else None,
            torch.tensor(label),  # align label, 2 class
            torch.tensor(mlm_labels),  # mlm label
            torch.tensor(-100),  # tri label, 3 class
            torch.tensor(-100.0)  # reg label, float
        )

    def process_bin_nli(self, index):
        text_a = self.dataset[index]['text_a']
        text_b = self.dataset[index]['text_b'][0]
        label = self.dataset[index]['orig_label']

        try:
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation='only_first')
        except:
            logging.warning('text_b too long...')
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation=True)
        input_ids, mlm_labels = self.random_word(tokenized_pair['input_ids'])
        return (
            torch.tensor(input_ids),
            torch.tensor(tokenized_pair['attention_mask']),
            torch.tensor(tokenized_pair['token_type_ids']) if 'token_type_ids' in tokenized_pair.keys() else None,
            torch.tensor(label),  # align label, 2 class
            torch.tensor(mlm_labels),  # mlm label
            torch.tensor(-100),  # tri label, 3 class
            torch.tensor(-100.0)  # reg label, float
        )

    def process_fact_checking(self, index):
        text_a = self.dataset[index]['text_a']
        text_b = self.dataset[index]['text_b'][0]
        tri_label = self.dataset[index]['orig_label'] if self.dataset[index]['orig_label'] != -1 else 1

        rand_self_align = random.random()
        if rand_self_align > 0.95:  ### random self alignment
            text_b = self.dataset[index]['text_a']
            tri_label = 0
        elif self.dataset[index]['orig_label'] == 2 and random.random() > 0.95:
            text_a = self.dataset[index]['text_b'][0]
            text_b = self.dataset[index]['text_a']

        try:
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation='only_first')
        except:
            logging.warning('text_b too long...')
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation=True)
        input_ids, mlm_labels = self.random_word(tokenized_pair['input_ids'])
        return (
            torch.tensor(input_ids),
            torch.tensor(tokenized_pair['attention_mask']),
            torch.tensor(tokenized_pair['token_type_ids']) if 'token_type_ids' in tokenized_pair.keys() else None,
            torch.tensor(-100),  # align label, 2 class
            torch.tensor(mlm_labels),  # mlm label
            torch.tensor(tri_label),  # tri label, 3 class
            torch.tensor(-100.0)  # reg label, float
        )

    def process_summarization(self, index):
        text_a = self.dataset[index]['text_a']
        if random.random() > 0.5:  # this will be a positive pair
            random_pos_sample_id = random.randint(0, len(self.dataset[index]['text_b'])-1)
            text_b = self.dataset[index]['text_b'][random_pos_sample_id]
            label = 1
        else:  # this will be a negative pair
            label = 0
            if len(self.dataset[index]['text_c']) > 0:
                random_neg_sample_id = random.randint(0, len(self.dataset[index]['text_c'])-1)
                text_b = self.dataset[index]['text_c'][random_neg_sample_id]
            else:
                random_choose_from_entire_dataset_text_b = random.choice(self.dataset_type_dict['summarization'])
                text_b = self.dataset[random_choose_from_entire_dataset_text_b]['text_b'][0]

        try:
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation='only_first')
        except:
            logging.warning('text_b too long...')
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation=True)
        input_ids, mlm_labels = self.random_word(tokenized_pair['input_ids'])

        return (
            torch.tensor(input_ids),
            torch.tensor(tokenized_pair['attention_mask']),
            torch.tensor(tokenized_pair['token_type_ids']) if 'token_type_ids' in tokenized_pair.keys() else None,
            torch.tensor(label),  # align label, 2 class
            torch.tensor(mlm_labels),  # mlm label
            torch.tensor(-100),  # tri label, 3 class
            torch.tensor(-100.0)  # reg label, float
        )

    def process_multiple_choice_qa(self, index):
        text_a = self.dataset[index]['text_a']
        if random.random() > 0.5:  # this will be a positive pair
            text_b = self.dataset[index]['text_b'][0]
            label = 1
        else:  # this will be a negative pair
            label = 0
            if len(self.dataset[index]['text_c']) > 0:
                random_neg_sample_id = random.randint(0, len(self.dataset[index]['text_c'])-1)
                text_b = self.dataset[index]['text_c'][random_neg_sample_id]
            else:
                random_choose_from_entire_dataset_text_b = random.choice(self.dataset_type_dict['multiple_choice_qa'])
                text_b = self.dataset[random_choose_from_entire_dataset_text_b]['text_b'][0]

        try:
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation='only_first')
        except:
            logging.warning('text_b too long...')
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation=True)
        input_ids, mlm_labels = self.random_word(tokenized_pair['input_ids'])

        return (
            torch.tensor(input_ids),
            torch.tensor(tokenized_pair['attention_mask']),
            torch.tensor(tokenized_pair['token_type_ids']) if 'token_type_ids' in tokenized_pair.keys() else None,
            torch.tensor(label),  # align label, 2 class
            torch.tensor(mlm_labels),  # mlm label
            torch.tensor(-100),  # tri label, 3 class
            torch.tensor(-100.0)  # reg label, float
        )

    def process_extractive_qa(self, index):
        text_a = self.dataset[index]['text_a']
        if random.random() > 0.5:  # this will be a positive pair
            random_pos_sample_id = random.randint(0, len(self.dataset[index]['text_b'])-1)
            text_b = self.dataset[index]['text_b'][random_pos_sample_id]
            label = 1
        else:  # this will be a negative pair
            label = 0
            if len(self.dataset[index]['text_c']) > 0:
                random_neg_sample_id = random.randint(0, len(self.dataset[index]['text_c'])-1)
                text_b = self.dataset[index]['text_c'][random_neg_sample_id]
            else:
                random_choose_from_entire_dataset_text_b = random.choice(self.dataset_type_dict['extractive_qa'])
                text_b = self.dataset[random_choose_from_entire_dataset_text_b]['text_b'][0]

        try:
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation='only_first')
        except:
            logging.warning('text_b too long...')
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation=True)
        input_ids, mlm_labels = self.random_word(tokenized_pair['input_ids'])

        return (
            torch.tensor(input_ids),
            torch.tensor(tokenized_pair['attention_mask']),
            torch.tensor(tokenized_pair['token_type_ids']) if 'token_type_ids' in tokenized_pair.keys() else None,
            torch.tensor(label),  # align label, 2 class
            torch.tensor(mlm_labels),  # mlm label
            torch.tensor(-100),  # tri label, 3 class
            torch.tensor(-100.0)  # reg label, float
        )

    def process_ir(self, index):
        text_a = self.dataset[index]['text_a']
        text_b = self.dataset[index]['text_b'][random.randint(0, len(self.dataset[index]['text_b'])-1)]
        label = self.dataset[index]['orig_label']

        try:
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation='only_first')
        except:
            logging.warning('text_b too long...')
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation=True)
        input_ids, mlm_labels = self.random_word(tokenized_pair['input_ids'])

        return (
            torch.tensor(input_ids),
            torch.tensor(tokenized_pair['attention_mask']),
            torch.tensor(tokenized_pair['token_type_ids']) if 'token_type_ids' in tokenized_pair.keys() else None,
            torch.tensor(label),  # align label, 2 class
            torch.tensor(mlm_labels),  # mlm label
            torch.tensor(-100),  # tri label, 3 class
            torch.tensor(-100.0)  # reg label, float
        )

    def process_wmt(self, index):
        text_a = self.dataset[index]['text_a']
        text_b = self.dataset[index]['text_b'][0]
        reg_label = self.dataset[index]['orig_label']

        try:
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation='only_first')
        except:
            logging.warning('text_b too long...')
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation=True)
        input_ids, mlm_labels = self.random_word(tokenized_pair['input_ids'])

        return (
            torch.tensor(input_ids),
            torch.tensor(tokenized_pair['attention_mask']),
            torch.tensor(tokenized_pair['token_type_ids']) if 'token_type_ids' in tokenized_pair.keys() else None,
            torch.tensor(-100),  # align label, 2 class
            torch.tensor(mlm_labels),  # mlm label
            torch.tensor(-100),  # tri label, 3 class
            torch.tensor(reg_label)  # reg label, float
        )

    def process_sts(self, index):
        text_a = self.dataset[index]['text_a']
        text_b = self.dataset[index]['text_b'][0]
        reg_label = self.dataset[index]['orig_label']

        try:
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation='only_first')
        except:
            logging.warning('text_b too long...')
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation=True)
        input_ids, mlm_labels = self.random_word(tokenized_pair['input_ids'])

        return (
            torch.tensor(input_ids),
            torch.tensor(tokenized_pair['attention_mask']),
            torch.tensor(tokenized_pair['token_type_ids']) if 'token_type_ids' in tokenized_pair.keys() else None,
            torch.tensor(-100),  # align label, 2 class
            torch.tensor(mlm_labels),  # mlm label
            torch.tensor(-100),  # tri label, 3 class
            torch.tensor(reg_label)  # reg label, float
        )

    def process_ctc(self, index):
        text_a = self.dataset[index]['text_a']
        text_b = self.dataset[index]['text_b'][0]
        reg_label = self.dataset[index]['orig_label']

        try:
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation='only_first')
        except:
            logging.warning('text_b too long...')
            tokenized_pair = self.tokenizer(text_a, text_b, padding='max_length', max_length=self.tokenizer_max_length, truncation=True)
        input_ids, mlm_labels = self.random_word(tokenized_pair['input_ids'])

        return (
            torch.tensor(input_ids),
            torch.tensor(tokenized_pair['attention_mask']),
            torch.tensor(tokenized_pair['token_type_ids']) if 'token_type_ids' in tokenized_pair.keys() else None,
            torch.tensor(-100),  # align label, 2 class
            torch.tensor(mlm_labels),  # mlm label
            torch.tensor(-100),  # tri label, 3 class
            torch.tensor(reg_label)  # reg label, float
        )

    def __getitem__(self, index):
        if self.dataset[index]['task'] == 'nli':
            input_ids, attention_mask, token_type_ids, align_label, mlm_labels, tri_label, reg_label = self.process_nli(index)

        if self.dataset[index]['task'] == 'bin_nli':
            input_ids, attention_mask, token_type_ids, align_label, mlm_labels, tri_label, reg_label = self.process_bin_nli(index)

        if self.dataset[index]['task'] == 'paraphrase':
            input_ids, attention_mask, token_type_ids, align_label, mlm_labels, tri_label, reg_label = self.process_paraphrase(index)

        if self.dataset[index]['task'] == 'fact_checking':
            input_ids, attention_mask, token_type_ids, align_label, mlm_labels, tri_label, reg_label = self.process_fact_checking(index)

        if self.dataset[index]['task'] == 'summarization':
            input_ids, attention_mask, token_type_ids, align_label, mlm_labels, tri_label, reg_label = self.process_summarization(index)

        if self.dataset[index]['task'] == 'multiple_choice_qa':
            input_ids, attention_mask, token_type_ids, align_label, mlm_labels, tri_label, reg_label = self.process_multiple_choice_qa(index)

        if self.dataset[index]['task'] == 'extractive_qa':
            input_ids, attention_mask, token_type_ids, align_label, mlm_labels, tri_label, reg_label = self.process_extractive_qa(index)

        if self.dataset[index]['task'] == 'qa':
            input_ids, attention_mask, token_type_ids, align_label, mlm_labels, tri_label, reg_label = self.process_qa(index)

        if self.dataset[index]['task'] == 'coreference':
            input_ids, attention_mask, token_type_ids, align_label, mlm_labels, tri_label, reg_label = self.process_coreference(index)

        if self.dataset[index]['task'] == 'ir':
            input_ids, attention_mask, token_type_ids, align_label, mlm_labels, tri_label, reg_label = self.process_ir(index)

        if self.dataset[index]['task'] == 'sts':
            input_ids, attention_mask, token_type_ids, align_label, mlm_labels, tri_label, reg_label = self.process_sts(index)

        if self.dataset[index]['task'] == 'ctc':
            input_ids, attention_mask, token_type_ids, align_label, mlm_labels, tri_label, reg_label = self.process_ctc(index)

        if self.dataset[index]['task'] == 'wmt':
            input_ids, attention_mask, token_type_ids, align_label, mlm_labels, tri_label, reg_label = self.process_wmt(index)

        if token_type_ids is not None:
            return {
                'input_ids': input_ids,
                'attention_mask': attention_mask,
                'token_type_ids': token_type_ids,
                'align_label': align_label,
                'mlm_label': mlm_labels,
                'tri_label': tri_label,
                'reg_label': reg_label
            }
        else:
            return {
                'input_ids': input_ids,
                'attention_mask': attention_mask,
                'align_label': align_label,
                'mlm_label': mlm_labels,
                'tri_label': tri_label,
                'reg_label': reg_label
            }

    def __len__(self):
        return len(self.dataset)

class PropSampler(Sampler[int]):
    def __init__(self, data_source: Optional[Sized]) -> None:
        super().__init__(data_source)
        self.K = 500000
        print("Initializing Prop Sampler")

        self.data_positions = dict()
        for i, example in tqdm(enumerate(data_source), desc="Initializing Sampler"):
            if example['dataset_name'] in self.data_positions.keys():
                self.data_positions[example['dataset_name']].append(i)
            else:
                self.data_positions[example['dataset_name']] = [i]
        self.all_dataset_names = list(self.data_positions.keys())
        self.dataset_lengths = {each: len(self.data_positions[each]) for each in self.data_positions}

        self.dataset_props = {each: min(self.dataset_lengths[each], self.K) for each in self.dataset_lengths}
        self.dataset_props_sum = sum([self.dataset_props[each] for each in self.dataset_props])

        print("Finish Prop Sampler initialization.")

    def __iter__(self):
        iter_list = []
        for each in self.dataset_props:
            iter_list.extend(np.random.choice(self.data_positions[each], size=self.dataset_props[each], replace=False).tolist())

        random.shuffle(iter_list)

        yield from iter_list

    def __len__(self):
        return self.dataset_props_sum

class DSTDataLoader(LightningDataModule):
    def __init__(self, dataset_config, val_dataset_config=None, sample_mode='seq', model_name='bert-base-uncased', is_finetune=False, need_mlm=True, tokenizer_max_length=512, train_batch_size=32, eval_batch_size=4, num_workers=16, train_eval_split=0.95, **kwargs):
        super().__init__(**kwargs)
        assert sample_mode in ['seq', 'proportion']
        self.sample_mode = sample_mode
        self.dataset_config = dataset_config
        self.val_dataset_config = val_dataset_config
        self.num_workers = num_workers
        self.train_eval_split = train_eval_split
        self.tokenizer_max_length = tokenizer_max_length
        self.model_name = model_name

        self.need_mlm = need_mlm
        self.is_finetune = is_finetune

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.config = AutoConfig.from_pretrained(model_name)

        self.train_bach_size = train_batch_size
        self.eval_batch_size = eval_batch_size

        self.dataset = None

    def setup(self, stage: Optional[str] = None) -> None:
        if self.dataset is not None:
            print("Already Initilized LightningDataModule!")
            return

        self.init_training_set()

        self.dataset = dict()
        if not self.is_finetune:
            self.dataset['train'] = DSTDataSet(dataset=self.raw_dataset[:int(self.train_eval_split*len(self.raw_dataset))], model_name=self.model_name, need_mlm=self.need_mlm)
            self.dataset['test'] = DSTDataSet(dataset=self.raw_dataset[int(self.train_eval_split*len(self.raw_dataset)):], model_name=self.model_name, need_mlm=self.need_mlm)
        else:
            self.dataset['train'] = DSTDataSet(dataset=self.raw_dataset[:], model_name=self.model_name, need_mlm=self.need_mlm)
            self.dataset['test'] = DSTDataSet(dataset=self.val_raw_dataset[:], model_name=self.model_name, need_mlm=self.need_mlm)

    def init_training_set(self):
        self.raw_dataset = []
        if self.sample_mode == 'seq':
            for each_dataset in self.dataset_config:
                dataset_length = sum([1 for line in open(self.dataset_config[each_dataset]['data_path'], 'r', encoding='utf8')])
                dataset_length_limit = self.dataset_config[each_dataset]['size'] if isinstance(self.dataset_config[each_dataset]['size'], int) else int(self.dataset_config[each_dataset]['size'] * dataset_length)
                with open(self.dataset_config[each_dataset]['data_path'], 'r', encoding='utf8') as f:
                    try:
                        for i, example in enumerate(f):
                            if i >= dataset_length_limit:
                                break
                            self.raw_dataset.append(json.loads(example))  ## + dataset_name
                    except:
                        print(f"failed to load data from {each_dataset}.json, exiting...")
                        exit()

            random.shuffle(self.raw_dataset)

        elif self.sample_mode == 'proportion':
            for each_dataset in tqdm(self.dataset_config, desc="Loading data from disk..."):
                with open(self.dataset_config[each_dataset]['data_path'], 'r', encoding='utf8') as f:
                    try:
                        for i, example in enumerate(f):
                            jsonobj = json.loads(example)
                            jsonobj['dataset_name'] = each_dataset
                            self.raw_dataset.append(jsonobj)  ## + dataset_name
                    except:
                        print(f"failed to load data from {each_dataset}.json, exiting...")
                        exit()

            random.shuffle(self.raw_dataset)

        if self.is_finetune:
            self.val_raw_dataset = []
            for each_dataset in self.val_dataset_config:
                dataset_length = sum([1 for line in open(self.val_dataset_config[each_dataset]['data_path'], 'r', encoding='utf8')])
                dataset_length_limit = self.val_dataset_config[each_dataset]['size'] if isinstance(self.val_dataset_config[each_dataset]['size'], int) else int(self.val_dataset_config[each_dataset]['size'] * dataset_length)
                with open(self.val_dataset_config[each_dataset]['data_path'], 'r', encoding='utf8') as f:
                    for i, example in enumerate(f):
                        if i >= dataset_length_limit:
                            break
                        self.val_raw_dataset.append(json.loads(example))

            random.shuffle(self.val_raw_dataset)

    def prepare_data(self) -> None:
        AutoTokenizer.from_pretrained(self.model_name)

    def train_dataloader(self):
        if self.sample_mode == 'seq':
            return DataLoader(self.dataset['train'], batch_size=self.train_bach_size, shuffle=True, num_workers=self.num_workers)
        elif self.sample_mode == 'proportion':
            return DataLoader(self.dataset['train'], batch_size=self.train_bach_size, sampler=PropSampler(self.raw_dataset[:int(self.train_eval_split*len(self.raw_dataset))]), num_workers=self.num_workers)

    def val_dataloader(self):
        return DataLoader(self.dataset['test'], batch_size=self.eval_batch_size, shuffle=False, num_workers=self.num_workers)
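`DSTDataLoader` above expects a `dataset_config` dict keyed by dataset name, where each entry provides the `data_path` and `size` fields read in `init_training_set()`. A minimal wiring sketch; the dataset names, paths, and sizes here are assumptions for illustration, not the project's actual training configuration:

# Sketch only: paths/sizes are placeholders matching the {'data_path', 'size'} keys above.
dataset_config = {
    'squad_v2_new': {'data_path': './data/training/squad_v2_new.json', 'size': 1.0},
    'mnli':         {'data_path': './data/training/mnli.json',         'size': 500000},
}
dm = DSTDataLoader(dataset_config=dataset_config, sample_mode='seq',
                   model_name='roberta-base', train_batch_size=32)
dm.setup()
batch = next(iter(dm.train_dataloader()))
print(batch['input_ids'].shape, batch['align_label'].shape)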
alignscore/src/alignscore/inference.py
ADDED
@@ -0,0 +1,293 @@
from logging import warning
import spacy
from nltk.tokenize import sent_tokenize
import torch
from .model import BERTAlignModel
from transformers import AutoConfig, AutoTokenizer
import torch.nn as nn
from tqdm import tqdm

class Inferencer():
    def __init__(self, ckpt_path, model='bert-base-uncased', batch_size=32, device='cuda', verbose=True) -> None:
        self.device = device
        if ckpt_path is not None:
            self.model = BERTAlignModel(model=model).load_from_checkpoint(checkpoint_path=ckpt_path, strict=False).to(self.device)
        else:
            warning('loading UNTRAINED model!')
            self.model = BERTAlignModel(model=model).to(self.device)
        self.model.eval()
        self.batch_size = batch_size

        self.config = AutoConfig.from_pretrained(model)
        self.tokenizer = AutoTokenizer.from_pretrained(model)
        self.spacy = spacy.load('en_core_web_sm')

        self.loss_fct = nn.CrossEntropyLoss(reduction='none')
        self.softmax = nn.Softmax(dim=-1)

        self.smart_type = 'smart-n'
        self.smart_n_metric = 'f1'

        self.disable_progress_bar_in_inference = False

        self.nlg_eval_mode = None  # bin, bin_sp, nli, nli_sp
        self.verbose = verbose

    def inference_example_batch(self, premise: list, hypo: list):
        """
        inference an example,
        premise: list
        hypo: list
        using self.inference to batch the process

        SummaC Style aggregation
        """
        self.disable_progress_bar_in_inference = True
        assert len(premise) == len(hypo), "Premise must have the same length as Hypothesis!"

        out_score = []
        for one_pre, one_hypo in tqdm(zip(premise, hypo), desc="Evaluating", total=len(premise), disable=(not self.verbose)):
            out_score.append(self.inference_per_example(one_pre, one_hypo))

        return None, torch.tensor(out_score), None

    def inference_per_example(self, premise: str, hypo: str):
        """
        inference an example,
        premise: string
        hypo: string
        using self.inference to batch the process
        """
        def chunks(lst, n):
            """Yield successive n-sized chunks from lst."""
            for i in range(0, len(lst), n):
                yield ' '.join(lst[i:i + n])

        premise_sents = sent_tokenize(premise)
        premise_sents = premise_sents or ['']

        n_chunk = len(premise.strip().split()) // 350 + 1
        n_chunk = max(len(premise_sents) // n_chunk, 1)
        premise_sents = [each for each in chunks(premise_sents, n_chunk)]

        hypo_sents = sent_tokenize(hypo)

        premise_sent_mat = []
        hypo_sents_mat = []
        for i in range(len(premise_sents)):
            for j in range(len(hypo_sents)):
                premise_sent_mat.append(premise_sents[i])
                hypo_sents_mat.append(hypo_sents[j])

        if self.nlg_eval_mode is not None:
            if self.nlg_eval_mode == 'nli_sp':
                output_score = self.inference(premise_sent_mat, hypo_sents_mat)[2][:, 0]  ### use NLI head OR ALIGN head
            elif self.nlg_eval_mode == 'bin_sp':
                output_score = self.inference(premise_sent_mat, hypo_sents_mat)[1]  ### use NLI head OR ALIGN head
            elif self.nlg_eval_mode == 'reg_sp':
                output_score = self.inference(premise_sent_mat, hypo_sents_mat)[0]  ### use NLI head OR ALIGN head

            output_score = output_score.view(len(premise_sents), len(hypo_sents)).max(dim=0).values.mean().item()  ### sum or mean depends on the task/aspect
            return output_score

        output_score = self.inference(premise_sent_mat, hypo_sents_mat)[2][:, 0]  ### use NLI head OR ALIGN head
        output_score = output_score.view(len(premise_sents), len(hypo_sents)).max(dim=0).values.mean().item()  ### sum or mean depends on the task/aspect

        return output_score

    def inference(self, premise, hypo):
        """
        inference a list of premise and hypo

        Standard aggregation
        """
        if isinstance(premise, str) and isinstance(hypo, str):
            premise = [premise]
            hypo = [hypo]

        batch = self.batch_tokenize(premise, hypo)
        output_score_reg = []
        output_score_bin = []
        output_score_tri = []

        for mini_batch in tqdm(batch, desc="Evaluating", disable=not self.verbose or self.disable_progress_bar_in_inference):
            mini_batch = mini_batch.to(self.device)
            with torch.no_grad():
                model_output = self.model(mini_batch)
                model_output_reg = model_output.reg_label_logits.cpu()
                model_output_bin = model_output.seq_relationship_logits  # Temperature Scaling / 2.5
                model_output_tri = model_output.tri_label_logits

                model_output_bin = self.softmax(model_output_bin).cpu()
                model_output_tri = self.softmax(model_output_tri).cpu()
            output_score_reg.append(model_output_reg[:, 0])
            output_score_bin.append(model_output_bin[:, 1])
            output_score_tri.append(model_output_tri[:, :])

        output_score_reg = torch.cat(output_score_reg)
        output_score_bin = torch.cat(output_score_bin)
        output_score_tri = torch.cat(output_score_tri)

        if self.nlg_eval_mode is not None:
            if self.nlg_eval_mode == 'nli':
                output_score_nli = output_score_tri[:, 0]
                return None, output_score_nli, None
            elif self.nlg_eval_mode == 'bin':
                return None, output_score_bin, None
            elif self.nlg_eval_mode == 'reg':
                return None, output_score_reg, None
            else:
                ValueError("unrecognized nlg eval mode")

        return output_score_reg, output_score_bin, output_score_tri

    def inference_reg(self, premise, hypo):
        """
        inference a list of premise and hypo

        Standard aggregation
        """
        self.model.is_reg_finetune = True
        if isinstance(premise, str) and isinstance(hypo, str):
            premise = [premise]
            hypo = [hypo]

        batch = self.batch_tokenize(premise, hypo)
        output_score = []

        for mini_batch in tqdm(batch, desc="Evaluating", disable=self.disable_progress_bar_in_inference):
            mini_batch = mini_batch.to(self.device)
            with torch.no_grad():
                model_output = self.model(mini_batch).seq_relationship_logits.cpu().view(-1)
            output_score.append(model_output)
        output_score = torch.cat(output_score)
        return output_score

    def batch_tokenize(self, premise, hypo):
        """
        input premise and hypos are lists
        """
        assert isinstance(premise, list) and isinstance(hypo, list)
        assert len(premise) == len(hypo), "premise and hypo should be in the same length."

        batch = []
        for mini_batch_pre, mini_batch_hypo in zip(self.chunks(premise, self.batch_size), self.chunks(hypo, self.batch_size)):
            try:
                mini_batch = self.tokenizer(mini_batch_pre, mini_batch_hypo, truncation='only_first', padding='max_length', max_length=self.tokenizer.model_max_length, return_tensors='pt')
            except:
                warning('text_b too long...')
                mini_batch = self.tokenizer(mini_batch_pre, mini_batch_hypo, truncation=True, padding='max_length', max_length=self.tokenizer.model_max_length, return_tensors='pt')
            batch.append(mini_batch)

        return batch

    def smart_doc(self, premise: list, hypo: list):
        """
        inference an example,
        premise: list
        hypo: list
        using self.inference to batch the process

        SMART Style aggregation
        """
        self.disable_progress_bar_in_inference = True
        assert len(premise) == len(hypo), "Premise must have the same length as Hypothesis!"
        assert self.smart_type in ['smart-n', 'smart-l']

        out_score = []
        for one_pre, one_hypo in tqdm(zip(premise, hypo), desc="Evaluating SMART", total=len(premise)):
            out_score.append(self.smart_l(one_pre, one_hypo)[1] if self.smart_type == 'smart-l' else self.smart_n(one_pre, one_hypo)[1])

        return None, torch.tensor(out_score), None

    def smart_l(self, premise, hypo):
        premise_sents = [each.text for each in self.spacy(premise).sents]
        hypo_sents = [each.text for each in self.spacy(hypo).sents]

        premise_sent_mat = []
        hypo_sents_mat = []
        for i in range(len(premise_sents)):
            for j in range(len(hypo_sents)):
                premise_sent_mat.append(premise_sents[i])
                hypo_sents_mat.append(hypo_sents[j])

        output_score = self.inference(premise_sent_mat, hypo_sents_mat)[2][:, 0]
        output_score = output_score.view(len(premise_sents), len(hypo_sents))

        ### smart-l
        lcs = [[0] * (len(hypo_sents)+1)] * (len(premise_sents)+1)
        for i in range(len(premise_sents)+1):
            for j in range(len(hypo_sents)+1):
                if i != 0 and j != 0:
                    m = output_score[i-1, j-1]
                    lcs[i][j] = max([lcs[i-1][j-1]+m,
                                     lcs[i-1][j]+m,
                                     lcs[i][j-1]])

        return None, lcs[-1][-1] / len(premise_sents), None

    def smart_n(self, premise, hypo):
        ### smart-n
        n_gram = 1

        premise_sents = [each.text for each in self.spacy(premise).sents]
        hypo_sents = [each.text for each in self.spacy(hypo).sents]

        premise_sent_mat = []
        hypo_sents_mat = []
        for i in range(len(premise_sents)):
            for j in range(len(hypo_sents)):
                premise_sent_mat.append(premise_sents[i])
                hypo_sents_mat.append(hypo_sents[j])

        output_score = self.inference(premise_sent_mat, hypo_sents_mat)[2][:, 0]
        output_score = output_score.view(len(premise_sents), len(hypo_sents))

        prec = sum([max([sum([output_score[i+n, j+n]/n_gram for n in range(0, n_gram)]) for i in range(len(premise_sents)-n_gram+1)]) for j in range(len(hypo_sents)-n_gram+1)])
        prec = prec / (len(hypo_sents) - n_gram + 1) if (len(hypo_sents) - n_gram + 1) > 0 else 0.

        premise_sents = [each.text for each in self.spacy(hypo).sents]  # simple change
        hypo_sents = [each.text for each in self.spacy(premise).sents]  #

        premise_sent_mat = []
        hypo_sents_mat = []
        for i in range(len(premise_sents)):
            for j in range(len(hypo_sents)):
                premise_sent_mat.append(premise_sents[i])
                hypo_sents_mat.append(hypo_sents[j])

        output_score = self.inference(premise_sent_mat, hypo_sents_mat)[2][:, 0]
        output_score = output_score.view(len(premise_sents), len(hypo_sents))

        recall = sum([max([sum([output_score[i+n, j+n]/n_gram for n in range(0, n_gram)]) for i in range(len(premise_sents)-n_gram+1)]) for j in range(len(hypo_sents)-n_gram+1)])
        recall = prec / (len(hypo_sents) - n_gram + 1) if (len(hypo_sents) - n_gram + 1) > 0 else 0.

        f1 = 2 * prec * recall / (prec + recall)

        if self.smart_n_metric == 'f1':
            return None, f1, None
        elif self.smart_n_metric == 'precision':
            return None, prec, None
        elif self.smart_n_metric == 'recall':
            return None, recall, None
        else:
            ValueError("SMART return type error")

    def chunks(self, lst, n):
        """Yield successive n-sized chunks from lst."""
        for i in range(0, len(lst), n):
            yield lst[i:i + n]

    def nlg_eval(self, premise, hypo):
        assert self.nlg_eval_mode is not None, "Select NLG Eval mode!"
        if (self.nlg_eval_mode == 'bin') or (self.nlg_eval_mode == 'nli') or (self.nlg_eval_mode == 'reg'):
            return self.inference(premise, hypo)

        elif (self.nlg_eval_mode == 'bin_sp') or (self.nlg_eval_mode == 'nli_sp') or (self.nlg_eval_mode == 'reg_sp'):
            return self.inference_example_batch(premise, hypo)

        else:
            ValueError("Unrecognized NLG Eval mode!")
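The `nli_sp` path in `inference_per_example` above reduces a (premise-chunk × claim-sentence) score matrix by taking the max over premise chunks for each claim sentence and then averaging over claim sentences. A standalone illustration of that aggregation on a dummy score matrix (the numbers are made up):

import torch

# Dummy alignment scores for 3 premise chunks x 2 claim sentences (made-up values).
scores = torch.tensor([[0.9, 0.2],
                       [0.1, 0.8],
                       [0.3, 0.4]])
# Max over premise chunks: each claim sentence keeps its best-supported score...
per_sentence = scores.max(dim=0).values        # tensor([0.9, 0.8])
# ...then the mean over claim sentences gives the example-level score.
example_score = per_sentence.mean().item()     # 0.85
print(per_sentence, example_score)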
alignscore/src/alignscore/model.py
ADDED
@@ -0,0 +1,308 @@
import math
from typing import Optional, Tuple
from transformers import AdamW, get_linear_schedule_with_warmup, AutoConfig
from transformers import BertForPreTraining, BertModel, RobertaModel, AlbertModel, AlbertForMaskedLM, RobertaForMaskedLM
import torch
import torch.nn as nn
import pytorch_lightning as pl
from sklearn.metrics import f1_score
from dataclasses import dataclass


class BERTAlignModel(pl.LightningModule):
    def __init__(self, model='bert-base-uncased', using_pretrained=True, *args, **kwargs) -> None:
        super().__init__()
        # Already defined in lightning: self.device
        self.save_hyperparameters()
        self.model = model

        if 'muppet' in model:
            assert using_pretrained == True, "Only support pretrained muppet!"
            self.base_model = RobertaModel.from_pretrained(model)
            self.mlm_head = RobertaForMaskedLM(AutoConfig.from_pretrained(model)).lm_head

        elif 'roberta' in model:
            if using_pretrained:
                self.base_model = RobertaModel.from_pretrained(model)
                self.mlm_head = RobertaForMaskedLM.from_pretrained(model).lm_head
            else:
                self.base_model = RobertaModel(AutoConfig.from_pretrained(model))
                self.mlm_head = RobertaForMaskedLM(AutoConfig.from_pretrained(model)).lm_head

        elif 'albert' in model:
            if using_pretrained:
                self.base_model = AlbertModel.from_pretrained(model)
                self.mlm_head = AlbertForMaskedLM.from_pretrained(model).predictions
            else:
                self.base_model = AlbertModel(AutoConfig.from_pretrained(model))
                self.mlm_head = AlbertForMaskedLM(AutoConfig.from_pretrained(model)).predictions

        elif 'bert' in model:
            if using_pretrained:
                self.base_model = BertModel.from_pretrained(model)
                self.mlm_head = BertForPreTraining.from_pretrained(model).cls.predictions
            else:
                self.base_model = BertModel(AutoConfig.from_pretrained(model))
                self.mlm_head = BertForPreTraining(AutoConfig.from_pretrained(model)).cls.predictions

        elif 'electra' in model:
            self.generator = BertModel(AutoConfig.from_pretrained('prajjwal1/bert-small'))
            self.generator_mlm = BertForPreTraining(AutoConfig.from_pretrained('prajjwal1/bert-small')).cls.predictions

            self.base_model = BertModel(AutoConfig.from_pretrained('bert-base-uncased'))
            self.discriminator_predictor = ElectraDiscriminatorPredictions(self.base_model.config)

        self.bin_layer = nn.Linear(self.base_model.config.hidden_size, 2)
        self.tri_layer = nn.Linear(self.base_model.config.hidden_size, 3)
        self.reg_layer = nn.Linear(self.base_model.config.hidden_size, 1)

        self.dropout = nn.Dropout(p=0.1)

        self.need_mlm = True
        self.is_finetune = False
        self.mlm_loss_factor = 0.5

        self.softmax = nn.Softmax(dim=-1)

    def forward(self, batch):
        if 'electra' in self.model:
            return self.electra_forward(batch)
        base_model_output = self.base_model(
            input_ids = batch['input_ids'],
            attention_mask = batch['attention_mask'],
            token_type_ids = batch['token_type_ids'] if 'token_type_ids' in batch.keys() else None
        )

        prediction_scores = self.mlm_head(base_model_output.last_hidden_state)  ## sequence_output for mlm
        seq_relationship_score = self.bin_layer(self.dropout(base_model_output.pooler_output))  ## pooled output for classification
        tri_label_score = self.tri_layer(self.dropout(base_model_output.pooler_output))
        reg_label_score = self.reg_layer(base_model_output.pooler_output)

        total_loss = None
        if 'mlm_label' in batch.keys():  ### 'mlm_label' and 'align_label' when training
            ce_loss_fct = nn.CrossEntropyLoss(reduction='sum')
            masked_lm_loss = ce_loss_fct(prediction_scores.view(-1, self.base_model.config.vocab_size), batch['mlm_label'].view(-1))  # / vocabulary
            next_sentence_loss = ce_loss_fct(seq_relationship_score.view(-1, 2), batch['align_label'].view(-1)) / math.log(2)
            tri_label_loss = ce_loss_fct(tri_label_score.view(-1, 3), batch['tri_label'].view(-1)) / math.log(3)
            reg_label_loss = self.mse_loss(reg_label_score.view(-1), batch['reg_label'].view(-1), reduction='sum')

            masked_lm_loss_num = torch.sum(batch['mlm_label'].view(-1) != -100)
            next_sentence_loss_num = torch.sum(batch['align_label'].view(-1) != -100)
            tri_label_loss_num = torch.sum(batch['tri_label'].view(-1) != -100)
            reg_label_loss_num = torch.sum(batch['reg_label'].view(-1) != -100.0)

        return ModelOutput(
            loss=total_loss,
            all_loss=[masked_lm_loss, next_sentence_loss, tri_label_loss, reg_label_loss] if 'mlm_label' in batch.keys() else None,
            loss_nums=[masked_lm_loss_num, next_sentence_loss_num, tri_label_loss_num, reg_label_loss_num] if 'mlm_label' in batch.keys() else None,
            prediction_logits=prediction_scores,
            seq_relationship_logits=seq_relationship_score,
            tri_label_logits=tri_label_score,
            reg_label_logits=reg_label_score,
            hidden_states=base_model_output.hidden_states,
            attentions=base_model_output.attentions
        )

    def electra_forward(self, batch):
        if 'mlm_label' in batch.keys():
            ce_loss_fct = nn.CrossEntropyLoss()
            generator_output = self.generator_mlm(self.generator(
                input_ids = batch['input_ids'],
                attention_mask = batch['attention_mask'],
                token_type_ids = batch['token_type_ids'] if 'token_type_ids' in batch.keys() else None
            ).last_hidden_state)
            masked_lm_loss = ce_loss_fct(generator_output.view(-1, self.generator.config.vocab_size), batch['mlm_label'].view(-1))

            hallucinated_tokens = batch['input_ids'].clone()

            hallucinated_tokens[batch['mlm_label']!=-100] = torch.argmax(generator_output, dim=-1)[batch['mlm_label']!=-100]
            replaced_token_label = (batch['input_ids'] == hallucinated_tokens).long()  #.type(torch.LongTensor) #[batch['mlm_label'] == -100] = -100
            replaced_token_label[batch['mlm_label']!=-100] = (batch['mlm_label'] == hallucinated_tokens)[batch['mlm_label']!=-100].long()
            replaced_token_label[batch['input_ids'] == 0] = -100  ### ignore paddings

            base_model_output = self.base_model(
+
self.save_hyperparameters()
|
18 |
+
self.model = model
|
19 |
+
|
20 |
+
if 'muppet' in model:
|
21 |
+
assert using_pretrained == True, "Only support pretrained muppet!"
|
22 |
+
self.base_model = RobertaModel.from_pretrained(model)
|
23 |
+
self.mlm_head = RobertaForMaskedLM(AutoConfig.from_pretrained(model)).lm_head
|
24 |
+
|
25 |
+
elif 'roberta' in model:
|
26 |
+
if using_pretrained:
|
27 |
+
self.base_model = RobertaModel.from_pretrained(model)
|
28 |
+
self.mlm_head = RobertaForMaskedLM.from_pretrained(model).lm_head
|
29 |
+
else:
|
30 |
+
self.base_model = RobertaModel(AutoConfig.from_pretrained(model))
|
31 |
+
self.mlm_head = RobertaForMaskedLM(AutoConfig.from_pretrained(model)).lm_head
|
32 |
+
|
33 |
+
elif 'albert' in model:
|
34 |
+
if using_pretrained:
|
35 |
+
self.base_model = AlbertModel.from_pretrained(model)
|
36 |
+
self.mlm_head = AlbertForMaskedLM.from_pretrained(model).predictions
|
37 |
+
else:
|
38 |
+
self.base_model = AlbertModel(AutoConfig.from_pretrained(model))
|
39 |
+
self.mlm_head = AlbertForMaskedLM(AutoConfig.from_pretrained(model)).predictions
|
40 |
+
|
41 |
+
elif 'bert' in model:
|
42 |
+
if using_pretrained:
|
43 |
+
self.base_model = BertModel.from_pretrained(model)
|
44 |
+
self.mlm_head = BertForPreTraining.from_pretrained(model).cls.predictions
|
45 |
+
else:
|
46 |
+
self.base_model = BertModel(AutoConfig.from_pretrained(model))
|
47 |
+
self.mlm_head = BertForPreTraining(AutoConfig.from_pretrained(model)).cls.predictions
|
48 |
+
|
49 |
+
elif 'electra' in model:
|
50 |
+
self.generator = BertModel(AutoConfig.from_pretrained('prajjwal1/bert-small'))
|
51 |
+
self.generator_mlm = BertForPreTraining(AutoConfig.from_pretrained('prajjwal1/bert-small')).cls.predictions
|
52 |
+
|
53 |
+
self.base_model = BertModel(AutoConfig.from_pretrained('bert-base-uncased'))
|
54 |
+
self.discriminator_predictor = ElectraDiscriminatorPredictions(self.base_model.config)
|
55 |
+
|
56 |
+
|
57 |
+
self.bin_layer = nn.Linear(self.base_model.config.hidden_size, 2)
|
58 |
+
self.tri_layer = nn.Linear(self.base_model.config.hidden_size, 3)
|
59 |
+
self.reg_layer = nn.Linear(self.base_model.config.hidden_size, 1)
|
60 |
+
|
61 |
+
self.dropout = nn.Dropout(p=0.1)
|
62 |
+
|
63 |
+
self.need_mlm = True
|
64 |
+
self.is_finetune = False
|
65 |
+
self.mlm_loss_factor = 0.5
|
66 |
+
|
67 |
+
self.softmax = nn.Softmax(dim=-1)
|
68 |
+
|
    def forward(self, batch):
        if 'electra' in self.model:
            return self.electra_forward(batch)
        base_model_output = self.base_model(
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask'],
            token_type_ids=batch['token_type_ids'] if 'token_type_ids' in batch.keys() else None
        )

        prediction_scores = self.mlm_head(base_model_output.last_hidden_state)  ## sequence_output for mlm
        seq_relationship_score = self.bin_layer(self.dropout(base_model_output.pooler_output))  ## pooled output for classification
        tri_label_score = self.tri_layer(self.dropout(base_model_output.pooler_output))
        reg_label_score = self.reg_layer(base_model_output.pooler_output)

        total_loss = None
        if 'mlm_label' in batch.keys():  ### 'mlm_label' and 'align_label' are present when training
            # Losses are summed here and normalized by the number of non-ignored
            # (-100) targets in training_step_end / validation_step_end.
            ce_loss_fct = nn.CrossEntropyLoss(reduction='sum')
            masked_lm_loss = ce_loss_fct(prediction_scores.view(-1, self.base_model.config.vocab_size), batch['mlm_label'].view(-1))
            next_sentence_loss = ce_loss_fct(seq_relationship_score.view(-1, 2), batch['align_label'].view(-1)) / math.log(2)
            tri_label_loss = ce_loss_fct(tri_label_score.view(-1, 3), batch['tri_label'].view(-1)) / math.log(3)
            reg_label_loss = self.mse_loss(reg_label_score.view(-1), batch['reg_label'].view(-1), reduction='sum')

            masked_lm_loss_num = torch.sum(batch['mlm_label'].view(-1) != -100)
            next_sentence_loss_num = torch.sum(batch['align_label'].view(-1) != -100)
            tri_label_loss_num = torch.sum(batch['tri_label'].view(-1) != -100)
            reg_label_loss_num = torch.sum(batch['reg_label'].view(-1) != -100.0)

        return ModelOutput(
            loss=total_loss,
            all_loss=[masked_lm_loss, next_sentence_loss, tri_label_loss, reg_label_loss] if 'mlm_label' in batch.keys() else None,
            loss_nums=[masked_lm_loss_num, next_sentence_loss_num, tri_label_loss_num, reg_label_loss_num] if 'mlm_label' in batch.keys() else None,
            prediction_logits=prediction_scores,
            seq_relationship_logits=seq_relationship_score,
            tri_label_logits=tri_label_score,
            reg_label_logits=reg_label_score,
            hidden_states=base_model_output.hidden_states,
            attentions=base_model_output.attentions
        )
    def electra_forward(self, batch):
        if 'mlm_label' in batch.keys():
            ce_loss_fct = nn.CrossEntropyLoss()
            generator_output = self.generator_mlm(self.generator(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                token_type_ids=batch['token_type_ids'] if 'token_type_ids' in batch.keys() else None
            ).last_hidden_state)
            masked_lm_loss = ce_loss_fct(generator_output.view(-1, self.generator.config.vocab_size), batch['mlm_label'].view(-1))

            hallucinated_tokens = batch['input_ids'].clone()

            hallucinated_tokens[batch['mlm_label'] != -100] = torch.argmax(generator_output, dim=-1)[batch['mlm_label'] != -100]
            replaced_token_label = (batch['input_ids'] == hallucinated_tokens).long()  # .type(torch.LongTensor) #[batch['mlm_label'] == -100] = -100
            replaced_token_label[batch['mlm_label'] != -100] = (batch['mlm_label'] == hallucinated_tokens)[batch['mlm_label'] != -100].long()
            replaced_token_label[batch['input_ids'] == 0] = -100  ### ignore paddings

        base_model_output = self.base_model(
            input_ids=hallucinated_tokens if 'mlm_label' in batch.keys() else batch['input_ids'],
            attention_mask=batch['attention_mask'],
            token_type_ids=batch['token_type_ids'] if 'token_type_ids' in batch.keys() else None
        )
        hallu_detect_score = self.discriminator_predictor(base_model_output.last_hidden_state)
        seq_relationship_score = self.bin_layer(self.dropout(base_model_output.pooler_output))  ## pooled output for classification
        tri_label_score = self.tri_layer(self.dropout(base_model_output.pooler_output))
        reg_label_score = self.reg_layer(base_model_output.pooler_output)

        total_loss = None

        if 'mlm_label' in batch.keys():  ### 'mlm_label' and 'align_label' when training
            total_loss = []
            ce_loss_fct = nn.CrossEntropyLoss()
            hallu_detect_loss = ce_loss_fct(hallu_detect_score.view(-1, 2), replaced_token_label.view(-1))
            next_sentence_loss = ce_loss_fct(seq_relationship_score.view(-1, 2), batch['align_label'].view(-1))
            tri_label_loss = ce_loss_fct(tri_label_score.view(-1, 3), batch['tri_label'].view(-1))
            reg_label_loss = self.mse_loss(reg_label_score.view(-1), batch['reg_label'].view(-1))

            total_loss.append(10.0 * hallu_detect_loss if not torch.isnan(hallu_detect_loss).item() else 0.)
            total_loss.append(0.2 * masked_lm_loss if (not torch.isnan(masked_lm_loss).item() and self.need_mlm) else 0.)
            total_loss.append(next_sentence_loss if not torch.isnan(next_sentence_loss).item() else 0.)
            total_loss.append(tri_label_loss if not torch.isnan(tri_label_loss).item() else 0.)
            total_loss.append(reg_label_loss if not torch.isnan(reg_label_loss).item() else 0.)

            total_loss = sum(total_loss)

        return ModelOutput(
            loss=total_loss,
            all_loss=[masked_lm_loss, next_sentence_loss, tri_label_loss, reg_label_loss, hallu_detect_loss] if 'mlm_label' in batch.keys() else None,
            prediction_logits=hallu_detect_score,
            seq_relationship_logits=seq_relationship_score,
            tri_label_logits=tri_label_score,
            reg_label_logits=reg_label_score,
            hidden_states=base_model_output.hidden_states,
            attentions=base_model_output.attentions
        )
    def training_step(self, train_batch, batch_idx):
        output = self(train_batch)

        return {'losses': output.all_loss, 'loss_nums': output.loss_nums}

    def training_step_end(self, step_output):
        losses = step_output['losses']
        loss_nums = step_output['loss_nums']
        assert len(loss_nums) == len(losses), 'loss_num should be the same length as losses'

        loss_mlm_num = torch.sum(loss_nums[0])
        loss_bin_num = torch.sum(loss_nums[1])
        loss_tri_num = torch.sum(loss_nums[2])
        loss_reg_num = torch.sum(loss_nums[3])

        loss_mlm = torch.sum(losses[0]) / loss_mlm_num if loss_mlm_num > 0 else 0.
        loss_bin = torch.sum(losses[1]) / loss_bin_num if loss_bin_num > 0 else 0.
        loss_tri = torch.sum(losses[2]) / loss_tri_num if loss_tri_num > 0 else 0.
        loss_reg = torch.sum(losses[3]) / loss_reg_num if loss_reg_num > 0 else 0.

        total_loss = self.mlm_loss_factor * loss_mlm + loss_bin + loss_tri + loss_reg

        self.log('train_loss', total_loss)  # , sync_dist=True
        self.log('mlm_loss', loss_mlm)
        self.log('bin_label_loss', loss_bin)
        self.log('tri_label_loss', loss_tri)
        self.log('reg_label_loss', loss_reg)

        return total_loss
    def validation_step(self, val_batch, batch_idx):
        if not self.is_finetune:
            # Pre-training validation: return per-task losses, normalized in validation_step_end.
            with torch.no_grad():
                output = self(val_batch)

            return {'losses': output.all_loss, 'loss_nums': output.loss_nums}

        # Fine-tuning validation: report binary alignment predictions for F1.
        with torch.no_grad():
            output = self(val_batch)['seq_relationship_logits']
            output = self.softmax(output)[:, 1].tolist()
            pred = [int(align_prob > 0.5) for align_prob in output]

            labels = val_batch['align_label'].tolist()

        return {"pred": pred, 'labels': labels}

    def validation_step_end(self, step_output):
        losses = step_output['losses']
        loss_nums = step_output['loss_nums']
        assert len(loss_nums) == len(losses), 'loss_num should be the same length as losses'

        loss_mlm_num = torch.sum(loss_nums[0])
        loss_bin_num = torch.sum(loss_nums[1])
        loss_tri_num = torch.sum(loss_nums[2])
        loss_reg_num = torch.sum(loss_nums[3])

        loss_mlm = torch.sum(losses[0]) / loss_mlm_num if loss_mlm_num > 0 else 0.
        loss_bin = torch.sum(losses[1]) / loss_bin_num if loss_bin_num > 0 else 0.
        loss_tri = torch.sum(losses[2]) / loss_tri_num if loss_tri_num > 0 else 0.
        loss_reg = torch.sum(losses[3]) / loss_reg_num if loss_reg_num > 0 else 0.

        total_loss = self.mlm_loss_factor * loss_mlm + loss_bin + loss_tri + loss_reg

        # Note: the metric names below mirror training_step_end.
        self.log('train_loss', total_loss)  # , sync_dist=True
        self.log('mlm_loss', loss_mlm)
        self.log('bin_label_loss', loss_bin)
        self.log('tri_label_loss', loss_tri)
        self.log('reg_label_loss', loss_reg)

        return total_loss

    def validation_epoch_end(self, outputs):
        if not self.is_finetune:
            total_loss = torch.stack(outputs).mean()
            self.log("val_loss", total_loss, prog_bar=True, sync_dist=True)

        else:
            all_predictions = []
            all_labels = []
            for each_output in outputs:
                all_predictions.extend(each_output['pred'])
                all_labels.extend(each_output['labels'])

            self.log("f1", f1_score(all_labels, all_predictions), prog_bar=True, sync_dist=True)
    def configure_optimizers(self):
        """Prepare optimizer and schedule (linear warmup and decay)"""
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in self.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": self.hparams.weight_decay,
            },
            {
                "params": [p for n, p in self.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
        optimizer = AdamW(optimizer_grouped_parameters, lr=self.hparams.learning_rate, eps=self.hparams.adam_epsilon)

        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=int(self.hparams.warmup_steps_portion * self.trainer.estimated_stepping_batches),
            num_training_steps=self.trainer.estimated_stepping_batches,
        )
        scheduler = {"scheduler": scheduler, "interval": "step", "frequency": 1}
        return [optimizer], [scheduler]

    def mse_loss(self, input, target, ignored_index=-100.0, reduction='mean'):
        mask = (target == ignored_index)
        out = (input[~mask] - target[~mask]) ** 2
        if reduction == "mean":
            return out.mean()
        elif reduction == "sum":
            return out.sum()
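    # Illustrative behaviour of the masked regression loss above (hypothetical values):
    #   mse_loss(input=torch.tensor([0.2, 0.9]), target=torch.tensor([0.0, -100.0]))
    #   -> only the first element contributes; targets equal to -100.0 are ignored,
    #      so the mean reduction returns (0.2 - 0.0) ** 2 = 0.04.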
class ElectraDiscriminatorPredictions(nn.Module):
    """Prediction module for the discriminator, made up of two dense layers."""

    def __init__(self, config):
        super().__init__()

        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dense_prediction = nn.Linear(config.hidden_size, 2)
        self.config = config
        self.gelu = nn.GELU()

    def forward(self, discriminator_hidden_states):
        hidden_states = self.dense(discriminator_hidden_states)
        hidden_states = self.gelu(hidden_states)
        logits = self.dense_prediction(hidden_states).squeeze(-1)

        return logits


@dataclass
class ModelOutput():
    loss: Optional[torch.FloatTensor] = None
    all_loss: Optional[list] = None
    loss_nums: Optional[list] = None
    prediction_logits: torch.FloatTensor = None
    seq_relationship_logits: torch.FloatTensor = None
    tri_label_logits: torch.FloatTensor = None
    reg_label_logits: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None
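For orientation, here is a minimal sketch of how the binary alignment head of BERTAlignModel can be queried directly. It assumes a RoBERTa-style backbone; the sentence pair, the 0/1 reading of the softmax, and the checkpoint handling are illustrative only, and the packaged wrapper under src/alignscore/inference.py is the intended entry point.

    # Minimal inference sketch (assumptions: RoBERTa backbone, no trained checkpoint loaded here).
    import torch
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('roberta-large')
    model = BERTAlignModel(model='roberta-large').eval()
    # A trained AlignScore checkpoint would normally be restored with
    # BERTAlignModel.load_from_checkpoint(<ckpt path>) instead.

    batch = tokenizer(
        "The cat sat on the mat.",          # context a
        "A dog is sleeping on the sofa.",   # claim b
        return_tensors='pt', truncation=True, max_length=512
    )
    with torch.no_grad():
        logits = model(batch).seq_relationship_logits
        align_prob = model.softmax(logits)[:, 1].item()  # probability that b is supported by a
    print(align_prob)

The same softmax-over-two-logits reading is what validation_step uses when the model is fine-tuned (is_finetune = True).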
alignscore/train.py
ADDED
@@ -0,0 +1,144 @@
from pytorch_lightning import Trainer, seed_everything
from alignscore.dataloader import DSTDataLoader
from alignscore.model import BERTAlignModel
from pytorch_lightning.callbacks import ModelCheckpoint
from argparse import ArgumentParser
import os

def train(datasets, args):
    dm = DSTDataLoader(
        dataset_config=datasets,
        model_name=args.model_name,
        sample_mode='seq',
        train_batch_size=args.batch_size,
        eval_batch_size=16,
        num_workers=args.num_workers,
        train_eval_split=0.95,
        need_mlm=args.do_mlm
    )
    dm.setup()

    model = BERTAlignModel(model=args.model_name, using_pretrained=args.use_pretrained_model,
                           adam_epsilon=args.adam_epsilon,
                           learning_rate=args.learning_rate,
                           weight_decay=args.weight_decay,
                           warmup_steps_portion=args.warm_up_proportion
                           )
    model.need_mlm = args.do_mlm

    training_dataset_used = '_'.join(datasets.keys())
    checkpoint_name = '_'.join((
        f"{args.ckpt_comment}{args.model_name.replace('/', '-')}",
        f"{'scratch_' if not args.use_pretrained_model else ''}{'no_mlm_' if not args.do_mlm else ''}{training_dataset_used}",
        str(args.max_samples_per_dataset),
        f"{args.batch_size}x{len(args.devices)}x{args.accumulate_grad_batch}"
    ))

    checkpoint_callback = ModelCheckpoint(
        dirpath=args.ckpt_save_path,
        filename=checkpoint_name + "_{epoch:02d}_{step}",
        every_n_train_steps=10000,
        save_top_k=1
    )
    trainer = Trainer(
        accelerator='gpu',
        max_epochs=args.num_epoch,
        devices=args.devices,
        strategy="dp",
        precision=32,
        callbacks=[checkpoint_callback],
        accumulate_grad_batches=args.accumulate_grad_batch
    )

    trainer.fit(model, datamodule=dm)
    trainer.save_checkpoint(os.path.join(args.ckpt_save_path, f"{checkpoint_name}_final.ckpt"))

    print("Training is finished.")

if __name__ == "__main__":
    ALL_TRAINING_DATASETS = {
        ### NLI
        'mnli': {'task_type': 'nli', 'data_path': 'mnli.json'},
        'doc_nli': {'task_type': 'bin_nli', 'data_path': 'doc_nli.json'},
        'snli': {'task_type': 'nli', 'data_path': 'snli.json'},
        'anli_r1': {'task_type': 'nli', 'data_path': 'anli_r1.json'},
        'anli_r2': {'task_type': 'nli', 'data_path': 'anli_r2.json'},
        'anli_r3': {'task_type': 'nli', 'data_path': 'anli_r3.json'},

        ### fact checking
        'nli_fever': {'task_type': 'fact_checking', 'data_path': 'nli_fever.json'},
        'vitaminc': {'task_type': 'fact_checking', 'data_path': 'vitaminc.json'},

        ### paraphrase
        'paws': {'task_type': 'paraphrase', 'data_path': 'paws.json'},
        'paws_qqp': {'task_type': 'paraphrase', 'data_path': 'paws_qqp.json'},
        'paws_unlabeled': {'task_type': 'paraphrase', 'data_path': 'paws_unlabeled.json'},
        'qqp': {'task_type': 'paraphrase', 'data_path': 'qqp.json'},
        'wiki103': {'task_type': 'paraphrase', 'data_path': 'wiki103.json'},

        ### QA
        'squad_v2': {'task_type': 'qa', 'data_path': 'squad_v2_new.json'},
        'race': {'task_type': 'qa', 'data_path': 'race.json'},
        'adversarial_qa': {'task_type': 'qa', 'data_path': 'adversarial_qa.json'},
        'drop': {'task_type': 'qa', 'data_path': 'drop.json'},
        'hotpot_qa_distractor': {'task_type': 'qa', 'data_path': 'hotpot_qa_distractor.json'},
        'hotpot_qa_fullwiki': {'task_type': 'qa', 'data_path': 'hotpot_qa_fullwiki.json'},
        'newsqa': {'task_type': 'qa', 'data_path': 'newsqa.json'},
        'quoref': {'task_type': 'qa', 'data_path': 'quoref.json'},
        'ropes': {'task_type': 'qa', 'data_path': 'ropes.json'},
        'boolq': {'task_type': 'qa', 'data_path': 'boolq.json'},
        'eraser_multi_rc': {'task_type': 'qa', 'data_path': 'eraser_multi_rc.json'},
        'quail': {'task_type': 'qa', 'data_path': 'quail.json'},
        'sciq': {'task_type': 'qa', 'data_path': 'sciq.json'},
        'strategy_qa': {'task_type': 'qa', 'data_path': 'strategy_qa.json'},

        ### Coreference
        'gap': {'task_type': 'coreference', 'data_path': 'gap.json'},

        ### Summarization
        'wikihow': {'task_type': 'summarization', 'data_path': 'wikihow.json'},

        ### Information Retrieval
        'msmarco': {'task_type': 'ir', 'data_path': 'msmarco.json'},

        ### STS
        'stsb': {'task_type': 'sts', 'data_path': 'stsb.json'},
        'sick': {'task_type': 'sts', 'data_path': 'sick.json'},
    }

    parser = ArgumentParser()
    parser.add_argument('--seed', type=int, default=2022)
    parser.add_argument('--batch-size', type=int, default=32)
    parser.add_argument('--accumulate-grad-batch', type=int, default=1)
    parser.add_argument('--num-epoch', type=int, default=3)
    parser.add_argument('--num-workers', type=int, default=8)
    parser.add_argument('--warm-up-proportion', type=float, default=0.06)
    parser.add_argument('--adam-epsilon', type=float, default=1e-6)
    parser.add_argument('--weight-decay', type=float, default=0.1)
    parser.add_argument('--learning-rate', type=float, default=1e-5)
    parser.add_argument('--val-check-interval', type=float, default=1. / 4)
    parser.add_argument('--devices', nargs='+', type=int, required=True)
    parser.add_argument('--model-name', type=str, default="roberta-large")
    parser.add_argument('--ckpt-save-path', type=str, required=True)
    parser.add_argument('--ckpt-comment', type=str, default="")
    parser.add_argument('--training-datasets', nargs='+', type=str, default=list(ALL_TRAINING_DATASETS.keys()), choices=list(ALL_TRAINING_DATASETS.keys()))  # flag name fixed from '--trainin-datasets'
    parser.add_argument('--data-path', type=str, required=True)
    parser.add_argument('--max-samples-per-dataset', type=int, default=500000)
    # Caution: argparse's type=bool treats any non-empty string (including "False") as True.
    parser.add_argument('--do-mlm', type=bool, default=False)
    parser.add_argument('--use-pretrained-model', type=bool, default=True)

    args = parser.parse_args()

    seed_everything(args.seed)

    datasets = {
        name: {
            **ALL_TRAINING_DATASETS[name],
            "size": args.max_samples_per_dataset,
            "data_path": os.path.join(args.data_path, ALL_TRAINING_DATASETS[name]['data_path'])
        }
        for name in args.training_datasets
    }

    train(datasets, args)
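For reference, a training run might be launched along these lines, assuming the JSON files named in ALL_TRAINING_DATASETS sit under --data-path; the paths and device ids below are placeholders, and all other flags fall back to the defaults defined above.

    python train.py \
        --devices 0 1 \
        --model-name roberta-large \
        --data-path /path/to/training_data \
        --ckpt-save-path /path/to/checkpoints \
        --batch-size 32 \
        --accumulate-grad-batch 1 \
        --num-epoch 3

Checkpoints are written to --ckpt-save-path every 10000 steps and once more at the end of training as <checkpoint_name>_final.ckpt.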