---
license: mit
---
# AlignScore
This is the repository for AlignScore, a metric for automatic factual consistency evaluation of text pairs. The metric is introduced in

[AlignScore: Evaluating Factual Consistency with a Unified Alignment Function](https://arxiv.org/abs/2305.16739)

Yuheng Zha, Yichi Yang, Ruichen Li and Zhiting Hu

ACL 2023

What is factual consistency and its evaluation?
* **Factual Consistency:** A text pair (**a**, **b**) is considered factually consistent if 1) all the information in **b** is also present in **a**, and 2) **b** does not contradict **a**.
* **Evaluation:** Measure the degree of factual consistency between the context (text **a**) and the claim (text **b**).

Where is factual consistency evaluation applicable?
* **Summarization**: document and summary
* **Paraphrase**: sentence A and sentence B
* **Dialog**: context and response
* ...

# Leaderboard
We list the performance of AlignScore as well as other metrics here.

| Rank | Metrics | SummaC* | TRUE** | Other-Spearman | Average | Paper | Code |
| ---- | :--------------- | :-----: | :----: | :------------: | :-----: | :---: | :--: |
| 1 | AlignScore-large | 88.6 | 83.8 | 49.3 | 73.9 | - | - |
| 2 | AlignScore-base | 87.4 | 82.5 | 44.9 | 71.6 | - | - |
| 3 | QAFactEval | 83.8 | 79.4 | 42.4 | 68.5 | [:page\_facing\_up:(Fabbri et al. 2022)](https://arxiv.org/abs/2112.08542) | [:octocat:](https://github.com/salesforce/QAFactEval) |
| 4 | UniEval | 84.6 | 78.0 | 41.5 | 68.0 | [:page\_facing\_up:(Zhong et al. 2022)](https://arxiv.org/abs/2210.07197) | [:octocat:](https://github.com/maszhongming/UniEval) |
| 5 | SummaC-CONV | 81.0 | 78.7 | 34.2 | 64.6 | [:page\_facing\_up:(Laban et al. 2022)](https://arxiv.org/abs/2111.09525) | [:octocat:](https://github.com/tingofurro/summac) |
| 6 | BARTScore | 80.9 | 73.4 | 34.8 | 63.0 | [:page\_facing\_up:(Yuan et al. 2022)](https://arxiv.org/abs/2106.11520) | [:octocat:](https://github.com/neulab/BARTScore) |
| 7 | CTC | 81.2 | 72.4 | 35.3 | 63.0 | [:page\_facing\_up:(Deng et al. 2022)](https://arxiv.org/abs/2109.06379) | [:octocat:](https://github.com/tanyuqian/ctc-gen-eval) |
| 8 | SummaC-ZS | 79.0 | 78.2 | 30.4 | 62.5 | [:page\_facing\_up:(Laban et al. 2022)](https://arxiv.org/abs/2111.09525) | [:octocat:](https://github.com/tingofurro/summac) |
| 9 | ROUGE-2 | 78.1 | 72.4 | 27.9 | 59.5 | [:page\_facing\_up:(Lin 2004)](https://aclanthology.org/W04-1013/) | [:octocat:](https://github.com/pltrdy/rouge) |
| 10 | ROUGE-1 | 77.4 | 72.0 | 28.6 | 59.3 | [:page\_facing\_up:(Lin 2004)](https://aclanthology.org/W04-1013/) | [:octocat:](https://github.com/pltrdy/rouge) |
| 11 | ROUGE-L | 77.3 | 71.8 | 28.3 | 59.1 | [:page\_facing\_up:(Lin 2004)](https://aclanthology.org/W04-1013/) | [:octocat:](https://github.com/pltrdy/rouge) |
| 12 | QuestEval | 72.5 | 71.4 | 25.0 | 56.3 | [:page\_facing\_up:(Scialom et al. 2021)](https://arxiv.org/abs/2103.12693) | [:octocat:](https://github.com/ThomasScialom/QuestEval) |
| 13 | BLEU | 76.3 | 67.3 | 24.6 | 56.1 | [:page\_facing\_up:(Papineni et al. 2002)](https://aclanthology.org/P02-1040/) | [:octocat:](https://www.nltk.org/_modules/nltk/translate/bleu_score.html) |
| 14 | DAE | 66.8 | 65.7 | 35.1 | 55.8 | [:page\_facing\_up:(Goyal and Durrett 2020)](https://aclanthology.org/2020.findings-emnlp.322/) | [:octocat:](https://github.com/tagoyal/dae-factuality) |
| 15 | BLEURT | 69.2 | 71.9 | 24.9 | 55.4 | [:page\_facing\_up:(Sellam et al. 2020)](https://arxiv.org/abs/2004.04696) | [:octocat:](https://github.com/google-research/bleurt) |
| 16 | BERTScore | 72.1 | 68.6 | 21.9 | 54.2 | [:page\_facing\_up:(Zhang et al. 2020)](https://arxiv.org/abs/1904.09675) | [:octocat:](https://github.com/Tiiiger/bert_score) |
| 17 | SimCSE | 67.4 | 70.3 | 23.8 | 53.8 | [:page\_facing\_up:(Gao et al. 2021)](https://arxiv.org/abs/2104.08821) | [:octocat:](https://github.com/princeton-nlp/SimCSE) |
| 18 | FactCC | 68.8 | 62.7 | 21.2 | 50.9 | [:page\_facing\_up:(Kryscinski et al. 2020)](https://arxiv.org/abs/1910.12840) | [:octocat:](https://github.com/salesforce/factCC) |
| 19 | BLANC | 65.1 | 64.0 | 14.4 | 47.8 | [:page\_facing\_up:(Vasilyev et al. 2020)](https://arxiv.org/abs/2002.09836) | [:octocat:](https://github.com/PrimerAI/blanc) |
| 20 | NER-Overlap | 60.4 | 59.3 | 18.9 | 46.2 | [:page\_facing\_up:(Laban et al. 2022)](https://arxiv.org/abs/2111.09525) | [:octocat:](https://github.com/tingofurro/summac) |
| 21 | MNLI | 47.9 | 60.4 | 3.1 | 37.2 | [:page\_facing\_up:(Williams et al. 2018)](https://arxiv.org/abs/1704.05426) | [:octocat:](https://github.com/nyu-mll/multiNLI) |
| 22 | FEQA | 48.3 | 52.2 | -1.9 | 32.9 | [:page\_facing\_up:(Durmus et al. 2020)](https://arxiv.org/abs/2005.03754) | [:octocat:](https://github.com/esdurmus/feqa) |

\* SummaC: [\[Paper\]](https://arxiv.org/abs/2111.09525) \| [\[Github\]](https://github.com/tingofurro/summac)

\*\* TRUE: [\[Paper\]](https://arxiv.org/abs/2204.04991) \| [\[Github\]](https://github.com/google-research/true)

# Installation

Our models are trained and evaluated using PyTorch 1.12.1. We recommend using this version to reproduce the results.

1. Please first install the right version of PyTorch before installing `alignscore`.
2. Install `alignscore` by cloning this repository and running `pip install .`.
3. After installing `alignscore`, run `python -m spacy download en_core_web_sm` to install the required spaCy model (we use spaCy for sentence splitting), as sketched below.

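Put together, the three installation steps amount to roughly the following (a sketch only; the repository URL is a placeholder, and the PyTorch 1.12.1 install command should match your CUDA/CPU setup):

```bash
# 1. Install PyTorch 1.12.1 first (choose the build matching your CUDA/CPU setup)
pip install torch==1.12.1

# 2. Install alignscore from a local clone of this repository
git clone <this-repository-url> AlignScore
cd AlignScore
pip install .

# 3. Download the spaCy model used for sentence splitting
python -m spacy download en_core_web_sm
```
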
# Evaluating Factual Consistency
To evaluate the factual consistency of the `claim` w.r.t. the `context`, simply use the `score` method of `AlignScore`.
```python
from alignscore import AlignScore

scorer = AlignScore(model='roberta-base', batch_size=32, device='cuda:0', ckpt_path='/path/to/checkpoint', evaluation_mode='nli_sp')
score = scorer.score(contexts=['hello world'], claims=['hello world'])
```
`model`: the backbone model of the metric. Currently, we only provide metrics trained on RoBERTa.

`batch_size`: the batch size used during inference.

`device`: the device on which to run the metric.

`ckpt_path`: the path to the checkpoint.

`evaluation_mode`: one of `'nli_sp'`, `'nli'`, `'bin_sp'`, `'bin'`. `nli` and `bin` refer to the 3-way and binary classification heads, respectively, and `sp` indicates whether the chunk-sentence splitting method is used. `nli_sp` is the default setting of AlignScore.

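As a quick sanity check, a claim that contradicts the context should receive a noticeably lower score than a faithful one. Below is a minimal sketch; the checkpoint path is a placeholder for one of the released checkpoints (see the Checkpoints section below), and the sentences are made-up examples:

```python
from alignscore import AlignScore

# Placeholder path: point this at a downloaded AlignScore checkpoint.
scorer = AlignScore(model='roberta-base', batch_size=32, device='cuda:0',
                    ckpt_path='/path/to/AlignScore-base.ckpt', evaluation_mode='nli_sp')

context = ['The cat sat quietly on the mat while the children played outside.']
faithful_claim = ['A cat was sitting on the mat.']
contradicted_claim = ['The cat was running around the garden.']

# score() returns one factual consistency score per (context, claim) pair;
# the faithful claim should score higher than the contradicted one.
print(scorer.score(contexts=context, claims=faithful_claim))
print(scorer.score(contexts=context, claims=contradicted_claim))
```
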
# Checkpoints
We provide two versions of the AlignScore checkpoints: `AlignScore-base` and `AlignScore-large`. The `-base` model is based on RoBERTa-base and has 125M parameters. The `-large` model is based on RoBERTa-large and has 355M parameters.

**AlignScore-base**:
https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-base.ckpt

**AlignScore-large**:
https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-large.ckpt

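The checkpoints can also be fetched programmatically, for example with `huggingface_hub` (a sketch assuming the repository id `yzha/AlignScore` implied by the URLs above):

```python
from huggingface_hub import hf_hub_download
from alignscore import AlignScore

# Download AlignScore-base from the Hugging Face Hub and get its local path.
ckpt_path = hf_hub_download(repo_id='yzha/AlignScore', filename='AlignScore-base.ckpt')

scorer = AlignScore(model='roberta-base', batch_size=32, device='cuda:0',
                    ckpt_path=ckpt_path, evaluation_mode='nli_sp')
```
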
# Training
You can use the above checkpoints directly for factual consistency evaluation. However, if you wish to train an alignment model from scratch or on your own data, use `train.py`.
```bash
python train.py --seed 2022 --batch-size 32 \
    --num-epoch 3 --devices 0 1 2 3 \
    --model-name roberta-large --ckpt-save-path ./ckpt/ \
    --data-path ./data/training_sets/ \
    --max-samples-per-dataset 500000
```

`--seed`: the random seed for initialization

`--batch-size`: the batch size for training

`--num-epoch`: the number of training epochs

`--devices`: the devices to train on, given as a list of GPU ids

`--model-name`: the backbone model name of the metric, default RoBERTa-large

`--ckpt-save-path`: the path to save the checkpoint

`--training-datasets`: the names of the training datasets

`--data-path`: the path to the training datasets

`--max-samples-per-dataset`: the maximum number of samples drawn from each dataset

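Once training finishes, the checkpoint saved under `--ckpt-save-path` can be used with the scorer exactly like the released checkpoints (a sketch; the checkpoint filename is a placeholder that depends on your run):

```python
from alignscore import AlignScore

# Placeholder filename: use the checkpoint that train.py wrote to ./ckpt/.
scorer = AlignScore(model='roberta-large', batch_size=32, device='cuda:0',
                    ckpt_path='./ckpt/<your-checkpoint>.ckpt', evaluation_mode='nli_sp')
```
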
# Benchmarking
Our benchmark includes the TRUE and SummaC benchmarks as well as several popular factual consistency evaluation datasets.

To run the benchmark, a few additional dependencies are required; they can be installed with `pip install -r requirements.txt`.
Additionally, some dependencies are not available as packages and need to be downloaded manually (please see `python benchmark.py --help` for instructions).

Note that installing `summac` may cause dependency conflicts with `alignscore`. Please reinstall `alignscore` to force the correct dependency versions.

The relevant arguments for evaluating AlignScore are listed below (an example invocation follows the list):

`--alignscore`: evaluate the AlignScore metric

`--alignscore-model`: the name of the backbone model (either `roberta-base` or `roberta-large`)

`--alignscore-ckpt`: the path to the saved checkpoint

`--alignscore-eval-mode`: the evaluation mode, defaults to `nli_sp`

`--device`: which device to run the metric on, defaults to `cuda:0`

`--tasks`: which tasks to benchmark, e.g., SummEval, QAGS-CNNDM, ...

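Putting these arguments together, a benchmark run might look like the following (a sketch only; the checkpoint path is a placeholder, and the exact task names and flag syntax should be checked with `python benchmark.py --help`):

```bash
python benchmark.py --alignscore \
    --alignscore-model roberta-large \
    --alignscore-ckpt ./ckpt/AlignScore-large.ckpt \
    --alignscore-eval-mode nli_sp \
    --device cuda:0 \
    --tasks SummEval
```
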
For the baselines, please see `python benchmark.py --help` for details.