---
license: mit
---
# AlignScore
This is the repository for AlignScore and its checkpoints. AlignScore is a metric for automatic factual consistency evaluation of text pairs, introduced in

[AlignScore: Evaluating Factual Consistency with a Unified Alignment Function](https://arxiv.org/abs/2305.16739)

Yuheng Zha, Yichi Yang, Ruichen Li and Zhiting Hu

ACL 2023

Code is at https://github.com/yuh-zha/AlignScore

What is factual consistency and how is it evaluated?
* **Factual consistency:** A text pair (**a**, **b**) is considered factually consistent if 1) all the information in **b** is also present in **a**, and 2) **b** does not contradict **a**.
* **Evaluation:** Measure the degree of factual consistency between the context (text **a**) and the claim (text **b**).

Where is factual consistency evaluation applicable?
* **Summarization**: document and summary
* **Paraphrase**: sentence A and sentence B
* **Dialog**: context and response
* ...

# Leaderboard
We list the performance of AlignScore as well as other metrics here. 

| Rank | Metrics          | SummaC* | TRUE** | Other-Spearman | Average | Paper | Code |
| ---- | :--------------- | :-----: | :----: | :------------: | :-----: | :---: | :--: |
| 1    | AlignScore-large |  88.6   |  83.8  |      49.3      |  73.9   |   -   |   -  |
| 2    | AlignScore-base  |  87.4   |  82.5  |      44.9      |  71.6   |   -   |   -  |
| 3    | QAFactEval       |  83.8   |  79.4  |      42.4      |  68.5   | [:page\_facing\_up:(Fabbri et al. 2022)](https://arxiv.org/abs/2112.08542) | [:octocat:](https://github.com/salesforce/QAFactEval) |
| 4    | UniEval          |  84.6   |  78.0  |      41.5      |  68.0   | [:page\_facing\_up:(Zhong et al. 2022)](https://arxiv.org/abs/2210.07197) | [:octocat:](https://github.com/maszhongming/UniEval) |
| 5    | SummaC-CONV      |  81.0   |  78.7  |      34.2      |  64.6   | [:page\_facing\_up:(Laban et al. 2022)](https://arxiv.org/abs/2111.09525) | [:octocat:](https://github.com/tingofurro/summac) |
| 6    | BARTScore        |  80.9   |  73.4  |      34.8      |  63.0   | [:page\_facing\_up:(Yuan et al. 2022)](https://arxiv.org/abs/2106.11520) | [:octocat:](https://github.com/neulab/BARTScore) |
| 7    | CTC              |  81.2   |  72.4  |      35.3      |  63.0   | [:page\_facing\_up:(Deng et al. 2022)](https://arxiv.org/abs/2109.06379) | [:octocat:](https://github.com/tanyuqian/ctc-gen-eval) |
| 8    | SummaC-ZS        |  79.0   |  78.2  |      30.4      |  62.5   | [:page\_facing\_up:(Laban et al. 2022)](https://arxiv.org/abs/2111.09525) | [:octocat:](https://github.com/tingofurro/summac) |
| 9    | ROUGE-2          |  78.1   |  72.4  |      27.9      |  59.5   | [:page\_facing\_up:(Lin 2004)](https://aclanthology.org/W04-1013/) | [:octocat:](https://github.com/pltrdy/rouge) |
| 10   | ROUGE-1          |  77.4   |  72.0  |      28.6      |  59.3   | [:page\_facing\_up:(Lin 2004)](https://aclanthology.org/W04-1013/) | [:octocat:](https://github.com/pltrdy/rouge) |
| 11   | ROUGE-L          |  77.3   |  71.8  |      28.3      |  59.1   | [:page\_facing\_up:(Lin 2004)](https://aclanthology.org/W04-1013/) | [:octocat:](https://github.com/pltrdy/rouge) |
| 12   | QuestEval        |  72.5   |  71.4  |      25.0      |  56.3   | [:page\_facing\_up:(Scialom et al. 2021)](https://arxiv.org/abs/2103.12693) | [:octocat:](https://github.com/ThomasScialom/QuestEval) |
| 13   | BLEU             |  76.3   |  67.3  |      24.6      |  56.1   | [:page\_facing\_up:(Papineni et al. 2002)](https://aclanthology.org/P02-1040/) | [:octocat:](https://www.nltk.org/_modules/nltk/translate/bleu_score.html) |
| 14   | DAE              |  66.8   |  65.7  |      35.1      |  55.8   | [:page\_facing\_up:(Goyal and Durrett 2020)](https://aclanthology.org/2020.findings-emnlp.322/) | [:octocat:](https://github.com/tagoyal/dae-factuality) |
| 15   | BLEURT           |  69.2   |  71.9  |      24.9      |  55.4   | [:page\_facing\_up:(Sellam et al. 2020)](https://arxiv.org/abs/2004.04696) | [:octocat:](https://github.com/google-research/bleurt) |
| 16   | BERTScore        |  72.1   |  68.6  |      21.9      |  54.2   | [:page\_facing\_up:(Zhang et al. 2020)](https://arxiv.org/abs/1904.09675) | [:octocat:](https://github.com/Tiiiger/bert_score) |
| 17   | SimCSE           |  67.4   |  70.3  |      23.8      |  53.8   | [:page\_facing\_up:(Gao et al. 2021)](https://arxiv.org/abs/2104.08821) | [:octocat:](https://github.com/princeton-nlp/SimCSE) |
| 18   | FactCC           |  68.8   |  62.7  |      21.2      |  50.9   | [:page\_facing\_up:(Kryscinski et al. 2020)](https://arxiv.org/abs/1910.12840) | [:octocat:](https://github.com/salesforce/factCC) |
| 19   | BLANC            |  65.1   |  64.0  |      14.4      |  47.8   | [:page\_facing\_up:(Vasilyev et al. 2020)](https://arxiv.org/abs/2002.09836) | [:octocat:](https://github.com/PrimerAI/blanc) |
| 20   | NER-Overlap      |  60.4   |  59.3  |      18.9      |  46.2   | [:page\_facing\_up:(Laban et al. 2022)](https://arxiv.org/abs/2111.09525) | [:octocat:](https://github.com/tingofurro/summac) |
| 21   | MNLI             |  47.9   |  60.4  |      3.1       |  37.2   | [:page\_facing\_up:(Williams et al. 2018)](https://arxiv.org/abs/1704.05426) | [:octocat:](https://github.com/nyu-mll/multiNLI) |
| 22   | FEQA             |  48.3   |  52.2  |      -1.9      |  32.9   | [:page\_facing\_up:(Durmus et al. 2020)](https://arxiv.org/abs/2005.03754) | [:octocat:](https://github.com/esdurmus/feqa) |

\*  SummaC: [\[Paper\]](https://arxiv.org/abs/2111.09525) \| [\[Github\]](https://github.com/tingofurro/summac)

\*\* TRUE: [\[Paper\]](https://arxiv.org/abs/2204.04991) \| [\[Github\]](https://github.com/google-research/true)


# Installation

Our models are trained and evaluated using PyTorch 1.12.1. We recommend using this version to reproduce the results.

1. Please first install the right version of PyTorch before installing `alignscore`.
2. You can install `alignscore` by cloning this repository and running `pip install .` (the full command sequence is shown below).
3. After installing `alignscore`, please run `python -m spacy download en_core_web_sm` to install the required spaCy model (we use `spaCy` for sentence splitting).
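
Putting these steps together (assuming a compatible PyTorch build is already installed), a typical setup looks like:

```bash
# Clone the code repository and install the alignscore package.
git clone https://github.com/yuh-zha/AlignScore
cd AlignScore
pip install .

# Download the spaCy model used for sentence splitting.
python -m spacy download en_core_web_sm
```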

# Evaluating Factual Consistency
To evaluate the factual consistency of a `claim` with respect to its `context`, simply use the `score` method of `AlignScore`.
```python
from alignscore import AlignScore

scorer = AlignScore(model='roberta-base', batch_size=32, device='cuda:0', ckpt_path='/path/to/checkpoint', evaluation_mode='nli_sp')
score = scorer.score(contexts=['hello world'], claims=['hello world'])
```
`model`: the backbone model of the metric. Currently, we only provide the metric trained on RoBERTa (`roberta-base` or `roberta-large`).

`batch_size`: the batch size of the inference

`device`: which device to run the metric on

`ckpt_path`: the path to the checkpoint

`evaluation_mode`: choose from `'nli_sp', 'nli', 'bin_sp', 'bin'`. `nli` and `bin` refer to the 3-way and binary classification heads, respectively. `sp` indicates whether the chunk-sentence splitting method is used. `nli_sp` is the default setting of AlignScore.
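
For example, several (context, claim) pairs can be scored in one call. The sketch below uses placeholder texts and a placeholder checkpoint path; `score` returns one consistency score per pair, with higher values indicating stronger support of the claim by the context.

```python
from alignscore import AlignScore

# Placeholder (context, claim) pairs; replace with your own data.
contexts = [
    "The cat sat on the mat. It was a sunny afternoon.",
    "The cat sat on the mat. It was a sunny afternoon.",
]
claims = [
    "A cat was sitting on a mat.",           # supported by the context
    "The dog slept on the sofa all night.",  # not supported by the context
]

scorer = AlignScore(
    model='roberta-base',
    batch_size=32,
    device='cuda:0',
    ckpt_path='/path/to/AlignScore-base.ckpt',  # placeholder path
    evaluation_mode='nli_sp',                   # default: 3-way head + chunk-sentence splitting
)

scores = scorer.score(contexts=contexts, claims=claims)
for claim, s in zip(claims, scores):
    print(f"{s:.3f}  {claim}")
```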


# Checkpoints
We provide two versions of the AlignScore checkpoints: `AlignScore-base` and `AlignScore-large`. The `-base` model is based on RoBERTa-base and has 125M parameters. The `-large` model is based on RoBERTa-large and has 355M parameters. 

**AlignScore-base**: 
https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-base.ckpt

**AlignScore-large**:
https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-large.ckpt
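
To fetch a checkpoint programmatically, one option (a sketch, assuming the `huggingface_hub` package is installed) is `hf_hub_download`, which caches the file locally and returns a path you can pass as `ckpt_path`:

```python
from huggingface_hub import hf_hub_download
from alignscore import AlignScore

# Download AlignScore-base from this model repository into the local
# Hugging Face cache and get the path to the .ckpt file.
ckpt_path = hf_hub_download(repo_id="yzha/AlignScore", filename="AlignScore-base.ckpt")

scorer = AlignScore(
    model='roberta-base',
    batch_size=32,
    device='cuda:0',
    ckpt_path=ckpt_path,
    evaluation_mode='nli_sp',
)
```

Alternatively, download the `.ckpt` files directly from the links above and pass the local path as `ckpt_path`.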

# Training  
You can use the above checkpoints directly for factual consistency evaluation. However, if you wish to train an alignment model from scratch / on your own data, use `train.py`.
```bash
python train.py --seed 2022 --batch-size 32 \
    --num-epoch 3 --devices 0 1 2 3 \
    --model-name roberta-large --ckpt-save-path ./ckpt/ \
    --data-path ./data/training_sets/ \
    --max-samples-per-dataset 500000
```

`--seed`: the random seed for initialization

`--batch-size`: the batch size for training

`--num-epoch`: training epochs

`--devices`: which devices to train the metric on, given as a list of GPU ids

`--model-name`: the backbone model name of the metric, defaults to `roberta-large`

`--ckpt-save-path`: the path to save the checkpoint

`--training-datasets`: the names of the training datasets

`--data-path`: the path to the training datasets

`--max-samples-per-dataset`: the maximum number of samples from a dataset

# Benchmarking
Our benchmark includes the TRUE and SummaC benchmarks as well as several other popular factual consistency evaluation datasets.

To run the benchmark, a few additional dependencies are required and can be installed with `pip install -r requirements.txt`.
Additionally, some dependencies are not available as packages and need to be downloaded manually (please see `python benchmark.py --help` for instructions).

Note that installing `summac` may cause dependency conflicts with `alignscore`. Please reinstall `alignscore` afterwards to force the correct dependency versions.

The relevant arguments for evaluating AlignScore are:

`--alignscore`: evaluate the AlignScore metric

`--alignscore-model`: the name of the backbone model (either 'roberta-base' or 'roberta-large')

`--alignscore-ckpt`: the path to the saved checkpoint

`--alignscore-eval-mode`: the evaluation mode, defaults to `nli_sp`

`--device`: which device to run the metric on, defaults to `cuda:0`

`--tasks`: which tasks to benchmark, e.g., SummEval, QAGS-CNNDM, ...

For the baselines, please see `python benchmark.py --help` for details.
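
Putting these arguments together, a benchmark run for AlignScore-large might look like the following sketch (the checkpoint path and task name are illustrative; check `python benchmark.py --help` for the exact task names and accepted formats):

```bash
python benchmark.py --alignscore \
    --alignscore-model roberta-large \
    --alignscore-ckpt ./ckpt/AlignScore-large.ckpt \
    --alignscore-eval-mode nli_sp \
    --device cuda:0 \
    --tasks SummEval
```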