File size: 4,083 Bytes
0c16056
 
 
 
fe8ccbe
0c16056
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cfa7c55
 
 
 
 
0c16056
 
4b9316c
 
0c16056
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
---
language:
- en
pipeline_tag: text-classification
license: mit
---
# Model Summary

This is a fact-checking model from our work:

📃 [**MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents**](https://arxiv.org/pdf/2404.10774.pdf) ([GitHub Repo](https://github.com/Liyan06/MiniCheck))

The model is based on RoBERTA-Large that predicts a binary label - 1 for supported and 0 for unsupported. 
The model is doing predictions on the *sentence-level*. It takes as input a document and a sentence and determine 
whether the sentence is supported by the document: **MiniCheck-Model(document, claim) -> {0, 1}**


MiniCheck-RoBERTa-Large is fine tuned from the trained RoBERTA-Large model from AlignScore ([Zha et al., 2023](https://aclanthology.org/2023.acl-long.634.pdf)) 
on 14K synthetic data generated from scratch in a structed way (more details in the paper).


### Model Variants
We also have other two MiniCheck model variants:
- [lytang/MiniCheck-Flan-T5-Large](https://huggingface.co/lytang/MiniCheck-Flan-T5-Large)
- [lytang/MiniCheck-DeBERTa-v3-Large](https://huggingface.co/lytang/MiniCheck-DeBERTa-v3-Large)


### Model Performance

<p align="center">
    <img src="./cost-vs-bacc.png" width="360">
</p>

The performance of these models is evaluated on our new collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact), 
from 10 recent human annotated datasets on fact-checking and grounding LLM generations. MiniCheck-RoBERTa-Large outperform all
exisiting specialized fact-checkers with a similar scale by a large margin but is 2% worse than our best model MiniCheck-Flan-T5-Large, which
is on par with GPT-4 but 400x cheaper. See full results in our work.

Note: We only evaluated the performance of our models on real claims -- without any human intervention in 
any format, such as injecting certain error types into model-generated claims. Those edited claims do not reflect
LLMs' actual behaviors.


# Model Usage Demo

Please first clone our [GitHub Repo](https://github.com/Liyan06/MiniCheck) and install necessary packages from `requirements.txt`.


### Below is a simple use case

```python
from minicheck.minicheck import MiniCheck
doc = "A group of students gather in the school library to study for their upcoming final exams."
claim_1 = "The students are preparing for an examination."
claim_2 = "The students are on vacation."

# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large']
scorer = MiniCheck(model_name='roberta-large', device=f'cuda:0', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])
print(pred_label) # [1, 0]
print(raw_prob)   # [0.9581979513168335, 0.031335990875959396]
```

### Test on our [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact) Benchmark

```python
import pandas as pd
from datasets import load_dataset
from minicheck.minicheck import MiniCheck

# load 13K test data
df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values
claims = df.claim.values

scorer = MiniCheck(model_name='roberta-large', device=f'cuda:0', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims)  # ~ 15 mins, depending on hardware
```

To evalaute the result on the benchmark
```python
from sklearn.metrics import balanced_accuracy_score
df['preds'] = pred_label
result_df = pd.DataFrame(columns=['Dataset', 'BAcc'])
for dataset in df.dataset.unique():
    sub_df = df[df.dataset == dataset]
    bacc = balanced_accuracy_score(sub_df.label, sub_df.preds) * 100
    result_df.loc[len(result_df)] = [dataset, bacc]
result_df.loc[len(result_df)] = ['Average', result_df.BAcc.mean()]
result_df.round(1)
```

# Citation

```
@misc{tang2024minicheck,
      title={MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents}, 
      author={Liyan Tang and Philippe Laban and Greg Durrett},
      year={2024},
      eprint={2404.10774},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```