---
language:
- en
pipeline_tag: text-classification
license: mit
---
# Model Summary
This is a fact-checking model from our work:
📃 [**MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents**](https://arxiv.org/pdf/2404.10774.pdf) ([GitHub Repo](https://github.com/Liyan06/MiniCheck))
The model is based on DeBERTa-v3-Large and predicts a binary label: 1 for supported and 0 for unsupported.
The model makes predictions at the *sentence level*. It takes a document and a sentence as input and determines
whether the sentence is supported by the document: **MiniCheck-Model(document, claim) -> {0, 1}**
MiniCheck-DeBERTa-v3-Large is fine-tuned from `microsoft/deberta-v3-large` ([He et al., 2023](https://arxiv.org/pdf/2111.09543.pdf))
on a combination of 35K training examples:
- 21K ANLI data ([Nie et al., 2020](https://aclanthology.org/2020.acl-main.441.pdf))
- 14K synthetic data generated from scratch in a structured way (more details in the paper).
### Model Variants
- [bespokelabs/Bespoke-MiniCheck-7B](https://huggingface.co/bespokelabs/Bespoke-MiniCheck-7B) (Model Size: 7B)
- [lytang/MiniCheck-Flan-T5-Large](https://huggingface.co/lytang/MiniCheck-Flan-T5-Large) (Model Size: 0.8B)
- [lytang/MiniCheck-RoBERTa-Large](https://huggingface.co/lytang/MiniCheck-RoBERTa-Large) (Model Size: 0.4B)
### Model Performance
<p align="center">
<img src="./performance_focused.png" width="550">
</p>
The performance of these models is evaluated on our newly collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact),
built from 11 recent human-annotated datasets on fact-checking and grounding LLM generations. MiniCheck-DeBERTa-v3-Large outperforms all
existing specialized fact-checkers of a similar scale. See full results in our work.
Note: We evaluated our models only on real claims, that is, claims without human intervention of
any kind, such as injecting certain error types into model-generated claims. Such edited claims do not reflect
LLMs' actual behavior.
# Model Usage Demo
Please first clone our [GitHub Repo](https://github.com/Liyan06/MiniCheck) and install the necessary packages from `requirements.txt`.
### Below is a simple use case
```python
from minicheck.minicheck import MiniCheck
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
doc = "A group of students gather in the school library to study for their upcoming final exams."
claim_1 = "The students are preparing for an examination."
claim_2 = "The students are on vacation."
# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large', 'Bespoke-MiniCheck-7B']
scorer = MiniCheck(model_name='deberta-v3-large', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])
print(pred_label) # [1, 0]
print(raw_prob) # [0.9786180257797241, 0.01138285268098116]
```
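
Because the model works at the sentence level, a multi-sentence response has to be broken into individual claims before scoring. The sketch below shows one way to do that. The helpers `split_into_sentences` and `build_pairs`, the regex-based splitter, and the "all sentences must be supported" aggregation rule are assumptions made for illustration and are not part of the MiniCheck package; a proper sentence tokenizer would likely be more robust.

```python
import re
from typing import List, Tuple


def split_into_sentences(text: str) -> List[str]:
    """Naive splitter (an assumption for illustration): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def build_pairs(doc: str, response: str) -> Tuple[List[str], List[str]]:
    """Pair every sentence of the response with the same grounding document."""
    claims = split_into_sentences(response)
    return [doc] * len(claims), claims


doc = "A group of students gather in the school library to study for their upcoming final exams."
response = "The students are preparing for an examination. The students are on vacation."

docs, claims = build_pairs(doc, response)
print(claims)

# With the scorer from the example above, the pairs could be scored like this:
# pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims)

# Hypothetical aggregation rule: the response counts as supported only if
# every sentence is supported. The labels below are the ones reported above.
pred_label = [1, 0]
response_supported = all(label == 1 for label in pred_label)
print(response_supported)  # False
```

Keeping `docs` and `claims` as parallel lists matches the `scorer.score(docs=..., claims=...)` call used in this card, so the output labels line up one-to-one with the sentences of the response.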
### Test on our [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact) Benchmark
```python
import pandas as pd
from datasets import load_dataset
from minicheck.minicheck import MiniCheck
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# load 29K test data
df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values
claims = df.claim.values
scorer = MiniCheck(model_name='deberta-v3-large', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims) # ~ 800 docs/min, depending on hardware
```
To evaluate the result on the benchmark:
```python
from sklearn.metrics import balanced_accuracy_score
df['preds'] = pred_label
result_df = pd.DataFrame(columns=['Dataset', 'BAcc'])
for dataset in df.dataset.unique():
    sub_df = df[df.dataset == dataset]
    bacc = balanced_accuracy_score(sub_df.label, sub_df.preds) * 100
    result_df.loc[len(result_df)] = [dataset, bacc]
result_df.loc[len(result_df)] = ['Average', result_df.BAcc.mean()]
result_df.round(1)
```
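
For readers unfamiliar with the metric, balanced accuracy is the average of the per-class recalls, so a classifier cannot score well just by favoring the majority label. The following is a minimal pure-Python sketch of that definition. The function name `balanced_accuracy` and the toy labels are made up for illustration and are not taken from the benchmark.

```python
from typing import List, Sequence


def balanced_accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Mean of per-class recall, computed over the classes present in `labels`."""
    recalls: List[float] = []
    for cls in sorted(set(labels)):
        # Positions whose gold label is `cls`
        idx = [i for i, y in enumerate(labels) if y == cls]
        correct = sum(1 for i in idx if preds[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy data (illustrative only): four supported claims, two unsupported ones.
labels = [1, 1, 1, 1, 0, 0]
preds = [1, 1, 1, 0, 0, 1]

# recall(class 1) = 3/4, recall(class 0) = 1/2, mean = 0.625
print(balanced_accuracy(labels, preds) * 100)  # 62.5
```

On this toy data plain accuracy would be 4/6, or about 66.7%, while balanced accuracy is 62.5% because the weaker recall on the minority class counts equally.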
# Citation
```
@misc{tang2024minicheck,
title={MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents},
author={Liyan Tang and Philippe Laban and Greg Durrett},
year={2024},
eprint={2404.10774},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```