
The reference-based LLM hallucination detector

The LLM hallucination detector, based on a self-adaptive hierarchical XLM-RoBERTa-XL, was developed for participation in SemEval-2024 Task 6 - SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes (model-agnostic track).

Model description

This model was a component of my solution for SemEval-2024 Task 6 - SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. The goal of the competition was to develop the best algorithm for detecting LLM hallucinations, i.e. grammatically sound output that contains incorrect semantic information (unsupported by or inconsistent with the source input). The competition organizers prepared two different setups:

  • the model-aware track, where the developed detector can have access to the LLM that produced the output;

  • the model-agnostic track, where the developed detector has no access to the verified LLM and uses only this LLM's source input and generated output.

This model was designed to detect hallucinations in the model-agnostic track. It is based on four simple ideas:

  1. The detector is a Transformer encoder based on XLM-RoBERTa-XL. Hallucination detection is a binary classification problem, and we don't need any decoder such as SelfCheckGPT to solve it. Make encoders great again!

  2. Prompt engineering matters for encoders too (not only for decoders). A specially designed text prompt is a good inductive bias for a text classifier.

  3. The text classifier needs a self-adaptive hierarchy of the encoder's hidden layers. The classifier does not have to be built on top of the last hidden layer: perhaps one of the earlier hidden layers would be more useful. We don't know this in advance, so we use a special gating network to automatically estimate the importance of the encoder's various hidden layers during training (see the first sketch after this list).

  4. A two-stage fine-tuning is all you need. At the first stage, we fine-tune our self-adaptive hierarchical encoder as a sentence embedder using contrastive learning. At the second stage, we fine-tune this model as a usual classifier, starting from the embedder's checkpoint (see the second sketch after this list). This approach was proposed in the paper "Contrastive fine-tuning to improve generalization in deep NER" (DOI: 10.28995/2075-7182-2022-21-70-80), but it works for other NLU tasks too.
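
As a first sketch, here is one way to implement the gating idea from point 3: a learned softmax mixture over all of the encoder's hidden layers, in the spirit of ELMo's scalar mix. The class name GatedLayerPooler and all details below are illustrative assumptions, not the exact implementation of this model.

from typing import Tuple

import torch
from transformers import AutoModel, AutoTokenizer


class GatedLayerPooler(torch.nn.Module):
    """ A softmax-gated weighted sum over all hidden layers of an encoder. """

    def __init__(self, num_layers: int):
        super().__init__()
        # one learnable importance score per hidden layer
        self.gates = torch.nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: Tuple[torch.Tensor, ...]) -> torch.Tensor:
        # hidden_states: one (batch, seq_len, dim) tensor per layer
        stacked = torch.stack(hidden_states, dim=0)  # (layers, batch, seq_len, dim)
        weights = torch.softmax(self.gates, dim=0)   # layer importances sum to 1
        return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)


tokenizer = AutoTokenizer.from_pretrained('facebook/xlm-roberta-xl')
encoder = AutoModel.from_pretrained('facebook/xlm-roberta-xl')
pooler = GatedLayerPooler(encoder.config.num_hidden_layers + 1)  # +1: embedding layer

batch = tokenizer('A toy input.', return_tensors='pt')
outputs = encoder(**batch, output_hidden_states=True)
mixed = pooler(outputs.hidden_states)  # self-adaptive mixture of all hidden layers
cls_vector = mixed[:, 0]  # e.g. the input of a classification head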

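The second sketch shows the two-stage recipe from point 4, with the contrastive stage built on the NTXentLoss from pytorch-metric-learning. The helper embed_prompts, the optimizer handling, and the loss temperature are assumptions for illustration, not the actual training code.

import torch
from pytorch_metric_learning import losses

# Stage 1: contrastive fine-tuning of the encoder as a sentence embedder.
# `embed_prompts` is a hypothetical helper that runs the encoder with the gated
# pooling above and returns one embedding vector per input prompt.
contrastive_loss = losses.NTXentLoss(temperature=0.1)  # temperature is an assumption


def stage_one_step(embed_prompts, optimizer, prompts, labels) -> float:
    embeddings = embed_prompts(prompts)  # (batch, dim)
    # pulls same-class prompts together and pushes different classes apart
    loss = contrastive_loss(embeddings, torch.as_tensor(labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Stage 2: start from the stage-1 checkpoint and train a usual classifier.
classifier_head = torch.nn.Linear(2560, 2)  # 2560 is the XLM-RoBERTa-XL hidden size
cross_entropy = torch.nn.CrossEntropyLoss()


def stage_two_step(embed_prompts, optimizer, prompts, labels) -> float:
    logits = classifier_head(embed_prompts(prompts))
    loss = cross_entropy(logits, torch.as_tensor(labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
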
Intended uses & limitations

This model is primarily intended as a reference-based detector of hallucinations in LLMs, without any additional information about the LLM's type and architecture (i.e. in the model-agnostic mode). Reference-based detection means that the hallucination detector considers not only the human question and the answer generated by the verified LLM, but also the reference answer to the human question. Therefore, in a situation where the reference answer is not known, this hallucination detector is not applicable. But in some cases (for example, when we analyze an LLM's responses on an annotated test set and want to separate hallucinations from usual errors such as undergeneration, part-of-speech errors, and so on), the reference answers are known, and then the proposed detector is extremely useful.

This model is capable of detecting LLM hallucinations that occur when solving the following NLG tasks: paraphrase generation, machine translation, and definition modeling.

Evaluation

The final ranking of all model-agnostic solutions on the test data is available in the ranking agnostic CSV file on the SHROOM web page. The accuracy of my solution is 0.77, which ranks 28th out of 49. The above-mentioned model is a component of my solution, and its accuracy as an independent algorithm is 0.7153. For comparison, the accuracy of the baseline system based on SelfCheckGPT is 0.6967.

Usage

You need to install the pytorch-metric-learning library to use this model:
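
pip install pytorch-metric-learning

After that, you can use this model directly with a pipeline for text classification: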

from typing import Dict

from transformers import pipeline
import torch


def sample_to_str(sample: Dict[str, str]) -> str:
    """ It converts a datapoint to an input text for an encoder-based classifier (like as RoBERTa).
    :param sample: the datapoint
    :return: the input text for the classifier (i.e. the LLM hallucination detector).
    """
    possible_tasks = {
        'PG',  # paraphrase generation
        'MT',  # machine translation
        'DM',  # definition modeling
    }
    checked_llm_prediction = ' '.join(sample['hyp'].strip().split())
    llm_task = sample['task']
    if llm_task not in possible_tasks:
        raise ValueError(f'The task {llm_task} is not supported!')
    if llm_task == 'PG':
        context = ' '.join(sample['src'].strip().split())
        united_prompt = 'The verified system\'s task is a paraphrase generation.'
    else:
        context = ' '.join(sample['tgt'].strip().split())
        if llm_task == 'MT':
            united_prompt = 'The verified system\'s task is a machine translation.'
        else:
            united_prompt = 'The verified system\'s task is a definition modeling.'
    united_prompt += ' The sentence generated by the verified system: '
    united_prompt += checked_llm_prediction
    if united_prompt[-1].isalnum():
        united_prompt += '.'
    united_prompt += f' The generation context: {context}'
    if united_prompt[-1].isalnum():
        united_prompt += '.'
    return united_prompt


# The input data format is based on data for the model-agnostic track of SHROOM
# https://helsinki-nlp.github.io/shroom
# "src" is a verified LLM's input to start generation
# "hyp" is an output generated by this LLM
# "tgt" is a reference output from the point of view of human assessors
input_data = [
    {
        "hyp": "Resembling or characteristic of a weasel.",
        "ref": "tgt",
        "src": "The writer had just entered into his eighteenth year , when he met at the table of a certain Anglo - Germanist an individual , apparently somewhat under thirty , of middle stature , a thin and <define> weaselly </define> figure , a sallow complexion , a certain obliquity of vision , and a large pair of spectacles .",
        "tgt": "Resembling a weasel (in appearance).",
        "model": "",
        "task": "DM",
        "labels": [
            "Hallucination",
            "Not Hallucination",
            "Not Hallucination",
            "Not Hallucination",
            "Not Hallucination"
        ],
        "label": "Not Hallucination",
        "p(Hallucination)": 0.2
    },
    {
        "hyp": "I thought you'd be surprised at me too.",
        "ref": "either",
        "src": "I thought so, too.",
        "tgt": "That was my general impression as well.",
        "model": "",
        "task": "PG",
        "labels": [
            "Hallucination",
            "Hallucination",
            "Hallucination",
            "Hallucination",
            "Hallucination"
        ],
        "label": "Hallucination",
        "p(Hallucination)": 1.0
    },
    {
        "hyp": "You can go with me perfectly.",
        "ref": "either",
        "src": "Ты вполне можешь пойти со мной.",
        "tgt": "You may as well come with me.",
        "model": "",
        "task": "MT",
        "labels": [
            "Not Hallucination",
            "Hallucination",
            "Hallucination",
            "Not Hallucination",
            "Hallucination"
        ],
        "label": "Hallucination",
        "p(Hallucination)": 0.6
    }
]

hallucination_detector = pipeline(
    task='text-classification',
    model='bond005/xlm-roberta-xl-hallucination-detector',
    framework='pt', trust_remote_code=True, device='cuda', torch_dtype=torch.float16
)

for sample in input_data:
    input_prompt = sample_to_str(sample)
    print('')
    print('==========')
    print(f' Task: {sample["task"]}')
    print(' Question for detector:')
    print(input_prompt)
    print('==========')
    print('TRUE')
    print(f'    label:            {sample["label"]}')
    print(f'    p(Hallucination): {round(sample["p(Hallucination)"], 3)}')
    prediction = hallucination_detector(input_prompt)[0]
    predicted_label = prediction['label']
    if predicted_label == 'Hallucination':
        hallucination_probability = prediction['score']
    else:
        hallucination_probability = 1.0 - prediction['score']
    print('PREDICTED')
    print(f'    label:            {predicted_label}')
    print(f'    p(Hallucination): {round(hallucination_probability, 3)}')
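
The script produces the following output: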

==========
 Task: DM
 Question for detector:
The verified system's task is a definition modeling. The sentence generated by the verified system: Resembling or characteristic of a weasel. The generation context: Resembling a weasel (in appearance).
==========
TRUE
    label:            Not Hallucination
    p(Hallucination): 0.2
PREDICTED
    label:            Not Hallucination
    p(Hallucination): 0.297

==========
 Task: PG
 Question for detector:
The verified system's task is a paraphrase generation. The sentence generated by the verified system: I thought you'd be surprised at me too. The generation context: I thought so, too.
==========
TRUE
    label:            Hallucination
    p(Hallucination): 1.0
PREDICTED
    label:            Hallucination
    p(Hallucination): 0.563

==========
 Task: MT
 Question for detector:
The verified system's task is a machine translation. The sentence generated by the verified system: You can go with me perfectly. The generation context: You may as well come with me.
==========
TRUE
    label:            Hallucination
    p(Hallucination): 0.6
PREDICTED
    label:            Not Hallucination
    p(Hallucination): 0.487

The Google Colaboratory version of this script is available too.

Citation

If you want to cite this model, you can use the following BibTeX entry:

@misc{bondarenko2024hallucination,
  title={The reference-based detector of LLM hallucinations by Ivan Bondarenko},
  author={Bondarenko, Ivan},
  publisher={Hugging Face},
  journal={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/bond005/xlm-roberta-xl-hallucination-detector}},
  year={2024}
}