# SHROOM validation set
This archive contains the validation data for SHROOM, Task 6 at SemEval 2024 (Shared-task on Hallucinations and Observable Overgeneration Mistakes).

**NB:** This entire README is adapted from the trial README; most of the information it contains should not be new.

## What is SHROOM?
The task is a binary classification: participants are asked to determine whether a given production from an NLP model constitutes a hallucination.

Participants will be ranked along two metrics: (i) accuracy and (ii) how well their predicted probabilities correlate with the empirical probabilities observed among our annotators.

## File format
The files are formatted as a JSON list. Each element in this list corresponds to a datapoint.

Each datapoint corresponds to a different model production, and contains the following information:
- a task (`task`), indicating what objective the model was optimized for;
- a source (`src`), the input passed to the models for generation;
- a target (`tgt`), the intended reference "gold" text that the model ought to generate;
- a hypothesis (`hyp`), the actual model production;
- a set of per-annotator labels (`labels`), indicating whether each individual annotator thought this datapoint constituted a hallucination or not;
- a majority-based gold-label (`label`), based on the previous per-annotator labels;
- a probability assigned to this datapoint being a hallucination (`p(Hallucination)`), corresponding to the proportion of annotators who considered this specific datapoint to be a hallucination.

We also include an indicator of whether target or source should serve as a semantic reference (`ref`): in some NLP tasks, such as Definition Modeling, the source may not contain the information necessary to establish whether the model production is factually wrong, whereas in other cases, such as Text Simplification, the same holds for the target. The `ref` key therefore indicates whether the target, the source, or both of these fields contain the semantic information necessary to establish whether a datapoint is a hallucination.

Lastly, the model-aware file also identifies the model used to produce each datapoint, as a huggingface identifier (`model`).
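The format above can be read with the standard `json` module. The following is a minimal sketch: `parse_datapoints` and `EXPECTED_KEYS` are hypothetical names introduced for illustration, and the check deliberately leaves out `model`, since only the model-aware file is documented as carrying it.

```python
import json

# Keys described in this README (hypothetical constant name).
EXPECTED_KEYS = {
    "task", "src", "tgt", "hyp", "ref",
    "labels", "label", "p(Hallucination)",
}


def parse_datapoints(text: str) -> list:
    """Parse a JSON list of datapoints and check that each has the documented keys."""
    datapoints = json.loads(text)
    if not isinstance(datapoints, list):
        raise ValueError("expected a JSON list of datapoints")
    for i, dp in enumerate(datapoints):
        missing = EXPECTED_KEYS - dp.keys()
        if missing:
            raise ValueError(f"datapoint {i} lacks keys: {sorted(missing)}")
    return datapoints


# In-memory sample so the sketch runs without any file on disk.
sample = json.dumps([{
    "hyp": "When did you see him?",
    "ref": "either",
    "src": "When\u2019d you last see him?",
    "tgt": "When was the last time you saw him?",
    "model": "tuner007/pegasus_paraphrase",
    "task": "PG",
    "labels": ["Not Hallucination", "Not Hallucination", "Not Hallucination"],
    "label": "Not Hallucination",
    "p(Hallucination)": 0.0,
}])
datapoints = parse_datapoints(sample)
```

For a file on disk, the same function could be fed the file's contents, for example `parse_datapoints(open(path, encoding="utf-8").read())`, where `path` points at whichever validation file you downloaded.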

#### Example: interpreting a Definition Modeling (DM) datapoint

The definition modeling task was introduced in [Noraset et al (2017)](https://dl.acm.org/doi/10.5555/3298023.3298042). In this task, models are trained to generate a definition for a given example in context.

**For model-agnostic datapoints,** we are specifically using the scheme of [Bevilacqua et al (2020)](https://aclanthology.org/2020.emnlp-main.585/). The source (`"src"`) corresponds to the context; the word to define is indicated using two special tokens `<define>` ... `</define>`. The target (`"tgt"`) is the intended definition for this context (as found in Wiktionary); the hypothesis (`"hyp"`) is the actual model production.

To take a concrete example, consider the following datapoint from the trial set:

```json
    {
        "hyp": "(uncountable) The study of trees.",
        "ref": "tgt",
        "src": "It is now generally supposed that the forbidden fruit was a kind of citrus , but certain facts connected with <define> arborolatry </define> seem to me to disprove this opinion .",
        "tgt": "The worship of trees.",
        "model": "",
        "task": "DM",
        "labels": [
            "Hallucination",
            "Hallucination",
            "Hallucination"
        ],
        "label": "Hallucination",
        "p(Hallucination)": 1.0
    }
```

This corresponds to defining the word "arborolatry" (delineated by the `<define>` and `</define>` control tokens) in the following context (corresponding to the `src` key):
 + _It is now generally supposed that the forbidden fruit was a kind of citrus , but certain facts connected with arborolatry seem to me to disprove this opinion._

The model produced the following hypothesis (`hyp` key):
 + `(uncountable) The study of trees.`
 
whereas the gold definition from Wiktionary (`tgt` key) is as follows:
 + _The worship of trees._

Annotators then marked whether this production is considered a hallucination or not. To do so, we asked them to study whether the hypothesis (`hyp` key) contains information that is not supported by the reference. Here, the `ref` key indicates that this reference corresponds to the target (given by its value, `"tgt"`). All three annotators considered the production to be a hallucination (cf. the `labels` key).
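For model-agnostic DM datapoints, the headword and the plain context can be recovered from `src` with a regular expression. This is a minimal sketch; `split_headword` and `DEFINE_SPAN` are hypothetical names, and the code assumes a single `<define>` ... `</define>` span per source, as in the example above.

```python
import re

# Matches the marked span and captures the headword without surrounding spaces.
DEFINE_SPAN = re.compile(r"<define>\s*(.*?)\s*</define>")


def split_headword(src: str) -> tuple:
    """Return (headword, context with the control tokens removed)."""
    match = DEFINE_SPAN.search(src)
    if match is None:
        raise ValueError("no <define> ... </define> span found")
    headword = match.group(1)
    # Replace the whole marked span with the bare headword.
    context = DEFINE_SPAN.sub(lambda m: m.group(1), src)
    return headword, context


src = (
    "It is now generally supposed that the forbidden fruit was a kind of citrus , "
    "but certain facts connected with <define> arborolatry </define> seem to me "
    "to disprove this opinion ."
)
headword, context = split_headword(src)
```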

**For model-aware datapoints,** we rely on the work of [Giulianelli et al (2023)](https://aclanthology.org/2023.acl-long.176). The only field that differs is the source; all other fields have the same interpretation as for model-agnostic DM datapoints, with an added `model` field to indicate the huggingface identifier of the model. For model-aware datapoints, the source (`"src"`) corresponds to the context followed by a query for the meaning of the headword.

To take a concrete example, consider the following validation datapoint:
```json
    {
        "hyp": "To react too much .",
        "ref": "tgt",
        "src": "Please try not to overreact if she drives badly when she is first learning . What is the meaning of overreact ?",
        "tgt": "To react too much or too intensely .",
        "model": "ltg/flan-t5-definition-en-base",
        "task": "DM",
        "labels": [
            "Not Hallucination",
            "Not Hallucination",
            "Not Hallucination",
            "Not Hallucination",
            "Not Hallucination"
        ],
        "label": "Not Hallucination",
        "p(Hallucination)": 0.0
    }
```

Here, the source (`src`) indicates that the word to be defined is "overreact", as in the context "Please try not to overreact if she drives badly when she is first learning."
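The context and headword of a model-aware DM source can be separated by splitting on the query. This is a minimal sketch that assumes every model-aware source ends with the exact wording seen in the example above ("What is the meaning of ... ?"); other datapoints may phrase the query differently. `split_query` and `MARKER` are hypothetical names.

```python
# Assumed query prefix, copied from the example datapoint above.
MARKER = " What is the meaning of "


def split_query(src: str) -> tuple:
    """Return (headword, context) for a source of the form '<context> What is the meaning of <word> ?'."""
    context, marker, query = src.rpartition(MARKER)
    if not marker:
        raise ValueError("source does not contain the expected query")
    # Drop the trailing question mark and surrounding whitespace.
    headword = query.strip().removesuffix("?").strip()
    return headword, context


src = (
    "Please try not to overreact if she drives badly when she is first learning . "
    "What is the meaning of overreact ?"
)
headword, context = split_query(src)
```

Note that `str.removesuffix` requires Python 3.9 or later.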


#### Example: interpreting a Paraphrase Generation (PG) datapoint

The same structure holds for the paraphrase generation (PG) task. For an example, consider the following trial datapoint:

```json
    {
        "hyp": "When did you see him?",
        "ref": "either",
        "src": "When\u2019d you last see him?",
        "tgt": "When was the last time you saw him?",
        "model": "tuner007/pegasus_paraphrase",
        "task": "PG",
        "labels": [
            "Not Hallucination",
            "Not Hallucination",
            "Not Hallucination"
        ],
        "label": "Not Hallucination",
        "p(Hallucination)": 0.0
    }
```

Using the following input (`src` key):
 + _When’d you last see him?_

the model production (listed under the `hyp` key) was as follows:
 + `When did you see him?`

whereas the intended gold target (`tgt` key) was:
 + _When was the last time you saw him?_
 
None of the three annotators considered this production hallucinatory (cf. the `labels` key). To decide, they were instructed to check whether all information stated in the hypothesis was supported by either or both of the source and the target (as made explicit by the `"either"` value of the `ref` key).
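The role of the `ref` key can be sketched as a small helper that returns the texts a hypothesis should be checked against. `reference_texts` is a hypothetical name. The `"tgt"` and `"either"` values appear in the examples of this README; `"src"` is an assumed value for the source-only case.

```python
def reference_texts(datapoint: dict) -> list:
    """Return the field values that serve as semantic reference for this datapoint."""
    ref = datapoint["ref"]
    if ref == "tgt":
        return [datapoint["tgt"]]
    if ref == "src":  # assumed value; not shown in the README examples
        return [datapoint["src"]]
    if ref == "either":
        return [datapoint["src"], datapoint["tgt"]]
    raise ValueError(f"unexpected ref value: {ref!r}")


pg_datapoint = {
    "hyp": "When did you see him?",
    "ref": "either",
    "src": "When\u2019d you last see him?",
    "tgt": "When was the last time you saw him?",
}
references = reference_texts(pg_datapoint)
```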

For PG datapoints, we also indicate the huggingface model that was used to generate the hypothesis, see the `model` key. 

#### Example: interpreting a Machine Translation (MT) datapoint

The structure of MT datapoints is consistent with PG and DM. For instance:

```json
    {
        "hyp": "I have nothing to do with it.",
        "ref": "either",
        "src": "J'en ai rien \u00e0 secouer.",
        "tgt": "I don't give a shit about it.",
        "model": "",
        "task": "MT",
        "labels": [
            "Hallucination",
            "Not Hallucination",
            "Hallucination"
        ],
        "label": "Hallucination",
        "p(Hallucination)": 0.6666666666666666
    }
```

In the above datapoint, the model was tasked with translating the source (`src`) "_J'en ai rien à secouer._"; the expected target gold translation (`tgt`) was "_I don't give a shit about it._"

Instead, the model produced the following (`hyp`):
+ `I have nothing to do with it.`

Two out of three annotators considered this production a hallucination (`labels` key), based on either or both of the source and the target (as made explicit by the `"either"` value of the `ref` key). The majority label (`label` key) is therefore `"Hallucination"`.
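The relation between `labels`, `label` and `p(Hallucination)` can be sketched as follows. `aggregate` is a hypothetical name. The README does not say how an exact tie would be resolved, so the sketch assumes a strict majority is needed for `"Hallucination"`.

```python
def aggregate(labels: list) -> tuple:
    """Return (majority label, proportion of annotators who chose 'Hallucination')."""
    p_hallucination = labels.count("Hallucination") / len(labels)
    # Assumption: a tie (exactly 0.5) falls back to "Not Hallucination".
    label = "Hallucination" if p_hallucination > 0.5 else "Not Hallucination"
    return label, p_hallucination


label, p = aggregate(["Hallucination", "Not Hallucination", "Hallucination"])
```

For the MT datapoint above, this yields the `"Hallucination"` label with a proportion of two thirds, matching the `label` and `p(Hallucination)` fields.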

## How does this validation dataset differ from the trial, train and test sets?
All dataset splits cover datapoints from definition modeling (DM), machine translation (MT) and paraphrase generation (PG).

Furthermore, the train set will not contain manual annotations.