RELIABILITY AND ROBUSTNESS
Consider, for simplicity, the evaluation of a single non-list ground truth answer
G and prediction P̂, with string lengths |G| and |P̂|, respectively.
\[
\mathrm{LD}(G, \hat{P}) =
\begin{cases}
1 & \text{if } \mathrm{NA}(G) \wedge |\hat{P}| > 0, \\
0 & \text{if } \mathrm{NA}(G) \wedge |\hat{P}| = 0, \\
|G| & \text{if } |\hat{P}| = 0, \\
\mathrm{LD}(\mathrm{tail}(G), \mathrm{tail}(\hat{P})) & \text{if } G[0] = \hat{P}[0], \\
1 + \min
\begin{cases}
\mathrm{LD}(\mathrm{tail}(G), \hat{P}) & \text{(deletion)}, \\
\mathrm{LD}(G, \mathrm{tail}(\hat{P})) & \text{(insertion)}, \\
\mathrm{LD}(\mathrm{tail}(G), \mathrm{tail}(\hat{P})) & \text{(substitution)}
\end{cases}
& \text{if } G[0] \neq \hat{P}[0]
\end{cases}
\tag{2.7}
\]
Each of the conditions is tested in turn, and the first one that is true is executed.
The normalized similarity metric is then defined as
\[
\mathrm{NLS}(G, \hat{P}) = 1 - \frac{\mathrm{LD}(G, \hat{P})}{\max(1, |G|, |\hat{P}|)}.
\]
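As an illustration, the distance and its normalization can be sketched in Python. This is a minimal sketch under stated assumptions, not a reference implementation: the function names are hypothetical, NA(G) is encoded as `None`, the normalization is read as 1 minus the length-normalized distance, and the recursive case analysis for answerable strings is replaced by the standard iterative dynamic program, which computes the same edit distance without exponential recursion.

```python
from typing import Optional


def levenshtein(g: str, p: str) -> int:
    """Edit distance between g and p with unit-cost edits."""
    # Two-row dynamic program, swapped in for the recursive definition.
    # prev[j] holds the distance between the processed prefix of g and p[:j].
    prev = list(range(len(p) + 1))
    for i, g_char in enumerate(g, start=1):
        curr = [i]  # distance from g[:i] to the empty prefix of p
        for j, p_char in enumerate(p, start=1):
            if g_char == p_char:
                curr.append(prev[j - 1])  # first characters match, no cost
            else:
                curr.append(1 + min(prev[j],       # deletion
                                    curr[j - 1],   # insertion
                                    prev[j - 1]))  # substitution
        prev = curr
    return prev[-1]


def ld(g: Optional[str], p: str) -> int:
    """LD with the two not-answerable cases checked first."""
    # Assumption: a not-answerable ground truth, NA(G), is passed as None.
    if g is None:
        return 1 if len(p) > 0 else 0
    return levenshtein(g, p)


def nls(g: Optional[str], p: str) -> float:
    """Normalized similarity: 1 - LD / max(1, |G|, |P|)."""
    # Assumption: |G| is taken to be 0 when G is not answerable.
    g_len = 0 if g is None else len(g)
    return 1.0 - ld(g, p) / max(1, g_len, len(p))


if __name__ == "__main__":
    # "kitten" -> "sitting" needs 3 edits; the longer string has 7 characters.
    print(levenshtein("kitten", "sitting"))  # 3
    print(nls("kitten", "sitting"))          # 1 - 3/7
```

The `max(1, ...)` term in the denominator guards against division by zero when both strings are empty, in which case the similarity is 1.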
Given multiple ground truth answer variants G_i = {a_1, a_2, ...} and a predicted
answer P̂_{Q_i} for each question Q_i in the test set of size N, we define the
complete metric as follows:
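\[
\mathrm{ANLS} = \frac{1}{N} \sum_{i=1}^{N} \max_{a \in G_i} s(a, \hat{P}_{Q_i})
\tag{2.8}
\]
\[
s(a, \hat{P}_{Q_i}) =
\begin{cases}
\mathrm{NLS}(a, \hat{P}_{Q_i}) & \text{if } \mathrm{NLS}(a, \hat{P}_{Q_i}) > \tau, \\
0 & \text{if } \mathrm{NLS}(a, \hat{P}_{Q_i}) < \tau,
\end{cases}
\tag{2.9}
\]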
where we follow prior literature [39, 449] in setting the threshold τ = 0.5.
In the case of a list-type question, Hungarian matching is performed following
[449] according to NLS between each ground truth answer part and each
prediction answer part.
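The aggregation in Equations (2.8) and (2.9) for non-list questions can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the input layout are hypothetical, with `nls_scores[i][k]` holding the precomputed NLS between the k-th ground truth variant and the prediction for question i. The definition leaves the case NLS = τ unspecified; this sketch assumes it scores 0.

```python
from typing import Sequence


def anls_from_scores(nls_scores: Sequence[Sequence[float]],
                     tau: float = 0.5) -> float:
    """Average, over questions, of the best thresholded NLS per question."""
    if len(nls_scores) == 0:
        raise ValueError("at least one question is required")
    total = 0.0
    for variant_scores in nls_scores:
        # Eq. (2.9): keep a score only when it exceeds the threshold.
        # Assumption: a score exactly equal to tau is mapped to 0.
        thresholded = [s if s > tau else 0.0 for s in variant_scores]
        # Eq. (2.8): take the best variant for this question.
        # Assumption: a question without variants contributes 0.
        total += max(thresholded, default=0.0)
    return total / len(nls_scores)


if __name__ == "__main__":
    # Three questions: an exact match on one variant, a weak match that
    # falls below the threshold, and a partial match above the threshold.
    scores = [[1.0, 0.2], [0.4], [0.75]]
    print(anls_from_scores(scores))  # (1.0 + 0.0 + 0.75) / 3
```

Thresholding before taking the maximum gives the same result as thresholding after it, since the mapping in Equation (2.9) never reorders scores.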
Proper scoring rules [330] are used for generic evaluation of predictive
performance. They compute scores at the instance level and measure both
the quality of the predictive function and that of the predicted probability
distribution (as they are not compatible with an arbitrary CSF):
• Negative Log Likelihood (NLL) [378] is both a popular loss function
(cross-entropy) and a scoring rule, which only penalizes (wrong) log
probabilities q_i given to the true class, with I an indicator function defining