RELIABILITY AND ROBUSTNESS
Consider, for simplicity, the evaluation of a single non-list ground-truth answer $G$ and prediction $\hat{P}$, each with string lengths $|G|$ and $|\hat{P}|$, respectively.
$$
\mathrm{LD}(G, \hat{P}) =
\begin{cases}
1 & \text{if } \mathrm{NA}(G) \wedge |\hat{P}| > 0, \\
0 & \text{if } \mathrm{NA}(G) \wedge |\hat{P}| = 0, \\
|G| & \text{if } |\hat{P}| = 0, \\
\mathrm{LD}(\mathrm{tail}(G), \mathrm{tail}(\hat{P})) & \text{if } G[0] = \hat{P}[0], \\
1 + \min
\begin{cases}
\mathrm{LD}(\mathrm{tail}(G), \hat{P}) & \text{(deletion)} \\
\mathrm{LD}(G, \mathrm{tail}(\hat{P})) & \text{(insertion)} \\
\mathrm{LD}(\mathrm{tail}(G), \mathrm{tail}(\hat{P})) & \text{(substitution)}
\end{cases}
& \text{if } G[0] \neq \hat{P}[0]
\end{cases}
\tag{2.7}
$$
Each condition is tested in turn, and the first case whose condition holds is applied.
The normalized similarity metric is then defined as
$$
\mathrm{NLS}(G, \hat{P}) = 1 - \frac{\mathrm{LD}(G, \hat{P})}{\max\left(1, |G|, |\hat{P}|\right)}.
$$
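As a rough illustration of Eq. (2.7) and the NLS definition above, a minimal Python sketch follows. The names levenshtein, nls, and is_na, as well as the convention of encoding a non-answerable ground truth NA(G) as None, are assumptions of this sketch rather than the thesis' actual evaluation code; memoization and the empty-ground-truth guard are added only to keep the recursion practical.

from functools import lru_cache

def is_na(g):
    # Assumption of this sketch: a non-answerable ground truth NA(G) is encoded as None.
    return g is None

@lru_cache(maxsize=None)
def levenshtein(g, p):
    """Edit distance LD(G, P) following Eq. (2.7), memoized for practicality."""
    if is_na(g):
        return 1 if len(p) > 0 else 0
    if len(p) == 0:
        return len(g)
    if len(g) == 0:  # guard not spelled out in Eq. (2.7): empty (non-NA) ground truth
        return len(p)
    if g[0] == p[0]:
        return levenshtein(g[1:], p[1:])
    return 1 + min(
        levenshtein(g[1:], p),       # deletion
        levenshtein(g, p[1:]),       # insertion
        levenshtein(g[1:], p[1:]),   # substitution
    )

def nls(g, p):
    """Normalized Levenshtein similarity: 1 - LD(G, P) / max(1, |G|, |P|)."""
    g_len = 0 if is_na(g) else len(g)
    return 1.0 - levenshtein(g, p) / max(1, g_len, len(p))

print(nls("receipt", "reciept"))  # two substitutions: 1 - 2/7 ≈ 0.714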
Given multiple ground-truth answer variants $G_i = \{a_1, a_2, \ldots\}$ and a predicted answer $\hat{P}_{Q_i}$ for each question $Q_i$ in a test set of size $N$, we define the complete metric as follows:
$$
\mathrm{ANLS} = \frac{1}{N} \sum_{i=1}^{N} \max_{a \in G_i} s\left(a, \hat{P}_{Q_i}\right)
\tag{2.8}
$$

$$
s\left(a, \hat{P}_{Q_i}\right) =
\begin{cases}
\mathrm{NLS}\left(a, \hat{P}_{Q_i}\right) & \text{if } \mathrm{NLS}\left(a, \hat{P}_{Q_i}\right) > \tau, \\
0 & \text{if } \mathrm{NLS}\left(a, \hat{P}_{Q_i}\right) < \tau,
\end{cases}
\tag{2.9}
$$
where we follow prior literature [39, 449] in setting the threshold τ = 0.5.
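Assuming the nls sketch above, the ANLS aggregation of Eqs. (2.8)-(2.9) could be computed along the following lines; this is an illustrative sketch, not the official evaluation script, and ground_truth_variants and predictions are hypothetical containers.

TAU = 0.5  # threshold from Eq. (2.9), following [39, 449]

def anls(ground_truth_variants, predictions, tau=TAU):
    """ANLS over a test set: ground_truth_variants[i] holds the accepted answer
    variants G_i for question i, predictions[i] the corresponding prediction."""
    total = 0.0
    for variants, pred in zip(ground_truth_variants, predictions):
        best = max(nls(a, pred) for a in variants)
        # max_a s(a, P) of Eqs. (2.8)-(2.9): keep the best NLS only if it clears tau
        total += best if best > tau else 0.0
    return total / len(ground_truth_variants)

# Toy usage: two questions, the second with two accepted answer variants.
print(anls([["receipt"], ["42", "forty-two"]], ["reciept", "42"]))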
In the case of a list-type question, Hungarian matching is performed following [449], according to the NLS between each ground-truth answer part and each predicted answer part.
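A sketch of such a matching step is given below, using scipy.optimize.linear_sum_assignment as one available Hungarian-algorithm implementation and reusing the nls sketch above; the per-pair thresholding and the normalization by the larger of the two part counts are assumptions of this sketch, and the exact aggregation in [449] may differ.

import numpy as np
from scipy.optimize import linear_sum_assignment

def list_answer_score(gt_parts, pred_parts, tau=0.5):
    """One plausible list-type aggregation: Hungarian-match ground-truth parts to
    predicted parts on their pairwise NLS, then threshold each matched pair."""
    sim = np.array([[nls(g, p) for p in pred_parts] for g in gt_parts])
    rows, cols = linear_sum_assignment(sim, maximize=True)  # Hungarian matching
    matched = [sim[r, c] if sim[r, c] > tau else 0.0 for r, c in zip(rows, cols)]
    # Assumption: surplus (unmatched) parts contribute 0, hence the max(...) normalizer.
    return sum(matched) / max(len(gt_parts), len(pred_parts))

print(list_answer_score(["alice", "bob"], ["bob", "alice", "carol"]))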
Proper scoring rules [330] are used for the generic evaluation of predictive performance; they are computed at the instance level and measure both the quality of the predictive function and of the predicted probability distribution (as they are not compatible with an arbitrary CSF):
• Negative Log Likelihood (NLL) [378] is both a popular loss function (cross-entropy) and a scoring rule that only penalizes the (wrong) log probabilities $q_i$ given to the true class, with $I$ an indicator function defining