November 8, 2005
We found a bug in the calculation of the bpref measure within trec_eval.
BUG DESCRIPTION: The bpref measure calculates the fraction of
preferences between pairs of judged relevant and non-relevant
documents that were correctly ordered in a document ranking.
When a run does not retrieve R judged non-relevant documents,
only the retrieved non-relevant documents were considered. Thus
a (worst case) run which retrieved only 5 judged documents, the
first non-relevant and the following 4 relevant, would have a
score of 0.0, since the fraction of correct preferences among the
retrieved judged documents was 0.0. However, the retrieved judged
relevant documents should have been counted as being preferred
over any judged non-relevant document that wasn't retrieved. If
the non-retrieved documents included 3 judged non-relevant
documents and 2 judged relevant documents, then the bpref score
should be 0.5 (= ((4 * (3/4)) + 2 * (0/4)) / 6).
BUG IMPACT: Almost no impact for standard TREC-type ad hoc runs
(retrieve 1000 documents). Topics with large numbers of relevant
documents (eg, over 300) had their scores artificially depressed,
and thus performance with the corrected bpref will be higher on
those topics. Kendall tau of system rankings shows very strong
(.95 - .98) agreement between the buggy and new bpref.
There may be more impact for non-standard environments where the
number of retrieved judged documents is small. Eg, I've been
told the 2005 Terabyte efficiency track (only retrieve 20
documents) is more strongly affected.
BUG FIX: Version 8.0, available from the usual places at NIST,
implements the corrected bpref calculations. It also adds the
measures "old_bpref" and "old_bpref_top10pRnonrel" that calculate
the buggy numbers for comparisons with old results (the latter
measure was used in the SIGIR 2004 bpref paper). People using
bpref should switch to Version 8.0 or higher as soon as possible.
BUG APOLOGY: I want to apologize to the community for the error.
Doing research and using new measures is hard enough without
having to worry about buggy implementations!
Chris Buckley
#######################################################################
For those of you working with bpref who want to know more details
about the bug and its effects, here's a fuller version of the above.
BUG DESCRIPTION: Here's code, pseudo-code, and comments comparing
old_bpref (the buggy version) and bpref on a single topic:
long nonrel_so_far;         /* Number of non-relevant documents seen while
                               going through the ranking */
long num_nonrel;            /* Number of judged non-relevant documents */
long nonrel_ret;            /* Number of retrieved judged non-relevant documents */
long pref_top_Rnonrel_num;  /* set to R (eval->num_rel) */

nonrel_so_far = 0;
foreach doc in retrieved documents (sorted in decreasing score) {
    if (doc is not relevant)
        nonrel_so_far++;
    else {
        /* Add fraction of correct preferences for this doc */
        if (nonrel_so_far) {
-->new      eval->bpref += 1.0 -
                (((float) MIN (nonrel_so_far, pref_top_Rnonrel_num)) /
                 (float) MIN (num_nonrel, pref_top_Rnonrel_num));
-->buggy    eval->old_bpref += 1.0 -
                (((float) MIN (nonrel_so_far, pref_top_Rnonrel_num)) /
                 (float) MIN (nonrel_ret, pref_top_Rnonrel_num));
        }
        else {
            eval->bpref += 1.0;
            eval->old_bpref += 1.0;
        }
    }
}
if (eval->num_rel) {
    eval->bpref /= eval->num_rel;
    eval->old_bpref /= eval->num_rel;
}
BUG HOW IT HAPPENED: The first versions of bpref I wrote were
defined only on the retrieved documents. When I switched to
variants using the TREC standard of considering all documents
(which improved the measures greatly), I overlooked making the
needed change to the denominator in the code above.
BUG HOW DISCOVERED: On November 4, Ian Soboroff (NIST) was
looking at the bpref code and couldn't understand the "corner"
conditions in the code. We talked some, I stared at the code for a
couple of minutes, and then went "OOPS" (or words to that
effect).
CHANGES CAUSED BY BUG: I made the obvious changes to the bpref
code, and also revamped the whole structure of the main function
to at least break apart the major different kinds of measures
(eg, computing the cutoff measures, the per-document average
measures, the bpref measures, and the time measures separately).
It was impossible to understand the function; now it's merely
very difficult! I should probably rewrite it to compute most
measures separately; execution speed is no longer the critical
factor it once was.
BUG IMPACT VALIDATION: I reran the complete set of runs that went
into the Buckley, Voorhees SIGIR 2004 bpref paper. That included
comparing all systems in tasks in TREC 8, TREC 10, and TREC 12 at
various levels of completeness of the document set and relevance
levels. In the comparisons of bpref system rankings versus
original MAP rankings, the Kendall tau scores of the two versions
of bpref were basically identical. They did not vary from each
other by more than .01, except when only using 1% or 2% of the
judgements, in which case it was less than .03. (I was actually
expecting much greater differences when using very small numbers
of relevant and non-relevant documents. But it looks like it was
the same for all systems.)
Using full information, the actual scores of the old and new bprefs
were pretty much the same when averaged over all systems, except for 3
topics in TREC 10. Here are the topics with the top differences in old
and new bpref scores when averaged over all systems:
qid   old_bpref   bpref      diff
541   0.189271    0.324097   -0.134826
544   0.555475    0.633235   -0.07776
549   0.278118    0.321332   -0.043214
530   0.391955    0.394681   -0.002726
519   0.132094    0.132671   -0.000577
509   0.266160    0.266544   -0.000384
547   0.167335    0.167504   -0.000169
511   0.303595    0.303679   -8.4e-05
501   0.175508    0.175508    0         ... all other topics tied at 0
Here are the numbers of relevant documents per topic:
num_rel 541  372
num_rel 549  367
num_rel 544  324
num_rel 511  165
num_rel 519  149
num_rel 547  144
num_rel 509  140
num_rel 530  124
num_rel 527   93
Clearly there's a big impact on scores for the 3 topics with over
300 relevant documents, a small impact on the 5 topics with
between 100 and 200 relevant documents, and no impact on the
rest.
For TREC 8, there was 1 topic with a diff greater than .01 (.029)
and 23 topics that had any difference at all. For TREC 12, there
was a small impact on 32/100 topics, with the largest being .008.
Overall, I conclude that there's a minor impact on the standard
TREC 1000-document evaluations due to the buggy bpref on topics
which have hundreds of relevant documents. The average scores of
all systems will change because of these topics, but it should
not have an important effect on system rankings (except possibly
for systems which consistently retrieve fewer than 1000
documents).
I sent trec_eval Version 8.0beta to Ian to run on this year's
bpref-oriented runs. His report was that there was strong agreement in
Kendall tau between the buggy and nonbuggy versions, and when you
compared each against MAP, they were within .016 of each other (closer
depending on task). It actually was the buggy version that tracked
MAP slightly more closely, perhaps indicating that MAP emphasizes
topics with lots of relevant documents a bit less than bpref does.
Ian reported that the runs in the Terabyte efficiency track, where
systems only retrieved 20 documents per topic (running 50,000 topics
but evaluating over 50), had much larger bpref differences in score
(the new bpref average scores being 50% higher than the old), but
still had a .90 Kendall tau between the versions; about the same as
either had with MAP. That's good enough to reassure me that most
conclusions people have reached in experiments with bpref will still
be valid, though the numbers will have to be redone.