November 8, 2005
We found a bug in the calculation of the bpref measure within trec_eval.
BUG DESCRIPTION: The bpref measure calculates the fraction of
preferences between pairs of judged relevant and non-relevant
documents that were correctly ordered in a document ranking.
When a run does not retrieve R judged non-relevant documents,
only the retrieved non-relevant documents were considered. Thus
a (worst case) run which retrieved only 5 judged documents, the
first non-relevant and the following 4 relevant, would have a
score of 0.0, since the fraction of correct preferences among the
retrieved judged documents was 0.0. However, the retrieved judged
relevant documents should have been counted as being preferred
over any judged non-relevant document that wasn't retrieved. If
the non-retrieved documents included 3 judged non-relevant
documents and 2 judged relevant documents, then the bpref score
should be 0.5 (= ((4 * (3/4)) + 2 * (0/4)) / 6).
BUG IMPACT: Almost no impact for standard TREC-type ad hoc runs
(retrieve 1000 documents). Topics with large numbers of relevant
documents (eg, over 300) had their scores artificially depressed,
and thus performance with the corrected bpref will be higher on
those topics. Kendall tau of system rankings shows very strong
(.95 - .98) agreement between the buggy and new bpref.
There may be more impact for non-standard environments where the
number of retrieved judged documents is small. Eg, I've been
told the 2005 Terabyte efficiency track (only retrieve 20
documents) is more strongly affected.
BUG FIX: Version 8.0, available from the usual places at NIST,
implements the corrected bpref calculations. It also adds the
measures "old_bpref" and "old_bpref_top10pRnonrel" that calculate
the buggy numbers for comparisons with old results (the latter
measure was used in the SIGIR 2004 bpref paper). People using
bpref should switch to Version 8.0 or higher as soon as possible.
BUG APOLOGY: I want to apologize to the community for the error.
Doing research and using new measures is hard enough without
having to worry about buggy implementations!
Chris Buckley
#######################################################################
For those of you working with bpref who want to know more details
about the bug and its effects, here's a fuller version of the above.
BUG DESCRIPTION: Here's code, pseudo-code, and comments comparing
old_bpref (the buggy version) and bpref on a single topic:
long nonrel_so_far;         /* Number of non-relevant documents seen while
                               going through the ranking */
long num_nonrel;            /* Number of judged non-relevant documents */
long nonrel_ret;            /* Number of retrieved judged non-relevant documents */
long pref_top_Rnonrel_num;  /* set to R (eval->num_rel) */

nonrel_so_far = 0;
foreach doc in retrieved documents (sorted in decreasing score) {
    if (doc is not relevant)
        nonrel_so_far++;
    else {
        /* Add fraction of correct preferences for this doc */
        if (nonrel_so_far) {
-->new      eval->bpref += 1.0 -
                (((float) MIN (nonrel_so_far, pref_top_Rnonrel_num)) /
                 (float) MIN (num_nonrel, pref_top_Rnonrel_num));
-->buggy    eval->old_bpref += 1.0 -
                (((float) MIN (nonrel_so_far, pref_top_Rnonrel_num)) /
                 (float) MIN (nonrel_ret, pref_top_Rnonrel_num));
        }
        else {
            eval->bpref += 1.0;
            eval->old_bpref += 1.0;
        }
    }
}
if (eval->num_rel) {
    eval->bpref /= eval->num_rel;
    eval->old_bpref /= eval->num_rel;
}
BUG HOW IT HAPPENED: The first versions of bpref I wrote were
defined only on the retrieved documents. When I switched to
variants using the TREC standard of considering all documents
(which improved the measures greatly), I overlooked making the
needed change to the denominator in the code above.
BUG HOW DISCOVERED: On November 4, Ian Soboroff (NIST) was
looking at the bpref code and couldn't understand the "corner"
conditions in the code. We talked some, I stared at the code for a
couple of minutes, and then went "OOPS" (or words to that
effect).
CHANGES CAUSED BY BUG: I made the obvious changes to the bpref
code, and also revamped the whole structure of the main function
to at least break apart the major different kinds of measures
(eg, computing the cutoff measures, the per-document average
measures, the bpref measures, and the time measures separately).
It was impossible to understand the function; now it's merely
very difficult! I should probably rewrite it to compute most
measures separately; execution speed is no longer the critical
factor it once was.
BUG IMPACT VALIDATION: I reran the complete set of runs that went
into the Buckley, Voorhees SIGIR 2004 bpref paper. That included
comparing all systems in tasks in TREC 8, TREC 10, and TREC 12 at
various levels of completeness of the document set and relevance
levels. In the comparisons of bpref system rankings versus
original MAP rankings, the Kendall tau scores of the two versions
of bpref were basically identical. They did not vary from each
other by more than .01, except when only using 1% or 2% of the
judgements, in which case it was less than .03. (I was actually
expecting much greater differences when using very small numbers
of relevant and non-relevant documents. But it looks like it was
the same for all systems.)
Using full information, the actual scores of the old and new bprefs
were pretty much the same when averaged over all systems, except for 3
topics in TREC 10. Here are the topics with the top differences in old
and new bpref scores when averaged over all systems:
qid   old_bpref   bpref      diff
541   0.189271    0.324097   -0.134826
544   0.555475    0.633235   -0.07776
549   0.278118    0.321332   -0.043214
530   0.391955    0.394681   -0.002726
519   0.132094    0.132671   -0.000577
509   0.266160    0.266544   -0.000384
547   0.167335    0.167504   -0.000169
511   0.303595    0.303679   -8.4e-05
501   0.175508    0.175508    0         ... all other topics tied at 0
Here are the numbers of relevant documents per topic:
num_rel 541  372
num_rel 549  367
num_rel 544  324
num_rel 511  165
num_rel 519  149
num_rel 547  144
num_rel 509  140
num_rel 530  124
num_rel 527   93
Clearly there's a big impact on scores for the 3 topics with over
300 relevant documents, a small impact on the 5 topics with
between 100 and 200 relevant documents, and no impact on the
rest.
For TREC 8, there was 1 topic with a diff greater than .01 (.029)
and 23 topics that had any difference at all. For TREC 12, there
was a small impact on 32/100 topics, with the largest being .008.
Overall, I conclude that there's a minor impact on the standard
TREC 1000-document evaluations due to the buggy bpref on topics
which have hundreds of relevant documents. The average scores of
all systems will change because of these topics, but it should
not have an important effect on system rankings (except possibly
for systems which consistently retrieve fewer than 1000
documents).
I sent trec_eval Version 8.0beta to Ian to run on this year's
bpref-oriented runs. His report was that there was strong agreement in
Kendall tau between the buggy and nonbuggy versions, and when you
compared each against MAP, they were within .016 of each other (closer
depending on task). It actually was the buggy version that tracked
MAP slightly more closely, perhaps indicating that MAP emphasizes
topics with lots of relevant documents a bit less than bpref does.
Ian reported that the runs in the Terabyte efficiency track, where
systems only retrieved 20 documents per topic (running 50,000 topics
but evaluating over 50), had much larger bpref differences in score
(the new bpref average scores being 50% higher than the old), but
still had a .90 Kendall tau between the versions; about the same as
either had with MAP. That's good enough to reassure me that most
conclusions people have reached in experiments with bpref will still
be valid, though the numbers will have to be redone.