---
title: TREC Eval
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
  - evaluate
  - metric
description: >-
  The TREC Eval metric combines a number of information retrieval metrics such
  as precision and nDCG. It is used to score rankings of retrieved documents
  with reference values.
---

# Metric Card for TREC Eval

## Metric Description

The TREC Eval metric combines a number of information retrieval metrics such as precision and normalized Discounted Cumulative Gain (nDCG). It is used to score rankings of retrieved documents with reference values.
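
For orientation, the two scores named above follow their standard textbook definitions (a sketch only; the exact gain and cutoff conventions are those of the underlying trec_eval implementation). With $rel_i$ the graded relevance of the document at rank $i$:

$$
P@k = \frac{1}{k}\sum_{i=1}^{k} \mathbf{1}[rel_i > 0],
\qquad
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)},
\qquad
\mathrm{nDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k},
$$

where $\mathrm{IDCG}@k$ is the DCG of an ideal (relevance-sorted) ranking.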

## How to Use

```python
from evaluate import load
trec_eval = load("trec_eval")
results = trec_eval.compute(predictions=[run], references=[qrel])
```

### Inputs

- **predictions** (`dict`): a single retrieval run, given as equal-length lists under the following keys (see the sketch after this list for how to build one programmatically):
    - **query** (`int`): Query ID.
    - **q0** (`str`): Literal `"q0"`.
    - **docid** (`str`): Document ID.
    - **rank** (`int`): Rank of document.
    - **score** (`float`): Score of document.
    - **system** (`str`): Tag for current run.
- **references** (`dict`): a single qrel, given as equal-length lists under the following keys:
    - **query** (`int`): Query ID.
    - **q0** (`str`): Literal `"q0"`.
    - **docid** (`str`): Document ID.
    - **rel** (`int`): Relevance of document.
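
For illustration, a run dict can be assembled from any retriever's output. The sketch below is an assumption about your setup, not part of the metric's API: the `(query_id, doc_id, score)` tuples and the `"my_system"` tag are hypothetical placeholders.

```python
# Hypothetical scored results, already sorted by descending score within each query.
scored = [(0, "doc_2", 1.5), (0, "doc_1", 1.2), (1, "doc_7", 0.9)]

run = {"query": [], "q0": [], "docid": [], "rank": [], "score": [], "system": []}
next_rank = {}  # per-query rank counter
for query_id, doc_id, score in scored:
    rank = next_rank.get(query_id, 0)
    run["query"].append(query_id)
    run["q0"].append("q0")
    run["docid"].append(doc_id)
    run["rank"].append(rank)
    run["score"].append(score)
    run["system"].append("my_system")
    next_rank[query_id] = rank + 1
```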

### Output Values

- **runid** (`str`): Run name.
- **num_ret** (`int`): Number of retrieved documents.
- **num_rel** (`int`): Number of relevant documents.
- **num_rel_ret** (`int`): Number of retrieved relevant documents.
- **num_q** (`int`): Number of queries.
- **map** (`float`): Mean average precision.
- **gm_map** (`float`): Geometric mean average precision.
- **bpref** (`float`): Binary preference score.
- **Rprec** (`float`): Precision@R, where R is the number of relevant documents.
- **recip_rank** (`float`): Reciprocal rank.
- **P@k** (`float`): Precision@k (k in [5, 10, 15, 20, 30, 100, 200, 500, 1000]); see the snippet after this list for how the keys are laid out.
- **NDCG@k** (`float`): nDCG@k (k in [5, 10, 15, 20, 30, 100, 200, 500, 1000]).
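
The cutoff-based scores are returned under keys of the form `P@{k}` and `NDCG@{k}`. A small sketch for collecting them into a report, assuming `results` is a dict returned by `compute` (as in the examples below):

```python
# Print precision and nDCG at each supported cutoff from a trec_eval results dict.
for k in [5, 10, 15, 20, 30, 100, 200, 500, 1000]:
    p, ndcg = results[f"P@{k}"], results[f"NDCG@{k}"]
    print(f"k={k:<4}  P@k={p:.4f}  nDCG@k={ndcg:.4f}")
```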

### Examples

A minimal example looks as follows:

```python
import evaluate

qrel = {
    "query": [0],
    "q0": ["q0"],
    "docid": ["doc_1"],
    "rel": [2]
}
run = {
    "query": [0, 0],
    "q0": ["q0", "q0"],
    "docid": ["doc_2", "doc_1"],
    "rank": [0, 1],
    "score": [1.5, 1.2],
    "system": ["test", "test"]
}

trec_eval = evaluate.load("trec_eval")
results = trec_eval.compute(references=[qrel], predictions=[run])
results["P@5"]
# 0.2  (one relevant document among the top 5 retrieved -> 1/5)
```

A more realistic use case, using the example run and qrel files shipped with trectools:

```python
import evaluate
import pandas as pd

qrel = pd.read_csv("robust03_qrels.txt", sep=r"\s+", names=["query", "q0", "docid", "rel"])
qrel["q0"] = qrel["q0"].astype(str)
qrel = qrel.to_dict(orient="list")

run = pd.read_csv("input.InexpC2", sep=r"\s+", names=["query", "q0", "docid", "rank", "score", "system"])
run = run.to_dict(orient="list")

trec_eval = evaluate.load("trec_eval")
result = trec_eval.compute(predictions=[run], references=[qrel])
result
```

```python
{'runid': 'InexpC2',
 'num_ret': 100000,
 'num_rel': 6074,
 'num_rel_ret': 3198,
 'num_q': 100,
 'map': 0.22485930431817494,
 'gm_map': 0.10411523825735523,
 'bpref': 0.217511695914079,
 'Rprec': 0.2502547201167236,
 'recip_rank': 0.6646545943335417,
 'P@5': 0.44,
 'P@10': 0.37,
 'P@15': 0.34600000000000003,
 'P@20': 0.30999999999999994,
 'P@30': 0.2563333333333333,
 'P@100': 0.1428,
 'P@200': 0.09510000000000002,
 'P@500': 0.05242,
 'P@1000': 0.03198,
 'NDCG@5': 0.4101480395089769,
 'NDCG@10': 0.3806761417784469,
 'NDCG@15': 0.37819463408955706,
 'NDCG@20': 0.3686080836061317,
 'NDCG@30': 0.352474353427451,
 'NDCG@100': 0.3778329431025776,
 'NDCG@200': 0.4119129817248979,
 'NDCG@500': 0.4585354576461375,
 'NDCG@1000': 0.49092149290805653}
```

## Limitations and Bias

The `trec_eval` metric requires the predictions and references to be provided in the TREC run and qrel formats, respectively.
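
For illustration, both formats are whitespace-separated with one document per line. The file contents below are hypothetical placeholders; the column layout matches the `names=` arguments used in the pandas example above:

```python
import io
import pandas as pd

# Run format: query q0 docid rank score system
run_txt = "301 q0 FBIS3-10082 0 12.34 myrun\n301 q0 FBIS3-10169 1 11.20 myrun\n"
# Qrel format: query q0 docid rel
qrel_txt = "301 q0 FBIS3-10082 1\n301 q0 FBIS3-10169 0\n"

run = pd.read_csv(io.StringIO(run_txt), sep=r"\s+",
                  names=["query", "q0", "docid", "rank", "score", "system"]).to_dict(orient="list")
qrel = pd.read_csv(io.StringIO(qrel_txt), sep=r"\s+",
                   names=["query", "q0", "docid", "rel"]).to_dict(orient="list")
```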

## Citation

```bibtex
@inproceedings{palotti2019,
  author    = {Palotti, Joao and Scells, Harrisen and Zuccon, Guido},
  title     = {TrecTools: an open-source Python library for Information Retrieval practitioners involved in TREC-like campaigns},
  series    = {SIGIR'19},
  year      = {2019},
  location  = {Paris, France},
  publisher = {ACM}
}
```

## Further References