FEVER Benchmark — Controlled Retrieval Evaluation

A frozen, reproducible benchmark for testing retrieval methods on the BEIR FEVER dataset.

Quick Start

from fever_benchmark import FEVERBenchmark

bench = FEVERBenchmark()

# Load the BM25 top-100 pool
pool = bench.load_pool("beir_pool.json")

# Re-rank with your method
my_rankings = {}
for qid, docs in pool.items():
    # docs is [(docid, bm25_score), ...]
    # your_score is a placeholder: substitute your own scoring function
    my_rankings[qid] = [(docid, your_score(docid, qid))
                        for docid, _ in docs]
    # Sort descending by score
    my_rankings[qid].sort(key=lambda x: -x[1])

# Evaluate
results = bench.evaluate(my_rankings)
print(f"NDCG@10: {results['ndcg@10']:.4f}")

# Compare against baselines
bench.verify_reproduction({"Your Method": results["ndcg@10"]})
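
For trying the re-ranking loop without the benchmark files, the following is a minimal sketch on a toy pool in the same shape as the one described above. The pool contents, the model_scores table, and the rerank helper are all hypothetical and only illustrate the data flow; they are not part of fever_benchmark.

```python
# Toy pool in the documented shape: qid -> [(docid, bm25_score), ...]
# (hypothetical IDs and scores, for illustration only)
pool = {
    "q1": [("d1", 12.4), ("d2", 11.0), ("d3", 9.7)],
    "q2": [("d4", 8.2), ("d5", 7.9)],
}

# Hypothetical scores from your own model, keyed by (qid, docid)
model_scores = {
    ("q1", "d1"): 0.2, ("q1", "d2"): 0.9, ("q1", "d3"): 0.5,
    ("q2", "d4"): 0.1, ("q2", "d5"): 0.7,
}


def your_score(docid, qid):
    """Stand-in scorer: look up a precomputed score, defaulting to 0.0."""
    return model_scores.get((qid, docid), 0.0)


def rerank(pool, score_fn):
    """Re-score every pooled document and sort each query's list descending."""
    rankings = {}
    for qid, docs in pool.items():
        scored = [(docid, score_fn(docid, qid)) for docid, _ in docs]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        rankings[qid] = scored
    return rankings


my_rankings = rerank(pool, your_score)
for qid, docs in my_rankings.items():
    print(qid, [docid for docid, _ in docs])
```

The resulting my_rankings dictionary has the same qid-to-list shape that the Quick Start passes to bench.evaluate().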

What's Included

File	Description
`beir_pool.json`	BM25 top-100 pool (k1=1.2, b=0.75) — 6,666 queries
`fever_benchmark.py`	Benchmark evaluation class
`setup_fever_benchmark.py`	Regenerate pool from scratch
`query_deltas.csv`	Per-query NDCG@10 for all Hadith variants
`benchmark_manifest.md`	Frozen configuration and checksums
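
Since benchmark_manifest.md records checksums for the frozen files, a local copy of the pool can be hashed and compared against the manifest by hand. The sketch below assumes SHA-256; the manifest's actual hash algorithm is not stated here, so adjust it if the manifest says otherwise. The file_sha256 helper is hypothetical and not part of the benchmark code.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumption: the pool file sits in the current directory.
pool_path = Path("beir_pool.json")
if pool_path.exists():
    # Compare this value by eye with the entry in benchmark_manifest.md
    print(f"{pool_path.name}: {file_sha256(pool_path)}")
else:
    print(f"{pool_path.name} not found; nothing to hash")
```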

Baseline Scores

System	NDCG@10
BM25 (k1=1.2, b=0.75)	0.5214
MiniLM Dense	0.6497
Dense + Muttafaq (best Hadith)	0.6461

To verify a reproduction, pass your rankings to bench.evaluate() and compare the resulting NDCG@10 against the scores above.
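
As a reminder of what the NDCG@10 column measures, here is a minimal single-query sketch. It assumes linear gain with a log2 rank discount, which is the common trec_eval-style definition; bench.evaluate() may be implemented differently, so treat this as an illustration rather than the benchmark's own code. The function name and the toy inputs are hypothetical.

```python
import math


def ndcg_at_k(ranked_docids, relevance, k=10):
    """NDCG@k for one query.

    ranked_docids: doc IDs in ranked order (best first).
    relevance: dict docid -> relevance grade; missing docs count as 0.
    """
    # Discounted cumulative gain of the submitted ranking (ranks start at 1)
    gains = [relevance.get(docid, 0) for docid in ranked_docids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    # Ideal DCG: the same judgments sorted from most to least relevant
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))

    return dcg / idcg if idcg > 0 else 0.0


# Toy query with one relevant document, retrieved at rank 2
score = ndcg_at_k(["d3", "d1", "d2"], {"d1": 1})
print(f"NDCG@10: {score:.4f}")
```

A benchmark-level score such as those in the table is typically the mean of this per-query value over all evaluated queries.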

Setup from Scratch

python setup_fever_benchmark.py

This downloads BEIR FEVER (~3.3 GB), builds the Pyserini index (~6 min), and runs BM25 retrieval (~7 min).

Requirements

Python 3.8+
pyserini>=0.14.0
Java 11+ (for Pyserini/Lucene)

License

Same as BEIR FEVER — research use.
