FEVER Benchmark: Controlled Retrieval Evaluation

A frozen, reproducible benchmark for testing retrieval methods on the BEIR FEVER dataset.

Quick Start

from fever_benchmark import FEVERBenchmark

bench = FEVERBenchmark()

# Load the BM25 top-100 pool
pool = bench.load_pool("beir_pool.json")

# Re-rank with your method
my_rankings = {}
for qid, docs in pool.items():
    # docs is [(docid, bm25_score), ...]
    # Replace with your scores:
    my_rankings[qid] = [(docid, your_score(docid, qid)) 
                        for docid, _ in docs]
    # Sort descending by score
    my_rankings[qid].sort(key=lambda x: -x[1])

# Evaluate
results = bench.evaluate(my_rankings)
print(f"NDCG@10: {results['ndcg@10']:.4f}")

# Compare against baselines
bench.verify_reproduction({"Your Method": results["ndcg@10"]})

What's Included

File                        Description
beir_pool.json              BM25 top-100 pool (k1=1.2, b=0.75), 6,666 queries
fever_benchmark.py          Benchmark evaluation class
setup_fever_benchmark.py    Regenerates the pool from scratch
query_deltas.csv            Per-query NDCG@10 for all Hadith variants
benchmark_manifest.md       Frozen configuration and checksums
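
Since the manifest records checksums, a downloaded file can be checked against it before running anything. The sketch below is a minimal illustration, not part of the benchmark code: it assumes the checksums are SHA-256 hex digests (the manifest may use a different algorithm), and the helper name sha256_of_file is hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks.

    Assumption: benchmark_manifest.md lists SHA-256 digests. Swap in
    another hashlib constructor if the manifest uses a different hash.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so a large pool file is never fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Compare this value by eye with the entry in benchmark_manifest.md.
    print(sha256_of_file("beir_pool.json"))
```

If the digest differs from the manifest entry, the pool is not the frozen one and scores may not be comparable with the baselines.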

Baseline Scores

System                            NDCG@10
BM25 (k1=1.2, b=0.75)             0.5214
MiniLM Dense                      0.6497
Dense + Muttafaq (best Hadith)    0.6461

To verify a reproduction, pass your rankings to bench.evaluate() and compare the resulting NDCG@10 with the numbers above.
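
For readers who want to sanity-check a score by hand, here is a rough sketch of how NDCG@10 is commonly computed for a single query. It is an independent illustration and not the code inside bench.evaluate(): it assumes linear gains and a log2 rank discount, and the function names and the toy qrels are hypothetical.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (ranks start at 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_docids, qrels, k=10):
    """NDCG@k for one query.

    ranked_docids: doc ids sorted best-first.
    qrels: {docid: relevance} for this query; unlisted docs count as 0.
    """
    gains = [qrels.get(docid, 0) for docid in ranked_docids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    toy_qrels = {"d1": 1, "d3": 1}      # hypothetical judgments
    toy_ranking = ["d2", "d1", "d3"]    # relevant docs at ranks 2 and 3
    print(f"NDCG@10: {ndcg_at_k(toy_ranking, toy_qrels):.4f}")
```

The benchmark-level figure is then the mean of this per-query value over all queries, which is why a single badly ranked query moves the total only slightly.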

Setup from Scratch

python setup_fever_benchmark.py

This downloads BEIR FEVER (3.3 GB), builds the Pyserini index (~6 min), and runs BM25 retrieval (~7 min).
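
To make the k1 and b parameters concrete, the following sketch scores a toy corpus with a textbook BM25 formula using the same settings (k1=1.2, b=0.75). It is only an illustration under stated assumptions: Pyserini delegates scoring to Lucene, whose implementation differs in details such as length encoding, so these numbers will not match the pool's scores. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document against a query with textbook BM25.

    docs: {docid: list of tokens}. Returns {docid: score}.
    """
    n_docs = len(docs)
    if n_docs == 0:
        return {}
    avgdl = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for docid, tokens in docs.items():
        tf = Counter(tokens)
        # b controls how strongly long documents are penalized.
        length_ratio = len(tokens) / avgdl if avgdl > 0 else 0.0
        norm = k1 * (1 - b + b * length_ratio)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 controls how quickly repeated terms saturate.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores[docid] = score
    return scores


if __name__ == "__main__":
    toy_docs = {
        "d1": ["paris", "capital", "france"],
        "d2": ["berlin", "capital", "germany"],
    }
    ranked = sorted(bm25_scores(["paris", "capital"], toy_docs).items(),
                    key=lambda pair: pair[1], reverse=True)
    print(ranked)
```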

Requirements

  • Python 3.8+
  • pyserini>=0.14.0
  • Java 11+ (for Pyserini/Lucene)

License

Same as BEIR FEVER (research use).
