FEVER Benchmark: Controlled Retrieval Evaluation
A frozen, reproducible benchmark for testing retrieval methods on the BEIR FEVER dataset.
Quick Start
from fever_benchmark import FEVERBenchmark

bench = FEVERBenchmark()

# Load the BM25 top-100 pool
pool = bench.load_pool("beir_pool.json")

# Re-rank with your method
my_rankings = {}
for qid, docs in pool.items():
    # docs is [(docid, bm25_score), ...]
    # Replace with your scores:
    my_rankings[qid] = [(docid, your_score(docid, qid))
                        for docid, _ in docs]
    # Sort descending by score
    my_rankings[qid].sort(key=lambda x: -x[1])

# Evaluate
results = bench.evaluate(my_rankings)
print(f"NDCG@10: {results['ndcg@10']:.4f}")

# Compare against baselines
bench.verify_reproduction({"Your Method": results["ndcg@10"]})
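The re-ranking loop above can be factored into a small helper and tried on a toy pool before running it on the full benchmark. This is a minimal sketch: `rerank_pool`, `toy_pool`, and `toy_score` are hypothetical names that are not part of `fever_benchmark.py`, and the pool layout is assumed to match the `[(docid, bm25_score), ...]` comment in the Quick Start.

```python
from typing import Callable, Dict, List, Tuple

# Assumed pool layout: {qid: [(docid, bm25_score), ...]}
Pool = Dict[str, List[Tuple[str, float]]]


def rerank_pool(pool: Pool, score_fn: Callable[[str, str, float], float]) -> Pool:
    """Re-score every pooled candidate and sort each query's list, best first."""
    rankings: Pool = {}
    for qid, docs in pool.items():
        scored = [(docid, score_fn(qid, docid, bm25)) for docid, bm25 in docs]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        rankings[qid] = scored
    return rankings


# Toy pool with made-up ids and scores, for illustration only.
toy_pool: Pool = {"q1": [("d1", 12.0), ("d2", 9.5), ("d3", 7.1)]}


def toy_score(qid: str, docid: str, bm25_score: float) -> float:
    """Hypothetical scorer that simply inverts the BM25 order."""
    return -bm25_score


toy_rankings = rerank_pool(toy_pool, toy_score)
print(toy_rankings["q1"])
```

Because the helper only ever re-orders documents that are already in the pool, every method is compared on the same candidate set, which is the point of a frozen pool.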
What's Included
| File | Description |
|---|---|
| beir_pool.json | BM25 top-100 pool (k1=1.2, b=0.75), 6,666 queries |
| fever_benchmark.py | Benchmark evaluation class |
| setup_fever_benchmark.py | Regenerate pool from scratch |
| query_deltas.csv | Per-query NDCG@10 for all Hadith variants |
| benchmark_manifest.md | Frozen configuration and checksums |
Baseline Scores
| System | NDCG@10 |
|---|---|
| BM25 (k1=1.2, b=0.75) | 0.5214 |
| MiniLM Dense | 0.6497 |
| Dense + Muttafaq (best Hadith) | 0.6461 |
To verify a reproduction, run your rankings through bench.evaluate() and compare the result against these numbers.
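For readers who want to sanity-check a single query by hand, NDCG@10 can be sketched as below. This uses the linear-gain form with a log2 rank discount; whether `bench.evaluate()` computes it exactly this way is an assumption, and `dcg_at_k`, `ndcg_at_k`, and the toy relevance judgments are hypothetical.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (rank 1 has discount 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_docids: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """NDCG@k for one query; documents missing from qrels count as non-relevant."""
    gains = [qrels.get(docid, 0) for docid in ranked_docids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Toy judgments (made up): d1 and d2 are relevant, d3 is not.
toy_qrels = {"d1": 1, "d2": 1}
perfect = ndcg_at_k(["d1", "d2", "d3"], toy_qrels)
shifted = ndcg_at_k(["d3", "d1", "d2"], toy_qrels)
print(f"perfect={perfect:.4f} shifted={shifted:.4f}")
```

A benchmark-level score such as those in the table is typically the mean of this per-query value over all queries.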
Setup from Scratch
python setup_fever_benchmark.py
This downloads BEIR FEVER (3.3 GB), builds the Pyserini index (about 6 min), and runs BM25 retrieval (about 7 min).
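To show what the k1=1.2 and b=0.75 parameters control in the retrieval step, here is a toy, pure-Python BM25 scorer. It is a textbook sketch with a Lucene-style idf, not the Pyserini/Lucene implementation that the setup script runs; `bm25_scores` and the tiny tokenized corpus are hypothetical.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query with textbook BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # b controls length normalization, k1 controls term-frequency saturation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up corpus, already lowercased and tokenized.
toy_docs = [
    ["vaccine", "causes", "autism"],
    ["the", "moon", "is", "made", "of", "cheese"],
    ["vaccine", "safety", "study", "vaccine"],
]
toy_scores = bm25_scores(["vaccine", "autism"], toy_docs)
print(toy_scores)
```

Raising b penalizes long documents more strongly, while raising k1 lets repeated query terms keep adding to the score for longer before saturating.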
Requirements
- Python 3.8+
- pyserini>=0.14.0
- Java 11+ (for Pyserini/Lucene)
License
Same as BEIR FEVER: research use.