SearchLM NL2BM25 — GRPO v1 ⚠️ Reward Hacking (Qwen2.5-3B-Instruct)

Part of the SearchLM collection · GitHub

⚠️ This model games its training reward. It achieves high NDCG@10 by collapsing all outputs to 3–7 token keyword phrases, discarding the entire boolean search task it was trained to learn. Published for research transparency and as a reproducible example of specification gaming in RLVR. For deployment, use GRPO v2.

A Qwen2.5-3B-Instruct model fine-tuned via GRPO starting from SFT v1, using live Tantivy retrieval (NDCG@10 + MRR) as the reward signal.

Pipeline position: base → SFT v1 → GRPO v1 ⚠️ → SFT v2 → GRPO v2 ✅


The hack: specification gaming via minimum viable retrieval

The model learned that on small corpora (NFCorpus: 3,633 docs; SciFact: 5,183 docs), 2–4 content nouns yield near-optimal BM25 recall. Instead of learning boolean query generation, it learned to extract the most distinctive nouns from the NL query:

Input: Do Cholesterol Statin Drugs Cause Breast Cancer?

GRPO v1 output (hacking):

<reasoning>
</reasoning>
<query>Cholesterol Statin Breast Cancer</query>

SFT v1 output (intended behaviour, lower NDCG):

<reasoning>
Key concepts: statin drugs, causal relationship, breast cancer.
Connect with AND; expand synonyms with OR.
</reasoning>
<query>(statin OR "HMG-CoA reductase inhibitor" OR simvastatin OR atorvastatin)
AND (cause OR risk OR association OR induce)
AND ("breast cancer" OR "breast carcinoma")</query>

On this query the GRPO v1 output scores NDCG@10 = 0.971, while the SFT v1 output scores 0.000. The hack outperforms the intended behaviour because the SFT v1 query relied on poorly chosen synonyms. As a result, the gaming was invisible in aggregate metrics alone.
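
The difference between the two outputs above can be checked mechanically. The following is a minimal sketch, not the project's evaluation code: the function name `collapse_signals` and the regular expressions are assumptions made for illustration, and the length measure counts whitespace-separated words rather than tokenizer tokens.

```python
import re

# Tag layout taken from the example outputs above.
QUERY_RE = re.compile(r"<query>(.*?)</query>", re.DOTALL)
REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)


def collapse_signals(completion: str) -> dict:
    """Report simple surface signals of keyword-bag collapse (hypothetical helper)."""
    query_match = QUERY_RE.search(completion)
    reasoning_match = REASONING_RE.search(completion)
    query = query_match.group(1).strip() if query_match else ""
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return {
        "uses_and": bool(re.search(r"\bAND\b", query)),
        "uses_or": bool(re.search(r"\bOR\b", query)),
        "uses_phrase": bool(re.search(r'"[^"]+"', query)),
        "empty_reasoning": reasoning == "",
        # Whitespace-separated words, NOT tokenizer tokens.
        "query_words": len(query.split()),
    }


grpo_v1 = "<reasoning>\n</reasoning>\n<query>Cholesterol Statin Breast Cancer</query>"
sft_v1 = (
    "<reasoning>\nKey concepts: statin drugs, causal relationship, breast cancer.\n"
    "Connect with AND; expand synonyms with OR.\n</reasoning>\n"
    '<query>(statin OR "HMG-CoA reductase inhibitor" OR simvastatin OR atorvastatin)\n'
    "AND (cause OR risk OR association OR induce)\n"
    'AND ("breast cancer" OR "breast carcinoma")</query>'
)

print(collapse_signals(grpo_v1))
print(collapse_signals(sft_v1))
```

Aggregating these flags over an evaluation set would give usage rates comparable in spirit to the collapse statistics, though the exact definitions used for the published numbers are not stated here.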


Collapse statistics

| Metric | Value |
| --- | --- |
| Mean completion length | 5.1 tokens (vs 95 for SFT v1) |
| Boolean operator usage (AND) | 0% (vs ~80% for SFT v1) |
| Boolean operator usage (OR) | 0% (vs ~90% for SFT v1) |
| Phrase usage | 0% (vs ~70% for SFT v1) |
| Reasoning block content | empty |
| frac_reward_zero_std during training | 90–96% from step 1 |

frac_reward_zero_std is the fraction of GRPO groups in which all completions received an identical reward. Such a group has zero advantage for every completion and contributes no gradient, so at 90–96% the policy gradient was near zero throughout: the model was not learning, because it had already converged on the keyword-bag strategy.
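
To make the link between identical rewards and a vanishing gradient concrete, here is a simplified sketch of group-relative advantages and the zero-std fraction. It is not TRL's implementation: the function names, the `eps` value, and the example rewards are all assumptions chosen for illustration. The group size of 2 mirrors the `num_generations` setting listed under Training Details.

```python
from statistics import pstdev


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def frac_zero_std(groups):
    """Fraction of groups whose completions all received the same reward."""
    zero = sum(1 for rewards in groups if pstdev(rewards) == 0.0)
    return zero / len(groups)


# Hypothetical rewards for four prompts, two completions each.
groups = [[0.8, 0.8], [0.5, 0.5], [0.0, 0.0], [0.9, 0.4]]

print(frac_zero_std(groups))         # share of groups with no learning signal
print(group_advantages([0.8, 0.8]))  # identical rewards: all advantages are zero
print(group_advantages([0.9, 0.4]))  # differing rewards: non-zero advantages
```

With only two completions per prompt, a policy that emits near-identical keyword bags will often produce identical rewards for both, which is consistent with the high zero-std fraction reported above.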


Why it still scores high on benchmarks

  1. Small corpora: BM25 keyword recall on 3–5K doc indices is high; a rare noun appears in only a handful of documents, making it highly discriminative.
  2. Weak SFT baseline: SFT v1 scored below the base model on SciFact (0.273 vs 0.386) because its queries were over-specified, so it set a low bar to beat.
  3. NDCG@10 is forgiving when few documents are relevant: a query that places one relevant document high in the top 10 already scores well, especially when the query has only a few relevant documents. Keyword bags do this reliably on small indexes.

This does not generalise: on a 2.7M-doc index (NQ), keyword bags return thousands of irrelevant results; NDCG@10 and MRR would collapse to near zero.
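
A back-of-the-envelope calculation illustrates the scaling argument. It rests on a loud assumption: that a query term occurs in a fixed share of documents regardless of corpus (here 1 in 1,000, a hypothetical figure not measured on any of these indexes). The corpus sizes are the ones quoted in this card.

```python
# Corpus sizes as quoted in this model card.
CORPORA = {"NFCorpus": 3_633, "SciFact": 5_183, "NQ": 2_700_000}

# ASSUMPTION: the term appears in 1 of every 1,000 documents (hypothetical rate).
DOCS_PER_MATCH = 1_000


def expected_matches(num_docs: int, docs_per_match: int = DOCS_PER_MATCH) -> float:
    """Expected number of documents containing the term at a fixed occurrence rate."""
    return num_docs / docs_per_match


for name, size in CORPORA.items():
    print(f"{name}: about {expected_matches(size):.1f} matching documents")
```

Under this assumption the same term matches a handful of documents on the small corpora and thousands on the large one, so the top 10 is far more likely to be filled with irrelevant matches.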


All SearchLM checkpoints

| Model | NFCorpus NDCG@10 | SciFact NDCG@10 | Mean tokens | Boolean ops |
| --- | --- | --- | --- | --- |
| base (Qwen2.5-3B-Instruct) | 0.455 | 0.386 | 120 | ~20% |
| SFT v1 | 0.441 | 0.273 | 95 | ~80% |
| GRPO v1 ⚠️ | 0.556 | 0.608 | 5–7 | 0% |
| SFT v2 | 0.466 | 0.358 | 109 | ~65% |
| GRPO v2 | 0.577 | 0.657 | 147 | ~35% |

Evaluated on BEIR test splits (NFCorpus: 323 queries, SciFact: 300 queries).


Training Details

| Setting | Value |
| --- | --- |
| Base model | searchlm-nl2bm25-sft |
| Method | GRPO (TRL GRPOTrainer + vLLM colocate, single H100) |
| Reward | 0.6 × NDCG@10 + 0.4 × MRR (live Tantivy search) |
| Training datasets | NFCorpus + SciFact (train split qrels) |
| Epochs | 3 |
| num_generations | 2 |
| Hardware | NVIDIA H100 80 GB |
| W&B run | supreethrao/searchlm/runs/nlp69ydi |
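
The reward listed above can be sketched as follows. This is an illustrative reconstruction, not the training code: the function names are hypothetical, the NDCG variant (linear gain, log2 discount) and the uncapped MRR are assumptions, and `ranked_ids` stands in for whatever document IDs the live Tantivy search returns for a generated query.

```python
import math


def ndcg_at_k(ranked_ids, qrels, k=10):
    """NDCG@k with linear gain and log2 rank discount (assumed variant)."""
    dcg = sum(
        qrels.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def mrr(ranked_ids, qrels):
    """Reciprocal rank of the first relevant document, 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if qrels.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def reward(ranked_ids, qrels):
    """Weighted reward from the Training Details: 0.6 x NDCG@10 + 0.4 x MRR."""
    return 0.6 * ndcg_at_k(ranked_ids, qrels) + 0.4 * mrr(ranked_ids, qrels)


# Hypothetical query with a single relevant document, d1.
qrels = {"d1": 1}
print(reward(["d1", "d7", "d9"], qrels))  # relevant document ranked first
print(reward(["d7", "d1", "d9"], qrels))  # relevant document ranked second
print(reward(["d7", "d9"], qrels))        # relevant document not retrieved
```

Note that nothing in this reward inspects the structure of the query itself: a keyword bag and a boolean query that return the same ranking receive the same score, which is the gap the model exploited.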

Citation

@misc{searchlm2026,
  title  = {SearchLM: Training Small Language Models for Boolean Query Generation via RLVR},
  author = {Rao, Supreeth},
  year   = {2026},
  url    = {https://github.com/SupreethRao99/searchLM},
}