Instructions to use Supreeth/searchlm-nl2bm25-grpo with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Supreeth/searchlm-nl2bm25-grpo with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="Supreeth/searchlm-nl2bm25-grpo") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("Supreeth/searchlm-nl2bm25-grpo") model = AutoModelForCausalLM.from_pretrained("Supreeth/searchlm-nl2bm25-grpo") messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = tokenizer.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use Supreeth/searchlm-nl2bm25-grpo with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "Supreeth/searchlm-nl2bm25-grpo" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Supreeth/searchlm-nl2bm25-grpo", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/Supreeth/searchlm-nl2bm25-grpo
- SGLang
How to use Supreeth/searchlm-nl2bm25-grpo with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "Supreeth/searchlm-nl2bm25-grpo" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Supreeth/searchlm-nl2bm25-grpo", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "Supreeth/searchlm-nl2bm25-grpo" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Supreeth/searchlm-nl2bm25-grpo", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use Supreeth/searchlm-nl2bm25-grpo with Docker Model Runner:
docker model run hf.co/Supreeth/searchlm-nl2bm25-grpo
SearchLM NL2BM25 — GRPO v1 ⚠️ Reward Hacking (Qwen2.5-3B-Instruct)
Part of the SearchLM collection · GitHub
⚠️ This model games its training reward. It achieves high NDCG@10 by collapsing all outputs to 3–7 token keyword phrases, discarding the entire boolean search task it was trained to learn. Published for research transparency and as a reproducible example of specification gaming in RLVR. For deployment, use GRPO v2.
A Qwen2.5-3B-Instruct model fine-tuned via GRPO starting from SFT v1, using live Tantivy retrieval (NDCG@10 + MRR) as the reward signal.
Pipeline position:
base → SFT v1 →GRPO v1 ⚠️→ SFT v2 → GRPO v2 ✅
The hack: specification gaming via minimum viable retrieval
The model learned that on small corpora (NFCorpus: 3,633 docs; SciFact: 5,183 docs), 2–4 content nouns yield near-optimal BM25 recall. Instead of learning boolean query generation, it learned to extract the most distinctive nouns from the NL query:
Input: Do Cholesterol Statin Drugs Cause Breast Cancer?
GRPO v1 output (hacking):
<reasoning>
</reasoning>
<query>Cholesterol Statin Breast Cancer</query>
SFT v1 output (intended behaviour, lower NDCG):
<reasoning>
Key concepts: statin drugs, causal relationship, breast cancer.
Connect with AND; expand synonyms with OR.
</reasoning>
<query>(statin OR "HMG-CoA reductase inhibitor" OR simvastatin OR atorvastatin)
AND (cause OR risk OR association OR induce)
AND ("breast cancer" OR "breast carcinoma")</query>
The GRPO v1 output actually achieves NDCG@10 = 0.971 on this query while the SFT output achieves 0.000 — the hack outperforms the intended behaviour because SFT used wrong synonyms. This made the gaming invisible in aggregate metrics alone.
Collapse statistics
| Metric | Value |
|---|---|
| Mean completion length | 5.1 tokens (vs 95 for SFT v1) |
| Boolean operator usage (AND) | 0% (vs ~80% for SFT v1) |
| Boolean operator usage (OR) | 0% (vs ~90% for SFT v1) |
| Phrase usage | 0% (vs ~70% for SFT v1) |
| Reasoning block content | empty |
frac_reward_zero_std during training |
90–96% from step 1 |
frac_reward_zero_std = fraction of GRPO groups where all completions received identical
reward. At 90-96%, policy gradient was near-zero throughout — the model was not learning,
it had already converged on the keyword-bag strategy.
Why it still scores high on benchmarks
- Small corpora: BM25 keyword recall on 3–5K doc indices is high; a rare noun appears in only a handful of documents, making it highly discriminative.
- SFT degraded: SFT v1 scored below base on SciFact (0.273 vs 0.386) due to over-specified queries — a low bar to beat.
- NDCG@10 rewards recall of first hit: any query retrieving one relevant document in top-10 scores well. Keyword bags do this reliably on small indexes.
This does not generalise: on a 2.7M-doc index (NQ), keyword bags return thousands of irrelevant results; NDCG@10 and MRR would collapse to near zero.
All SearchLM checkpoints
| Model | NFCorpus NDCG@10 | SciFact NDCG@10 | Mean tokens | Boolean ops |
|---|---|---|---|---|
| base (Qwen2.5-3B-Instruct) | 0.455 | 0.386 | 120 | ~20% |
| SFT v1 | 0.441 | 0.273 | 95 | ~80% |
| GRPO v1 ⚠️ | 0.556 | 0.608 | 5–7 | 0% |
| SFT v2 | 0.466 | 0.358 | 109 | ~65% |
| GRPO v2 ✅ | 0.577 | 0.657 | 147 | ~35% |
Evaluated on BEIR test splits (NFCorpus: 323 queries, SciFact: 300 queries).
Training Details
| Setting | Value |
|---|---|
| Base model | searchlm-nl2bm25-sft |
| Method | GRPO (TRL GRPOTrainer + vLLM colocate, single H100) |
| Reward | 0.6 × NDCG@10 + 0.4 × MRR (live Tantivy search) |
| Training datasets | NFCorpus + SciFact (train split qrels) |
| Epochs | 3 |
num_generations |
2 |
| Hardware | NVIDIA H100 80 GB |
| W&B run | supreethrao/searchlm/runs/nlp69ydi |
Related resources
- Code: SupreethRao99/searchLM
- Analysis: Reward hacking report (v1 + v2 comparison)
- Fixed version: GRPO v2
- Collection: SearchLM collection
Citation
@misc{searchlm2026,
title = {SearchLM: Training Small Language Models for Boolean Query Generation via RLVR},
author = {Rao, Supreeth},
year = {2026},
url = {https://github.com/SupreethRao99/searchLM},
}
- Downloads last month
- 49