SearchLM NL2BM25 — SFT v2 Quality-Filtered (Qwen2.5-3B-Instruct)

Part of the SearchLM collection · GitHub

A quality-filtered LoRA SFT warm-start. v2 keeps only training examples where the LLM-generated boolean query retrieved at least one relevant document in the top 10 (ndcg_at_10 > 0), eliminating the ~65% of v1's data that taught syntactically correct but semantically useless boolean structure.

This is the base model for GRPO v2, the best-performing SearchLM checkpoint.

Pipeline position: base → SFT v1 → GRPO v1 (⚠️) → SFT v2 → GRPO v2 ✅


Why quality filtering matters

SFT v1 trained on 4,999 examples, ~65% of which had ndcg_at_10 = 0 (only 1,751 retrieved a relevant document). These examples taught the model to produce complex-looking queries that retrieved nothing relevant. SciFact was hit hardest: SFT v1 dropped below base (0.273 vs 0.386), because scientific terminology requires precision and over-specified AND chains returned nothing.

Before (SFT v1 — query returns zero results):

<query>("ALDH1" OR "aldehyde dehydrogenase 1" OR "ALDH1A1")
AND ("breast cancer" OR "mammary carcinoma" OR "breast neoplasm")
AND (expression OR "gene expression" OR overexpression)
AND (outcome OR prognosis OR survival OR "disease-free survival")
AND (better OR improved OR favorable OR positive)</query>

After (SFT v2 — learned from working examples only):

<query>("ALDH1" OR "aldehyde dehydrogenase 1")
AND ("breast cancer" OR "breast neoplasm")
AND (expression OR overexpression)
AND (outcome OR prognosis OR survival)</query>

Fewer AND clauses → Tantivy returns documents → model receives training signal.
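The difference between the two queries above can be measured by counting top-level AND clauses. The following is a minimal sketch, not part of the SearchLM codebase: `split_top_level_and` is a hypothetical helper that assumes whitespace-separated operators and balanced quotes and parentheses.

```python
def split_top_level_and(query: str) -> list:
    """Split a boolean query on AND operators outside parentheses and quotes."""
    clauses, current = [], []
    depth, in_quote = 0, False
    for token in query.split():
        # Only an AND at nesting depth 0, outside a quoted phrase, separates clauses.
        if token == "AND" and depth == 0 and not in_quote:
            clauses.append(" ".join(current))
            current = []
            continue
        current.append(token)
        # Track quote and parenthesis state character by character.
        for ch in token:
            if ch == '"':
                in_quote = not in_quote
            elif not in_quote and ch == "(":
                depth += 1
            elif not in_quote and ch == ")":
                depth -= 1
    clauses.append(" ".join(current))
    return [c for c in clauses if c]


v1_query = '''("ALDH1" OR "aldehyde dehydrogenase 1" OR "ALDH1A1")
AND ("breast cancer" OR "mammary carcinoma" OR "breast neoplasm")
AND (expression OR "gene expression" OR overexpression)
AND (outcome OR prognosis OR survival OR "disease-free survival")
AND (better OR improved OR favorable OR positive)'''

v2_query = '''("ALDH1" OR "aldehyde dehydrogenase 1")
AND ("breast cancer" OR "breast neoplasm")
AND (expression OR overexpression)
AND (outcome OR prognosis OR survival)'''

print(len(split_top_level_and(v1_query)))  # 5
print(len(split_top_level_and(v2_query)))  # 4
```

Every additional AND clause is another condition a document must satisfy, so the five-clause query matches a subset of what the four-clause query matches.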


All SearchLM checkpoints

| Model | NFCorpus NDCG@10 | SciFact NDCG@10 | Mean tokens | Boolean ops |
|---|---|---|---|---|
| base (Qwen2.5-3B-Instruct) | 0.455 | 0.386 | 120 | ~20% |
| SFT v1 | 0.441 | 0.273 | 95 | ~80% |
| GRPO v1 ⚠️ | 0.556 | 0.608 | 5–7 | 0% |
| SFT v2 | 0.466 | 0.358 | 109 | ~65% |
| GRPO v2 | 0.577 | 0.657 | 147 | ~35% |

Evaluated on BEIR test splits (NFCorpus: 323 queries, SciFact: 300 queries).
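For readers unfamiliar with the metric, here is a minimal sketch of NDCG@10 for a single query. It assumes the common linear-gain formulation with a log2 rank discount; the project's actual evaluation code is not shown in this card and may differ in details such as tie handling. The function names and the `qrels` dictionary layout are illustrative.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    # Rank 1 gets discount log2(2) = 1, rank 2 gets log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: document ids in the order the search engine returned them.
    qrels: mapping of document id -> relevance grade (0 or missing = not relevant).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg


qrels = {"d1": 1, "d2": 1}
print(ndcg_at_k(["d1", "d2", "d9"], qrels))  # 1.0: both relevant docs ranked first
print(ndcg_at_k([], qrels))                  # 0.0: the query retrieved nothing
```

A query that returns no relevant document in its top 10 scores exactly 0, which is the condition the ndcg_at_10 > 0 filter excludes.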


SFT v1 vs SFT v2

| | SFT v1 | SFT v2 |
|---|---|---|
| Training examples | 4,999 | 1,751 (35% of v1) |
| Quality filter | all syntax-valid | ndcg_at_10 > 0 |
| NFCorpus NDCG@10 | 0.441 | 0.466 (+0.025) |
| SciFact NDCG@10 | 0.273 | 0.358 (+0.085) |
| Training time (A10G) | ~30 min | ~22 min |
| Final loss | ~0.23 | ~0.24 |

SciFact gained the most (+0.085) because it is where over-specification hurts most: its documents are matched through narrow scientific terminology, so a query with too many mandatory clauses easily excludes every relevant document.
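The quality filter itself is a one-line predicate. The sketch below applies it to a toy list of records. Only the `ndcg_at_10` field name comes from this card; the `nl_query` and `boolean_query` fields and the sample values are hypothetical placeholders, not rows from the actual dataset.

```python
def filter_by_quality(examples, metric="ndcg_at_10", threshold=0.0):
    """Keep only examples whose retrieval score is strictly above the threshold."""
    return [ex for ex in examples if ex.get(metric, 0.0) > threshold]


# Toy records for illustration only (not real dataset rows).
examples = [
    {"nl_query": "q1", "boolean_query": "(a OR b) AND c", "ndcg_at_10": 0.42},
    {"nl_query": "q2", "boolean_query": "a AND b AND c AND d AND e", "ndcg_at_10": 0.0},
    {"nl_query": "q3", "boolean_query": "x OR y", "ndcg_at_10": 0.08},
]

kept = filter_by_quality(examples)
print([ex["nl_query"] for ex in kept])  # ['q1', 'q3']

# Retained fraction reported in the table above: 1,751 of 4,999 examples.
print(f"{1751 / 4999:.1%}")  # 35.0%
```

The strict inequality matters: an example with a syntactically valid query but a score of exactly 0 is dropped.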


Training Details

| Setting | Value |
|---|---|
| Base model | Qwen/Qwen2.5-3B-Instruct |
| Method | LoRA SFT (r=16, α=32), adapter merged into base |
| Target modules | q/k/v/o projections + gate/up/down projections |
| Training data | Supreeth/nl2bm25-sft filtered: ndcg_at_10 > 0 |
| Retained / total | 1,751 / 4,999 (35%) |
| Epochs | 1 |
| Learning rate | 2e-4 (cosine decay, 5% warmup) |
| Effective batch size | 16 (2 × 8 grad accum) |
| Max sequence length | 1,024 tokens |
| Hardware | NVIDIA A10G 24 GB |
| Training time | ~22 min |
| Final loss | ~0.24 |
| Token accuracy | ~93.8% |
| W&B run | supreethrao/searchlm/runs/k00s9ype |
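As a rough sanity check on these settings, the sketch below derives the approximate number of optimizer steps and warmup steps. It assumes a single GPU and rounds the final partial batch up; the exact counts depend on how the training framework handles that last batch, so treat the results as estimates rather than values read from the training logs.

```python
import math

# Values taken from the training settings listed above.
num_examples = 1751
per_device_batch = 2
grad_accum_steps = 8
epochs = 1
warmup_ratio = 0.05

# Assumption: one GPU, so effective batch = micro-batch size x accumulation steps.
effective_batch = per_device_batch * grad_accum_steps

# Assumption: the last partial batch still triggers an optimizer step.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(effective_batch)  # 16
print(total_steps)      # 110
print(warmup_steps)     # 6
```

With only about 110 updates in the whole run, the 5% warmup covers just a handful of steps before cosine decay begins.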

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Supreeth/searchlm-nl2bm25-sft-v2",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Supreeth/searchlm-nl2bm25-sft-v2")

SYSTEM_PROMPT = """You are an expert information retrieval specialist. Convert the \
natural language query into a Tantivy boolean search query.

Output format (strictly follow this):
<reasoning>
Step-by-step concept extraction and synonym expansion.
</reasoning>
<query>your boolean query here</query>"""

nl_query = "effects of climate change on coral reef ecosystems"
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Convert to a Tantivy boolean search query:\n\n{nl_query}"},
]
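# Render the chat messages into the model's prompt format, then tokenize.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Sample a completion; the system prompt asks for <reasoning> followed by <query>.
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))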
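Because the model emits a reasoning block before the query, downstream code usually needs only the contents of the `<query>` tag. The following is a minimal sketch of that extraction step; `extract_query` is a hypothetical helper, not part of the SearchLM repository, and the sample completion is made up for illustration.

```python
import re
from typing import Optional

# DOTALL lets the query span several lines, as in multi-clause outputs.
QUERY_PATTERN = re.compile(r"<query>(.*?)</query>", re.DOTALL)


def extract_query(completion: str) -> Optional[str]:
    """Return the boolean query inside <query>...</query>, or None if absent."""
    match = QUERY_PATTERN.search(completion)
    if match is None:
        return None
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(match.group(1).split())


completion = """<reasoning>
Concepts: climate change, coral reefs.
</reasoning>
<query>("climate change" OR "global warming")
AND ("coral reef" OR corals)</query>"""

print(extract_query(completion))
# ("climate change" OR "global warming") AND ("coral reef" OR corals)
```

Returning `None` for malformed output lets the caller decide whether to retry generation or fall back to the raw natural-language query.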

Tantivy Boolean Syntax

Tantivy is a full-text search engine library. The model targets its query language:

| Construct | Syntax | Example |
|---|---|---|
| Single term | word | cancer |
| Exact phrase | "phrase" | "bone density" |
| AND | A AND B | vitamin AND calcium |
| OR | A OR B | cancer OR tumor OR malignancy |
| NOT | NOT A | NOT review |
| Grouping | (A OR B) | (cat OR feline) AND behavior |
| Field scope | field:term | title:"machine learning" |
| Boost | term^N | cancer^2 OR tumor |
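To show how these constructs compose, here is a small sketch that assembles a query string from groups of synonyms: terms inside a group are joined with OR, groups are joined with AND, and multi-word terms are quoted as exact phrases. `build_query` and `format_term` are hypothetical helpers written for this illustration; they do not escape quotes or other special characters inside terms.

```python
def format_term(term: str) -> str:
    """Quote multi-word terms so they are treated as exact phrases."""
    term = term.strip()
    return f'"{term}"' if " " in term else term


def build_query(concept_groups) -> str:
    """Join synonyms with OR inside each group and join groups with AND."""
    clauses = []
    for group in concept_groups:
        terms = [format_term(t) for t in group if t.strip()]
        if not terms:
            continue  # skip empty groups rather than emitting "()"
        clause = " OR ".join(terms)
        # Parentheses are only needed when a group has more than one term.
        clauses.append(f"({clause})" if len(terms) > 1 else clause)
    return " AND ".join(clauses)


print(build_query([["cat", "feline"], ["behavior"]]))
# (cat OR feline) AND behavior

print(build_query([["breast cancer", "breast neoplasm"], ["prognosis", "survival"]]))
# ("breast cancer" OR "breast neoplasm") AND (prognosis OR survival)
```

Keeping the number of groups small mirrors the lesson from the quality filter: each AND-joined group narrows the result set further.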

Citation

@misc{searchlm2026,
  title  = {SearchLM: Training Small Language Models for Boolean Query Generation via RLVR},
  author = {Rao, Supreeth},
  year   = {2026},
  url    = {https://github.com/SupreethRao99/searchLM},
}