NL2BM25: teaching Qwen2.5-3B to generate Tantivy boolean queries via SFT + GRPO. Covers reward hacking (GRPO v1) and the shaped-reward fix (GRPO v2).
Supreeth Rao
AI & ML interests
Reinforcement Learning, Large Language Models, Distributed Computing