We present MolRex, a reinforcement learning framework that combines Group Relative Policy Optimization (GRPO) with chain-of-thought fine-tuning of large language models (LLMs) to improve molecular structures through guided reasoning. MolRex trains models to propose chemically valid structural edits together with interpretable rationales, optimizing generations against a composite reward that combines synthesizability, drug-likeness, human-aligned molecular preferences, and format validity. Additional reward terms, such as reasoning brevity, are implemented but reserved for future integration; current training prioritizes chemically meaningful and syntactically well-formed outputs. By scoring candidate generations relative to one another rather than estimating absolute values, MolRex trains stably and avoids the complexity of a critic network. Experimental results show that MolRex improves molecular properties while providing transparent rationales, a promising step toward interpretable, reasoning-augmented molecular design.
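The sketch below illustrates the two ideas described above: a composite reward over a generated response and GRPO-style group-relative advantages, where each candidate is compared to its own sampling group instead of a critic's value estimate. The `<answer>` tag format, the reward values, and the helper names are assumptions for illustration; MolRex's full reward additionally includes a synthesizability term and a human-preference component.

```python
# Illustrative sketch only: the tag format, reward weights, and helpers below
# are assumptions, not MolRex's exact implementation.
import re
import numpy as np
from rdkit import Chem
from rdkit.Chem import QED

def composite_reward(completion: str) -> float:
    """Score one generated response: format validity plus drug-likeness."""
    # Format validity: expect the proposed molecule inside <answer>...</answer>
    # (assumed tag convention) as a SMILES string.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0                       # malformed response: no reward
    mol = Chem.MolFromSmiles(match.group(1).strip())
    if mol is None:
        return 0.1                       # well-formatted but chemically invalid
    # Drug-likeness via QED; a synthesizability score (e.g. RDKit's Contrib
    # sascorer) and a learned preference model would be added here.
    return 0.2 + QED.qed(mol)

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO-style advantages: normalize each reward against its own group,
    replacing the absolute value estimate a critic network would provide."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: one prompt, a group of four sampled completions.
group = [
    "<answer>CCO</answer>",
    "<answer>not-a-smiles</answer>",
    "<answer>CC(=O)Oc1ccccc1C(=O)O</answer>",
    "no tags here",
]
print(group_relative_advantages([composite_reward(c) for c in group]))
```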
Uploaded model
- Developed by: Xilabs
- License: apache-2.0
- Finetuned from model: unsloth/phi-4-bnb-4bit
This model was trained 2x faster with Unsloth and Hugging Face's TRL library.
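As a rough illustration of that stack, here is a minimal sketch of GRPO fine-tuning with Unsloth's 4-bit loading and TRL's `GRPOTrainer`, assuming a recent TRL release with GRPO support. The hyperparameters, prompt dataset, and reward wiring are placeholders, not the exact MolRex training configuration.

```python
# Minimal sketch, not the exact training recipe used for this checkpoint.
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

# Load the 4-bit base model listed in this card and attach LoRA adapters.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

def reward_fn(completions, **kwargs):
    # TRL expects one scalar per completion; plug in the composite reward
    # (synthesizability, drug-likeness, preferences, format validity) here,
    # e.g. the composite_reward sketch above.
    return [composite_reward(c) for c in completions]

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[reward_fn],
    args=GRPOConfig(
        output_dir="molrex-grpo",
        num_generations=4,            # group size for relative comparisons
        max_completion_length=512,
        learning_rate=5e-6,
    ),
    # Hypothetical JSONL file with a "prompt" column of molecule-editing tasks.
    train_dataset=load_dataset("json", data_files="prompts.jsonl")["train"],
)
trainer.train()
```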