LLM-Zero-Lite Experiments

A controlled comparison of continuous GRPO, fixed staged GRPO, and an LLM-controlled staged GRPO schedule on three-number Countdown using Qwen/Qwen3-1.7B with LoRA.
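
For readers unfamiliar with the task, the following is a minimal sketch of how a three-number Countdown answer could be verified. The exact rules and reward function used in these experiments are not stated here, so the sketch assumes the common variant: each given number is used exactly once, only +, -, * and / are allowed, and the expression must equal the target exactly. The name `check_countdown` is hypothetical and is not taken from this repository.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction

# Assumed rule set: only the four basic binary operators are allowed.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, used):
    """Evaluate a parsed expression exactly, recording every integer literal."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def check_countdown(expression, numbers, target):
    """Return True if `expression` uses each number once and equals `target`.

    Hypothetical helper: illustrates the assumed task rules only.
    """
    used = []
    try:
        tree = ast.parse(expression, mode="eval")
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == target
```

Exact rational arithmetic via `Fraction` avoids floating-point false negatives on answers that involve division, for example `check_countdown("6 / 3 + 4", [3, 4, 6], 6)`.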

Final 1,000-step results

Method	Greedy accuracy	Sampled pass@1	Sampled pass@4
Continuous GRPO	26.5%	31.0%	35.5%
Fixed staged GRPO	34.5%	34.5%	39.5%
LLM controller	36.5%	37.5%	40.5%
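
How the sampled pass@1 and pass@4 columns were computed is not specified here. As a rough sketch, one standard option is the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), averaged over problems, where n is the number of samples drawn per problem and c is the number of correct ones. The function names and the four-samples-per-problem setup below are assumptions for illustration, not details of this repository's evaluation code.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem with n samples, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems.

    `results` holds one list of booleans per problem, one entry per sample.
    """
    return sum(pass_at_k(len(r), sum(r), k) for r in results) / len(results)


# Illustrative data only (assumed 4 samples per problem), not the reported runs.
example = [
    [True, False, False, False],
    [False, False, False, False],
]
print(mean_pass_at_k(example, 1))  # 0.125
print(mean_pass_at_k(example, 4))  # 0.5
```

With n = 4, pass@1 reduces to the mean per-sample accuracy and pass@4 to the fraction of problems with at least one correct sample.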

The runs/ directory contains metrics, evaluation samples, configuration history, controller decisions, logs, plots, and all saved LoRA checkpoints.

Model tree for kishan51/llm-zero-lite-experiments

Base model: Qwen/Qwen3-1.7B-Base
Finetuned: Qwen/Qwen3-1.7B
Adapter: this model