Teaching an LLM to Manage Software Sprints: From Off-the-Shelf Inference to GRPO Fine-Tuning
A walkthrough of building a long-horizon RL agent for engineering project management, combining SFT warm-up, Group Relative Policy Optimisation, and a carefully engineered multi-component reward signal.
Introduction
We built an agent that manages a 60-day, 6-sprint software project by issuing task-assignment actions against a stateful environment. Off-the-shelf open-weight models routed through the HuggingFace Inference API achieved average episode scores of only about 0.25 to 0.28. We attributed this primarily to prompt-distribution mismatch and the inability to load LoRA adapters through the router. We fine-tuned Qwen2.5-1.5B-Instruct using a two-phase training regime: a supervised fine-tuning warm-up on rule-based trajectories followed by GRPO-driven reinforcement learning with a dual reward signal and curriculum over task difficulty. This document details the full architecture, every design decision, and the failure modes we diagnosed and corrected.
1. Background and Motivation
Applying language models as sequential decision-making agents (sometimes called LLM-as-policy) has gained significant traction following InstructGPT, ReAct, and more recently DeepSeek-R1. Most work evaluates models on short-horizon tasks (single-turn QA, coding completions, tool calls with one or two hops). Long-horizon planning under stochastic, instruction-rich environments remains substantially harder and comparatively under-explored.
Our target setting (managing a software engineering project for 60 consecutive steps, across six sprints, with a heterogeneous team, dynamic task dependencies, and a live instruction queue) sits squarely in this harder regime. At each step the agent must:
- Read a compact state observation: backlog priorities, developer availability, skill vectors, dependency DAG status, and an instruction queue with per-sprint targets.
- Emit a structured JSON action from a five-element action space: `assign`, `reassign`, `reprioritize`, `unblock`, or `skip`.
- Receive a per-step scalar reward from the environment, with additional sprint-boundary bonuses and a terminal project score.
The environment is served as a stateful REST API on a HuggingFace Space (sejal-k-ai-sprint-manager.hf.space), providing `/project/reset` and `/project/step` endpoints. Server-side state is authoritative: the agent cannot fake task completions or manipulate reward computation client-side.
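The interaction loop implied by these two endpoints can be sketched as follows. This is a minimal illustration rather than the project's client: the `post` callable, the payload key `task`, and the response fields `observation`, `reward`, and `done` are all assumptions, and only the endpoint paths come from the text.

```python
from typing import Any, Callable, Dict

Json = Dict[str, Any]
# Hypothetical transport: post(path, payload) -> decoded JSON response.
PostFn = Callable[[str, Json], Json]


def run_episode(post: PostFn, policy: Callable[[Json], Json],
                task: str, max_steps: int = 60) -> float:
    """Roll out one episode and return the summed per-step reward."""
    # The field names "task", "observation", "reward", "done" are assumed.
    obs = post("/project/reset", {"task": task})["observation"]
    total = 0.0
    for _ in range(max_steps):
        action = policy(obs)                    # e.g. {"action_type": "skip"}
        result = post("/project/step", action)  # the server applies the action
        total += float(result["reward"])
        obs = result["observation"]
        if result.get("done"):
            break
    return total


def fake_post(path: str, payload: Json) -> Json:
    """In-memory stand-in for the Space, used only to exercise the loop."""
    if path == "/project/reset":
        fake_post.day = 0
        return {"observation": {"day": 0}}
    fake_post.day += 1
    reward = -0.05 if payload.get("action_type") == "skip" else 1.0
    return {"observation": {"day": fake_post.day}, "reward": reward,
            "done": fake_post.day >= 3}


score = run_episode(fake_post, lambda obs: {"action_type": "skip"}, "project_easy")
```

In a real run, `post` would wrap an HTTP client pointed at the Space URL; keeping the transport injectable makes the loop easy to test offline.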
2. Phase 0: Inference Against Off-the-Shelf Models
Before committing to fine-tuning, we ran the full three-scenario evaluation (project_easy, project_medium, project_hard) using three open-weight instruction-tuned models served through the HuggingFace Inference Router.
2.1 Experimental Setup
Models evaluated:
- `meta-llama/Llama-3.1-8B-Instruct`
- `Qwen/Qwen2.5-7B-Instruct`
- `mistralai/Mistral-7B-Instruct-v0.3`
All were prompted with a zero-shot system prompt describing the action schema and business rules, and a per-step user prompt encoding the observation as a compact ASCII table.
2.2 Results
| Model | project_easy | project_medium | project_hard | Average |
|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 0.31 | 0.24 | 0.19 | 0.25 |
| Qwen2.5-7B-Instruct | 0.34 | 0.27 | 0.22 | 0.28 |
| Mistral-7B-Instruct-v0.3 | 0.29 | 0.21 | 0.17 | 0.22 |
Average episode score across all models: ≈0.25, substantially below what a purely rule-based heuristic achieves on this environment (≈0.48). The models exhibited two consistent failure modes:
Failure mode 1: repeated invalid assignments. The model would repeatedly attempt to assign an already-in-progress task, receiving a −0.15 penalty each step but never updating its strategy. Without explicit episodic memory in the prompt, the model had no mechanism to learn from immediate negative feedback within a single episode. With a 60-step horizon and no gradient updates during inference, this produced long streaks of identical penalised actions.
Failure mode 2: skip accumulation from instruction confusion. When the active instruction queue contained contradictory or partially-satisfiable instructions (e.g., "prioritise AUTH tasks" while all AUTH tasks have unmet dependencies), models would default to skip rather than resolving the ambiguity. With no penalty signal severe enough to discourage this behaviour at inference time, skip rates above 40% were common.
2.3 Root Cause Analysis
Beyond these behavioural failures, we identified a deeper infrastructure issue: the HuggingFace Router serves only fully-merged weight checkpoints. Any fine-tuned adapter stored as a LoRA delta must be merged with its base model before the router can serve it. This meant that even after an initial fine-tuning attempt uploaded to the Hub as a PEFT adapter, the router continued serving the base model, producing scores indistinguishable from the zero-shot baseline. Every inference call was silently going to the pre-trained model, not the fine-tuned one.
The fix required loading the adapter locally using Unsloth's FastLanguageModel.from_pretrained, applying the adapter weights directly, and running inference on-device rather than routing through the API. We implemented a two-path loader in inference_r2.py: Unsloth 4-bit fast-inference path (primary), falling back to PEFT + bitsandbytes for environments where Unsloth is unavailable.
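A two-path loader of this kind might look like the sketch below. It is an illustration under stated assumptions, not the contents of `inference_r2.py`: the function names, the `max_seq_length` default, and the choice to pass the adapter repository directly to Unsloth are ours, and the heavy imports are deferred so that a missing library only fails the path that needs it.

```python
from typing import Any, Callable, Sequence, Tuple

Loader = Callable[[], Tuple[Any, Any]]  # returns (model, tokenizer)


def load_unsloth(adapter_id: str, max_seq_length: int = 2048):
    """Primary path: Unsloth 4-bit loading of a saved adapter (assumed usage)."""
    from unsloth import FastLanguageModel  # lazy import; may be unavailable
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=adapter_id, max_seq_length=max_seq_length, load_in_4bit=True)
    FastLanguageModel.for_inference(model)  # enable the fast generation path
    return model, tokenizer


def load_peft(base_id: str, adapter_id: str):
    """Fallback path: 4-bit NF4 base via bitsandbytes, then attach the adapter."""
    from peft import PeftModel
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig)
    quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
    base = AutoModelForCausalLM.from_pretrained(
        base_id, quantization_config=quant, device_map="auto")
    model = PeftModel.from_pretrained(base, adapter_id)
    return model.eval(), AutoTokenizer.from_pretrained(base_id)


def load_with_fallback(loaders: Sequence[Loader]):
    """Try each loader in order and return the first result that succeeds."""
    errors = []
    for loader in loaders:
        try:
            return loader()
        except Exception as exc:  # ImportError, CUDA errors, missing files, ...
            errors.append(f"{type(exc).__name__}: {exc}")
    raise RuntimeError("all loaders failed: " + " | ".join(errors))


# Intended usage (identifiers are placeholders):
# model, tok = load_with_fallback([
#     lambda: load_unsloth(ADAPTER_ID),
#     lambda: load_peft(BASE_ID, ADAPTER_ID),
# ])
```

Collecting the per-path error messages matters here: the original failure was silent, so a loader that reports why each path was skipped makes it easier to confirm which weights are actually being served.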
3. Fine-Tuning Architecture
The inadequacy of off-the-shelf models motivated a full fine-tuning pipeline. We selected Qwen2.5-1.5B-Instruct as the backbone: small enough to fit on a T4 (16 GB) under 4-bit quantisation, yet sufficiently capable to learn structured JSON generation with semantic constraints.
3.1 Model and Adapter Configuration
We use QLoRA (Dettmers et al., 2023) with the following configuration:
LoRA rank r = 16
LoRA alpha α = 32 # scaling factor = α/r = 2.0
LoRA dropout = 0.0 # required for Unsloth fast-kernel path
Quantisation = 4-bit NF4
Target modules = [q_proj, k_proj, v_proj, o_proj,
gate_proj, up_proj, down_proj]
Gradient checkpoint = "unsloth" # custom activation checkpointing
Targeting all seven projection matrices (attention and MLP) gives the adapter access to both the attention mechanism and the feed-forward sublayer. With r=16 and the full complement of target modules, trainable parameters total roughly 18.5 M out of 1.5 B, approximately 1.2% of total parameters.
Unsloth's fast-patching mechanism rewrites LoRA matrix multiplications as fused CUDA kernels, providing roughly 2× training throughput versus naive PEFT on a T4.
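The trainable-parameter count can be sanity-checked by hand, since a rank-r adapter on a matrix with d_in inputs and d_out outputs adds r·(d_in + d_out) parameters. The sketch below assumes the layer dimensions of Qwen2.5-1.5B as we understand them from the public model config (hidden size 1536, MLP size 8960, 28 layers, 2 key/value heads of dimension 128); those numbers are not stated in this article and should be checked against `config.json`.

```python
# Assumed Qwen2.5-1.5B dimensions; verify against the model's config.json.
HIDDEN = 1536
INTERMEDIATE = 8960
KV_DIM = 2 * 128  # 2 key/value heads x 128 head dim (grouped-query attention)
LAYERS = 28
RANK = 16

# (d_in, d_out) of each adapted projection inside one decoder layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int = RANK) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) for each adapted matrix."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out) for d_in, d_out in SHAPES.values())
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumed dimensions the total is 18,464,768, about 1.2% of 1.5 B; a different rank, target-module list, or layer shape changes the figure proportionally.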
3.2 Phase 1 β SFT Warm-Up
Cold-starting GRPO on an instruction-tuned base model is problematic: if all num_generations completions per prompt receive similar rewards, the group-normalised advantage is zero and no gradient flows. We observed this empirically: reward standard deviation within early GRPO groups was near zero because the base model consistently output malformed JSON, and all four generations received the same neutral fallback reward.
We address this with a supervised fine-tuning warm-up identical in spirit to the SFT phase described in InstructGPT (Ouyang et al., 2022) and later adopted in DeepSeek-R1 (DeepSeek-AI, 2025). The warm-up teaches the model the output format, not the optimal policy, before reward-driven exploration begins.
The SFT dataset consists of (observation, action) pairs where observations are sampled from the middle of rollouts (steps 3 to 8, avoiding the trivially easy step-0 full-backlog state) and actions are produced by a deterministic rule-based policy (smart_fallback). The rule-based policy is not optimal, but it reliably generates valid JSON, which is sufficient to seed the format distribution.
SFT_CONFIG = {
    "num_train_epochs": 2,
    "learning_rate": 2e-5,  # higher than GRPO: supervised signal is dense
    "per_device_train_batch_size": 2,
    "warmup_steps": 5,
}
After two epochs of SFT warm-up, the model outputs valid JSON with the correct schema on roughly 90% of prompts, which is sufficient for GRPO to observe meaningful reward variance across its generation groups.
3.3 Phase 2 β GRPO Fine-Tuning
We use Group Relative Policy Optimisation (Shao et al., 2024) as implemented in TRL's GRPOTrainer (introduced in TRL v0.14; the DAPO loss option we use requires a later release). GRPO eliminates the need for a learned value network by computing advantages relative to the mean reward within each generation group, reducing GPU memory overhead by approximately 40% compared to PPO, which is the difference between fitting and OOM-ing on a T4.
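The GRPO objective (Shao et al., 2024) is:

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\left( \rho_{i,t}\, \hat{A}_{i,t},\ \text{clip}\left(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_{i,t} \right) - \beta\, D_{\text{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] \right) \right]
$$

where $\rho_{i,t} = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})$ is the token-level probability ratio for completion $o_i$ of prompt $q$, $\epsilon$ is the clipping range, and the group-relative advantage $\hat{A}_{i,t}$, shared by every token $t$ of completion $o_i$, is computed from the completion rewards $r_1, \dots, r_G$ as:

$$
\hat{A}_{i,t} = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\})}
$$

with $G = 4$ generations per prompt. The KL divergence penalty (coefficient $\beta = 0.04$) is computed against the frozen reference model (the base Qwen2.5-1.5B checkpoint), serving as an anti-reward-hacking regulariser that discourages the policy from exploiting degenerate strategies like outputting skip on every step.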
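The group-relative advantage, and the collapse that occurs when every completion in a group earns the same reward, can be illustrated with a few lines of plain Python. This is a minimal sketch: it uses the population standard deviation and a small `eps` guard, both of which are our choices here and may differ from what a given trainer does internally.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalise each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Homogeneous group: every completion got the same fallback reward.
flat = group_advantages([0.5, 0.5, 0.5, 0.5])

# Heterogeneous group: one clearly better and one clearly worse completion.
mixed = group_advantages([0.2, 0.8, 0.5, 0.5])

print(flat)   # every advantage is zero, so the policy-gradient term vanishes
print(mixed)  # negative for the worst completion, positive for the best
```

The homogeneous case is exactly the cold-start situation: with all advantages at zero, only the KL term contributes to the update.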
We use the DAPO loss (loss_type="dapo"), which normalises by total active tokens in the accumulated micro-batch rather than by sequence count. This removes the response-length bias present in the original GRPO formulation, where shorter completions with positive advantages were over-weighted.
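The difference between the two normalisations is easiest to see on a toy batch. The sketch below compares a per-sequence average (the original GRPO form) with a single average over all tokens (the DAPO-style form); the per-token loss values are made up for illustration and the functions are not TRL's implementation.

```python
from typing import List


def sequence_mean_loss(per_token: List[List[float]]) -> float:
    """Average within each completion first, then across completions."""
    return sum(sum(seq) / len(seq) for seq in per_token) / len(per_token)


def token_mean_loss(per_token: List[List[float]]) -> float:
    """One average over every active token in the batch."""
    n_tokens = sum(len(seq) for seq in per_token)
    return sum(sum(seq) for seq in per_token) / n_tokens


# Illustrative values: a 2-token completion with per-token loss 1.0 and
# an 8-token completion with per-token loss 0.0.
batch = [[1.0, 1.0], [0.0] * 8]

seq_level = sequence_mean_loss(batch)   # short completion carries half the weight
tok_level = token_mean_loss(batch)      # short completion carries 2 of 10 tokens
```

Here each token of the short completion contributes a weight of 1/4 under sequence-level averaging but 1/10 under token-level averaging, which is the length bias the DAPO normalisation removes.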
Additional GRPO configuration:
GRPO_CONFIG = {
    "learning_rate": 5e-6,
    "per_device_train_batch_size": 1,  # T4-safe
    "gradient_accumulation_steps": 8,  # effective batch = 8 = 2 × num_generations ✓
    "num_generations": 4,
    "temperature": 1.0,  # diversity during training
    "max_completion_length": 96,  # action JSON ≤ 80 tokens; 16-token margin
    "mask_truncated_completions": True,  # DAPO paper recommendation
    "scale_rewards": "group",
    "disable_dropout": True,  # stabilises KL estimates
    "log_completions": True,
}
3.4 Curriculum Learning
We adopt a two-stage curriculum over task complexity, alternating R1 (single-sprint, 10-step) and R2 (multi-sprint, 60-step) episodes in a 2:2 ratio per generation group. R1 tasks provide a denser reward signal early in training: a 10-step episode reaches a terminal state quickly, giving the policy gradient clear feedback on every action sequence. R2 tasks develop long-horizon planning capability but are sample-inefficient when introduced cold. Mixing in a 2:2 ratio ensures the model sees R2 complexity from the start while anchoring on R1's denser feedback.
4. Reward Design
Reward design is arguably the most consequential implementation decision in an applied RL system. We implement a dual reward architecture with several explicit anti-reward-hacking measures.
4.1 Primary Reward: Environment Signal
The environment returns a per-step scalar step_r encoding the consequence of the last action:
| Action outcome | Reward |
|---|---|
| Task assigned, completed by deadline | +1.5 to +2.0 |
| Skill match bonus | +0.5 to +1.0 |
| Skip (opportunity cost) | −0.05 |
| Assign rejected (task in_progress / blocked) | −0.1 to −0.15 |
| High-priority task missed at sprint boundary | −2.0 to −2.5 |
For R1 episodes, the normalised reward is:

$$
r_{\text{norm}} = \text{clip}\left(\frac{r + 3}{5},\ 0,\ 1\right)
$$

where $r$ is the per-step environment reward step_r. The shift of +3.0 places the skip case ($-0.05 + 3.0 = 2.95,\ \text{normalised} \approx 0.59$) above the 0.5 neutral point, which we identified as a training pathology. The model learned that skip was "safe" because its normalised reward exceeded the true neutral. We corrected this by subtracting an additional 0.20 from the normalised reward whenever the agent skips with non-empty backlog and available developers, landing the effective skip reward at ≈0.39 (clearly below neutral).
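The R1 normalisation and the skip correction can be written out directly, which also makes it easy to plug in typical rewards and inspect where each action lands. This is a minimal sketch: the function and parameter names are ours, and the result is left unclamped after the 0.20 subtraction because the text does not say otherwise.

```python
def normalise_r1(step_r: float, skipped: bool = False,
                 backlog_nonempty: bool = False,
                 devs_available: bool = False) -> float:
    """clip((r + 3) / 5, 0, 1), minus 0.20 for an avoidable skip."""
    norm = min(1.0, max(0.0, (step_r + 3.0) / 5.0))
    if skipped and backlog_nonempty and devs_available:
        norm -= 0.20  # skip penalty described in the text
    return norm


# Typical per-step rewards taken from the environment table.
plain_skip = normalise_r1(-0.05)                              # about 0.59
penalised_skip = normalise_r1(-0.05, skipped=True,
                              backlog_nonempty=True,
                              devs_available=True)            # about 0.39
rejected_assign = normalise_r1(-0.15)                         # about 0.57
good_assign = normalise_r1(1.5)                               # about 0.90
```

With the penalty in place an avoidable skip ranks below even a rejected assign, whereas without it skip sits above the rejected assign and above the 0.5 midpoint.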
For R2 episodes, the combined reward adds an instruction-following auxiliary signal, inst_score, to the environment reward.
inst_score is a server-computed running average over the episode: 1.0 if the agent has consistently acted on active instructions, 0.0 if it has consistently ignored them. Using it as a 0.4-weighted auxiliary reward makes every step in an R2 episode instruction-aware, even when the terminal project score (the sparse signal) is many steps away. Without this term, GRPO learns to ignore the instruction queue: the sparse terminal bonus is too delayed to propagate a gradient back to early-episode instruction handling.
4.2 Secondary Reward: Format Signal
We implement a second reward function that provides a dense, always-available signal for JSON schema validity:
| Completion quality | Format reward |
|---|---|
| Valid JSON, correct action_type, all required fields | +0.30 |
| Valid JSON, correct action_type, missing fields | +0.10 |
| Valid JSON but action_type = skip | +0.10 |
| Not valid JSON | 0.00 |
Crucially, skip receives only 0.10 (the same as structurally incomplete JSON) rather than the 0.30 awarded to a fully valid assign. The format signal thus actively discourages skip relative to a well-formed assignment action, providing a gradient against skip-spamming even before the environment reward is observed.
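A format reward with these four tiers could be sketched as below. The required-field sets (`task_id`, `developer_id`, `priority`) are hypothetical, since the action schema is not spelled out here, and returning 0.0 for valid JSON with an unrecognised `action_type` is our assumption because the table does not cover that case.

```python
import json

# Hypothetical required fields per action type; the real schema may differ.
REQUIRED_FIELDS = {
    "assign": {"task_id", "developer_id"},
    "reassign": {"task_id", "developer_id"},
    "reprioritize": {"task_id", "priority"},
    "unblock": {"task_id"},
}


def format_reward(completion: str) -> float:
    """Score a completion on JSON validity and schema completeness only."""
    try:
        action = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0                      # not valid JSON
    if not isinstance(action, dict):
        return 0.0                      # valid JSON, but not an action object
    action_type = action.get("action_type")
    if action_type == "skip":
        return 0.10                     # deliberately no better than incomplete JSON
    if not isinstance(action_type, str) or action_type not in REQUIRED_FIELDS:
        return 0.0                      # assumption: unknown action types earn nothing
    missing = REQUIRED_FIELDS[action_type] - set(action)
    return 0.10 if missing else 0.30
```

For example, `format_reward('{"action_type": "skip"}')` and an `assign` without a developer both score 0.10, while a complete `assign` scores 0.30.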
The dual reward is combined with weights:
4.3 Anti-Reward-Hacking Measures
Beyond reward normalisation and the format signal, we implement the following hardened defences:
- KL divergence penalty ($\beta = 0.04$): prevents the policy from drifting to degenerate strategies absent from the reference model's support.
- Episode reset before each reward evaluation: the environment is reset to a fixed seed before every GRPO reward call, preventing carry-over state from biasing reward estimates across generations.
- Server-side state authority: task completion requires server-tracked effort countdown; `instruction_following_score` is computed server-side from the actual action sequence; tech debt is permanent and server-authoritative. None of these can be manipulated client-side.
- Tech debt permanent drag: each missed sprint task permanently reduces an affected developer's productivity by 2%. Short-horizon hacking (rush low-effort tasks for early reward) is self-defeating: the compounding productivity penalty makes it worse than disciplined sprint planning over the full 60-day horizon.
- Reward clamping to [0.01, 0.99]: the agent cannot achieve a 1.0 reward on any single action, removing trivial reward-maximisation shortcuts.
5. Inference Pipeline
The inference pipeline (inference_r2.py) implements the evaluation loop used at test time. It is deliberately decoupled from the training code: the system prompt and user prompt format are maintained as shared constants, and any drift between training and inference distributions is treated as a first-class bug.
5.1 Prompt Architecture
Each step constructs a user prompt with the following sections:
D{day}/60 S{sprint}/6 {days_left}d done={N} miss={N} inst={score:.2f} debt={N}
⚡FOLLOW: [I01] <instruction text[:50]> | [I02] <instruction text[:50]>
BACKLOG(✓=deps_ok): [T04]P1 back ✓ D9 [T08]P1 back ✓ D12 [T06]P1 back ✓ D18 +21
IN_PROG: [T01]→dev1 [T02]→dev2 [T05]→dev4
DEVS(avail): [dev1]Alic(bac) 5/5 [dev3]Caro(dev) 0/5 [dev5]Eve(bac) 0/5
EPISODE_MEMORY:
NO_REASSIGN_UNTIL_BACKLOG: T01 T02 T05
ASSIGNED_OK_THIS_EP: T01 T04
ASSIGN_ALREADY_TRIED: T08
JSON:
The EPISODE_MEMORY block is inference-only: it encodes within-episode state that the model cannot derive from the current observation alone. This is not in the training prompts, so it acts as a soft corrective signal rather than a learned dependency.
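One way to maintain and render such a block is sketched below. The class, its method names, and the exact meaning attached to each of the three lists are assumptions inferred from the labels in the prompt example, not the project's implementation.

```python
from dataclasses import dataclass, field
from typing import List


def _add_once(items: List[str], task_id: str) -> None:
    """Append task_id while preserving first-seen order and uniqueness."""
    if task_id not in items:
        items.append(task_id)


@dataclass
class EpisodeMemory:
    no_reassign: List[str] = field(default_factory=list)    # seen in progress
    assigned_ok: List[str] = field(default_factory=list)    # assigns accepted
    already_tried: List[str] = field(default_factory=list)  # assigns rejected

    def note_in_progress(self, task_id: str) -> None:
        _add_once(self.no_reassign, task_id)

    def note_assign(self, task_id: str, accepted: bool) -> None:
        _add_once(self.assigned_ok if accepted else self.already_tried, task_id)

    def render(self) -> str:
        """Produce the EPISODE_MEMORY section, omitting empty lists."""
        lines = ["EPISODE_MEMORY:"]
        sections = (("NO_REASSIGN_UNTIL_BACKLOG", self.no_reassign),
                    ("ASSIGNED_OK_THIS_EP", self.assigned_ok),
                    ("ASSIGN_ALREADY_TRIED", self.already_tried))
        for label, items in sections:
            if items:
                lines.append(f"{label}: {' '.join(items)}")
        return "\n".join(lines)


memory = EpisodeMemory()
for tid in ("T01", "T02", "T05"):
    memory.note_in_progress(tid)
memory.note_assign("T01", accepted=True)
memory.note_assign("T04", accepted=True)
memory.note_assign("T08", accepted=False)
block = memory.render()
```

Rendering from a small state object, rather than concatenating strings inside the episode loop, keeps the block's format in one place so it can be changed or disabled without touching the loop.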
5.2 Action Validation Gate
Before any LLM-proposed action is submitted to the environment, it passes through validate_llm_action, a hard-gate rule checker that enforces:
- `assign` requires `status=backlog`, met dependency DAG, available developer with matching or fullstack skill, and sufficient remaining capacity.
- `reassign` requires `status ∈ {backlog, in_progress}`, a crucial fix from the baseline, where `reassign` was erroneously gated behind `status=backlog` only, causing every LLM reassign suggestion to be silently rejected.
- `unblock` requires `status=blocked` and all dependencies in `done`.
- `reprioritize` requires a valid priority integer in [1, 5].

Actions failing validation are not submitted to the environment; instead, smart_fallback is invoked.
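The rules above can be expressed as a single predicate. The sketch below is an illustration only: the dictionary layout for tasks and developers, the field names, treating `skip` as always valid, and applying the developer checks to `reassign` as well as `assign` are all assumptions.

```python
from typing import Any, Dict

Task = Dict[str, Any]  # assumed keys: status, deps, skill, effort
Dev = Dict[str, Any]   # assumed keys: skill, available, capacity


def validate_llm_action(action: Dict[str, Any], tasks: Dict[str, Task],
                        devs: Dict[str, Dev]) -> bool:
    """Return True only if the action satisfies the hard-gate rules."""
    kind = action.get("action_type")
    if kind == "skip":
        return True  # assumption: skip never needs validation
    task = tasks.get(action.get("task_id"))
    if task is None:
        return False
    deps_done = all(tasks.get(d, {}).get("status") == "done"
                    for d in task["deps"])
    if kind == "unblock":
        return task["status"] == "blocked" and deps_done
    if kind == "reprioritize":
        priority = action.get("priority")
        return type(priority) is int and 1 <= priority <= 5
    if kind in ("assign", "reassign"):
        # reassign also accepts in_progress tasks (the fix described above).
        allowed = {"backlog"} if kind == "assign" else {"backlog", "in_progress"}
        dev = devs.get(action.get("developer_id"))
        if dev is None or task["status"] not in allowed:
            return False
        skill_ok = dev["skill"] in (task["skill"], "fullstack")
        return bool(deps_done and dev["available"] and skill_ok
                    and dev["capacity"] >= task["effort"])
    return False  # unknown action types are rejected
```

A caller would submit the action only when this returns `True` and otherwise hand control to the fallback policy.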
5.3 Smart Fallback Policy
The rule-based fallback policy implements a three-tier priority hierarchy:
- Unblock: any blocked task whose dependency DAG is fully satisfied is immediately unblocked. This is always correct and never penalised.
- Reprioritize: any backlog task referenced in an active instruction with priority > 2 is reprioritised to P1 before assignment. This improves `inst_score` without consuming an assignment slot.
- Assign: the highest-priority backlog task with met dependencies and a skill-matched available developer is assigned. A critical fix here is the `recently_failed` exclusion set: tasks that were attempted but not confirmed as `in_progress` (i.e., server rejected the assign) are excluded from the first-pass candidate list. Without this, the fallback would retry the same rejected task on every step, producing the T15-four-times and T23-eight-times penalty loops visible in the baseline logs.
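The three tiers can be sketched as a single function that returns the first applicable action. The data layout, the `instructed_ids` argument, and the final `skip` when nothing applies are assumptions, and only the first pass (with `recently_failed` excluded) is shown.

```python
from typing import Any, Dict, Iterable, Set

Task = Dict[str, Any]  # assumed keys: status, deps, skill, priority
Dev = Dict[str, Any]   # assumed keys: skill, available


def smart_fallback(tasks: Dict[str, Task], devs: Dict[str, Dev],
                   instructed_ids: Iterable[str],
                   recently_failed: Set[str]) -> Dict[str, Any]:
    """Pick an action by tier: unblock, then reprioritize, then assign."""
    def deps_done(task: Task) -> bool:
        return all(tasks.get(d, {}).get("status") == "done"
                   for d in task["deps"])

    # Tier 1: unblock any blocked task whose dependencies are all done.
    for tid, task in tasks.items():
        if task["status"] == "blocked" and deps_done(task):
            return {"action_type": "unblock", "task_id": tid}

    # Tier 2: raise instruction-referenced backlog tasks to P1.
    for tid in instructed_ids:
        task = tasks.get(tid)
        if task and task["status"] == "backlog" and task["priority"] > 2:
            return {"action_type": "reprioritize", "task_id": tid, "priority": 1}

    # Tier 3: assign the best backlog task, skipping recently rejected ones.
    candidates = sorted(
        (tid for tid, t in tasks.items()
         if t["status"] == "backlog" and deps_done(t)
         and tid not in recently_failed),
        key=lambda tid: tasks[tid]["priority"])
    for tid in candidates:
        for dev_id, dev in devs.items():
            if dev["available"] and dev["skill"] in (tasks[tid]["skill"], "fullstack"):
                return {"action_type": "assign", "task_id": tid,
                        "developer_id": dev_id}

    return {"action_type": "skip"}  # nothing actionable in this sketch
```

The caller is expected to add a task to `recently_failed` whenever an assign is sent but the task does not show up as `in_progress` in the next observation.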
5.4 Adaptive LLM Cooldown
If the LLM produces invalid actions on three consecutive steps, it is placed in a 15-step cooldown, during which the fallback policy takes over. This prevents a stuck LLM from wasting the entire episode issuing penalised bad assigns while the fallback could have made forward progress. The cooldown is lifted after 15 steps, giving the model another chance: its contextual understanding of the episode state may improve as more observations accumulate.
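The cooldown logic amounts to a small counter. The sketch below uses the thresholds from the text (three consecutive invalid actions, 15 steps of cooldown); the class and method names are ours.

```python
class LLMCooldown:
    """Bench the LLM for a fixed number of steps after repeated invalid actions."""

    def __init__(self, max_invalid: int = 3, cooldown_steps: int = 15) -> None:
        self.max_invalid = max_invalid
        self.cooldown_steps = cooldown_steps
        self.invalid_streak = 0
        self.remaining = 0

    def use_llm(self) -> bool:
        """Call once per step; False means the fallback policy should act."""
        if self.remaining > 0:
            self.remaining -= 1
            return False
        return True

    def report(self, valid: bool) -> None:
        """Record whether the LLM's proposed action passed validation."""
        if valid:
            self.invalid_streak = 0
            return
        self.invalid_streak += 1
        if self.invalid_streak >= self.max_invalid:
            self.remaining = self.cooldown_steps
            self.invalid_streak = 0


cooldown = LLMCooldown()
for _ in range(3):               # three invalid proposals in a row
    assert cooldown.use_llm()
    cooldown.report(valid=False)
schedule = [cooldown.use_llm() for _ in range(16)]
```

In `schedule`, the first 15 entries are `False` (fallback in control) and the 16th is `True`, at which point the LLM is consulted again with a fresh streak counter.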
6. Dataset Construction
Training data is collected via rollouts of the smart_fallback policy against the live environment. We deliberately sample from the middle of episodes (skipping the first 1 to 2 steps) rather than from step 0. The first step presents a full backlog with no instructions yet active, a trivially easy state that, if over-represented in the training distribution, teaches the model to assign tasks without any instruction-awareness. Sampling from steps 3 to 8 exposes the model to partially-completed episodes, active instructions, dependency constraints, and developer load variation.
SKIP_STEPS_R2 = 2 # advance past trivial initial state before sampling
SAMPLE_PER_EP = 6 # states sampled per episode
n_episodes = 200 # per difficulty tier
Each episode is wrapped in a try/except block so HuggingFace Space timeouts during data collection do not abort the entire dataset build, a practical robustness measure given the shared-compute environment.
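A collection loop with these properties might look like the following. It is a sketch under assumptions: `reset`, `step`, and `policy` are hypothetical callables standing in for the environment client and `smart_fallback`, and pairs gathered before a mid-episode failure are kept rather than discarded.

```python
from typing import Any, Callable, List, Tuple

SKIP_STEPS_R2 = 2   # advance past the trivial initial state before sampling
SAMPLE_PER_EP = 6   # states sampled per episode (steps 3 to 8)


def collect_pairs(reset: Callable[[], Any],
                  step: Callable[[Any], Tuple[Any, bool]],
                  policy: Callable[[Any], Any],
                  n_episodes: int) -> Tuple[List[Tuple[Any, Any]], int]:
    """Return (observation, action) pairs and the number of failed episodes."""
    pairs: List[Tuple[Any, Any]] = []
    failures = 0
    for _ in range(n_episodes):
        try:
            obs = reset()
            for t in range(SKIP_STEPS_R2 + SAMPLE_PER_EP):
                action = policy(obs)
                if t >= SKIP_STEPS_R2:          # record only mid-episode states
                    pairs.append((obs, action))
                obs, done = step(action)        # hypothetical (obs, done) return
                if done:
                    break
        except Exception:
            failures += 1  # e.g. a Space timeout: keep what we have, move on
    return pairs, failures
```

Returning the failure count alongside the data makes it visible when a large share of episodes timed out, which would otherwise quietly shrink the dataset.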
7. Results and Analysis
Evaluation of the fine-tuned model with the corrected pipeline (SFT warm-up + GRPO + dual reward + fixed inference) is ongoing; the baseline scores it will be compared against are:
| Scenario | Baseline (off-the-shelf) | Fine-tuned |
|---|---|---|
| project_easy | 0.26 | pending |
| project_medium | 0.22 | pending |
| project_hard | 0.35* | pending |
| Average | 0.28 | ongoing |
*project_hard baseline terminated at step 5 due to an episode-abort bug in the state regression handler (now fixed with skip-based recovery).
The key qualitative shift expected post fine-tuning is: the agent should move from the observed pattern of 23 consecutive skip actions (steps 24 to 46 in the baseline project_easy run) to sustained task assignment with instruction compliance. The dual reward and skip penalty are specifically designed to make this the dominant learned strategy.
8. Key Engineering Lessons
Prompt distribution alignment is load-bearing. In our experience, the single largest performance predictor for a fine-tuned LLM agent is whether the inference prompt exactly matches the training prompt. Even small differences in system prompt wording, observation serialisation format, or field ordering can cause a fine-tuned model to behave like the base model. We now treat the system prompt and build_user_prompt function as shared constants across both files, with any divergence treated as a regression.
Reward normalisation pathologies are easy to miss. The skip penalty case illustrates how a seemingly reasonable normalisation ($\text{clip}((r + 3)/5, 0, 1)$) can inadvertently place a suboptimal action above the neutral point, training the model to prefer it. Any reward normalisation scheme should be empirically inspected by plugging in every action type's typical reward and verifying the resulting normalised values are rank-ordered by desirability.
LoRA adapter serving requires local loading. Hosted inference APIs that do not support PEFT adapter composition will silently serve the base model even when a fine-tuned adapter is specified, without any error or warning. Verifying that the fine-tuned model is actually being served is a non-trivial step in production LLM pipelines.
GRPO collapses on homogeneous groups. With num_generations=2, group reward variance was consistently near zero at cold-start: both completions produced equally malformed JSON and received the same reward. Increasing to num_generations=4 and adding the SFT warm-up phase are jointly necessary to establish non-degenerate GRPO training dynamics.
9. Future Work
- Multi-step chain-of-thought reasoning. Current max completion length (96 tokens) allows only the raw JSON action. Allowing 256 to 512 tokens for scratchpad reasoning (`<think>...</think>` prefix) before the JSON output may improve performance on `project_hard`, where multi-sprint dependency planning requires forward simulation.
- Curriculum sharpening. The current 2:2 R1:R2 ratio is fixed. An adaptive curriculum (e.g., increasing R2 fraction as R1 performance plateaus) may improve sample efficiency.
- Offline reward modelling. The current setup requires live environment calls during GRPO reward evaluation. A learned reward model trained on recorded environment transitions could dramatically accelerate training by eliminating HTTP round-trips.
- Larger backbone. `Qwen2.5-1.5B` was selected for T4 compatibility. On an A100, `Qwen2.5-7B` or `Llama-3.1-8B` with 4-bit QLoRA fits comfortably and may exhibit stronger long-horizon planning priors from pre-training.
Resources
- Environment: AI Sprint Manager on HuggingFace Spaces
- Training code: `train_llm.py` (SFT + GRPO pipeline)
- Inference code: `inference_r2.py` (episode loop + validation + fallback)
- Base model: Qwen/Qwen2.5-1.5B-Instruct
- Fine-tuned adapter: priyaaaaaasharmaaaaa/trial1
References
- Ouyang et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
- Shao et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300.
- DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.
- Dettmers et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023.
- Hu et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
- Yu et al. (2025). DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv:2503.14476.
