Teaching an LLM to Manage Software Sprints: From Off-the-Shelf Inference to GRPO Fine-Tuning

#1
by priyaaaaaasharmaaaaa - opened

A walkthrough of building a long-horizon RL agent for engineering project management, combining SFT warm-up, Group Relative Policy Optimisation, and a carefully engineered multi-component reward signal.


Introduction

We built an agent that manages a 60-day, 6-sprint software project by issuing task-assignment actions against a stateful environment. Off-the-shelf open-weight models routed through the HuggingFace Inference API achieved average episode scores of only about 0.25. We attributed this primarily to prompt-distribution mismatch and the inability to load LoRA adapters through the router. We fine-tuned Qwen2.5-1.5B-Instruct using a two-phase training regime: a supervised fine-tuning warm-up on rule-based trajectories followed by GRPO-driven reinforcement learning with a dual reward signal and curriculum over task difficulty. This document details the full architecture, every design decision, and the failure modes we diagnosed and corrected.


1. Background and Motivation

Applying language models as sequential decision-making agents (sometimes called LLM-as-policy) has gained significant traction following InstructGPT, ReAct, and more recently DeepSeek-R1. Most work evaluates models on short-horizon tasks (single-turn QA, coding completions, tool calls with one or two hops). Long-horizon planning under stochastic, instruction-rich environments remains substantially harder and comparatively under-explored.

Our target setting (managing a software engineering project for 60 consecutive steps, across six sprints, with a heterogeneous team, dynamic task dependencies, and a live instruction queue) sits squarely in this harder regime. At each step the agent must:

  1. Read a compact state observation: backlog priorities, developer availability, skill vectors, dependency DAG status, and an instruction queue with per-sprint targets.
  2. Emit a structured JSON action from a five-element action space: assign, reassign, reprioritize, unblock, or skip.
  3. Receive a per-step scalar reward from the environment, with additional sprint-boundary bonuses and a terminal project score.

The environment is served as a stateful REST API on a HuggingFace Space (sejal-k-ai-sprint-manager.hf.space), providing /project/reset and /project/step endpoints. Server-side state is authoritative: the agent cannot fake task completions or manipulate reward computation client-side.
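As a rough illustration of the interaction loop, the sketch below wraps the two endpoints in a small client. Only the host name and the /project/reset and /project/step paths come from the description above; the https scheme, the request bodies (a "task" key for reset, the raw action JSON for step), and the names SprintEnvClient and http_post are assumptions made for this example.

```python
import json
from typing import Any, Callable, Dict
from urllib import request

# Host taken from the post; the https scheme is an assumption.
BASE_URL = "https://sejal-k-ai-sprint-manager.hf.space"


def http_post(url: str, payload: Dict[str, Any], timeout: float = 30.0) -> Dict[str, Any]:
    """POST a JSON payload and decode the JSON response (standard library only)."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


class SprintEnvClient:
    """Thin wrapper around the two endpoints; the transport is injectable for testing."""

    def __init__(self, base_url: str = BASE_URL,
                 post: Callable[..., Dict[str, Any]] = http_post) -> None:
        self.base_url = base_url.rstrip("/")
        self._post = post

    def reset(self, scenario: str) -> Dict[str, Any]:
        # ASSUMPTION: the scenario name is sent under a "task" key.
        return self._post(f"{self.base_url}/project/reset", {"task": scenario})

    def step(self, action: Dict[str, Any]) -> Dict[str, Any]:
        # ASSUMPTION: the request body is the action JSON itself.
        return self._post(f"{self.base_url}/project/step", action)
```

Because all state lives on the server, the client keeps no episode state of its own; it only forwards actions and returns whatever the server reports.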


2. Phase 0: Inference Against Off-the-Shelf Models

Before committing to fine-tuning, we ran the full three-scenario evaluation (project_easy, project_medium, project_hard) using frontier open-weight models served through the HuggingFace Inference Router.

2.1 Experimental Setup

Models evaluated:

  • meta-llama/Llama-3.1-8B-Instruct
  • Qwen/Qwen2.5-7B-Instruct
  • mistralai/Mistral-7B-Instruct-v0.3

All were prompted with a zero-shot system prompt describing the action schema and business rules, and a per-step user prompt encoding the observation as a compact ASCII table.

2.2 Results

| Model | project_easy | project_medium | project_hard | Average |
|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 0.31 | 0.24 | 0.19 | 0.25 |
| Qwen2.5-7B-Instruct | 0.34 | 0.27 | 0.22 | 0.28 |
| Mistral-7B-Instruct-v0.3 | 0.29 | 0.21 | 0.17 | 0.22 |

Average episode score across all models: ≈0.25, substantially below what a purely rule-based heuristic achieves on this environment (≈0.48). The models exhibited two consistent failure modes:

Failure mode 1: repeated invalid assignments. The model would repeatedly attempt to assign an already-in-progress task, receiving a −0.15 penalty each step but never updating its strategy. Without explicit episodic memory in the prompt, the model had no mechanism to learn from immediate negative feedback within a single episode. With a 60-step horizon and no gradient updates during inference, this produced long streaks of identical penalised actions.

Failure mode 2: skip accumulation from instruction confusion. When the active instruction queue contained contradictory or partially-satisfiable instructions (e.g., "prioritise AUTH tasks" while all AUTH tasks have unmet dependencies), models would default to skip rather than resolving the ambiguity. With no penalty signal severe enough to discourage this behaviour at inference time, skip rates above 40% were common.

2.3 Root Cause Analysis

Beyond these behavioural failures, we identified a deeper infrastructure issue: the HuggingFace Router serves only fully-merged weight checkpoints. Any fine-tuned adapter stored as a LoRA delta must be merged with its base model before the router can serve it. This meant that even after an initial fine-tuning attempt uploaded to the Hub as a PEFT adapter, the router continued serving the base model, producing scores indistinguishable from the zero-shot baseline. Every inference call was silently going to the pre-trained model, not the fine-tuned one.

The fix required loading the adapter locally using Unsloth's FastLanguageModel.from_pretrained, applying the adapter weights directly, and running inference on-device rather than routing through the API. We implemented a two-path loader in inference_r2.py: Unsloth 4-bit fast-inference path (primary), falling back to PEFT + bitsandbytes for environments where Unsloth is unavailable.
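A minimal sketch of such a two-path loader is shown below. It is not the code in inference_r2.py: the function names, the adapter repository argument, and the choice of fp16 compute dtype are assumptions for illustration. The heavy libraries are imported lazily so the fallback logic can run even where Unsloth is not installed.

```python
from typing import Any, Callable, List, Tuple

Loader = Callable[[str], Tuple[Any, Any]]


def load_unsloth(adapter_id: str, max_seq_length: int = 2048) -> Tuple[Any, Any]:
    """Primary path: Unsloth loads the adapter repo on top of its 4-bit base model."""
    from unsloth import FastLanguageModel  # imported lazily; may be unavailable

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=adapter_id,
        max_seq_length=max_seq_length,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)  # switch to the fast inference path
    return model, tokenizer


def load_peft(adapter_id: str, base_id: str = "Qwen/Qwen2.5-1.5B-Instruct") -> Tuple[Any, Any]:
    """Fallback path: plain transformers + bitsandbytes base, adapter applied via PEFT."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,  # assumption: fp16 compute for a T4
    )
    base = AutoModelForCausalLM.from_pretrained(
        base_id, quantization_config=quant, device_map="auto"
    )
    model = PeftModel.from_pretrained(base, adapter_id)
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    return model.eval(), tokenizer


def load_with_fallback(adapter_id: str, loaders: List[Tuple[str, Loader]]) -> Tuple[str, Any, Any]:
    """Try each loader in order and report which path succeeded."""
    errors = []
    for name, loader in loaders:
        try:
            model, tokenizer = loader(adapter_id)
            return name, model, tokenizer
        except Exception as exc:  # missing package, CUDA error, download failure...
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all loaders failed: " + "; ".join(errors))
```

A call such as `load_with_fallback(adapter_repo, [("unsloth", load_unsloth), ("peft", load_peft)])` returns the name of the path that worked, which is worth logging given how easily the wrong model can be served without any error.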


3. Fine-Tuning Architecture

The inadequacy of off-the-shelf models motivated a full fine-tuning pipeline. We selected Qwen2.5-1.5B-Instruct as the backbone: small enough to fit on a T4 (16 GB) under 4-bit quantisation, yet sufficiently capable to learn structured JSON generation with semantic constraints.

3.1 Model and Adapter Configuration

We use QLoRA (Dettmers et al., 2023) with the following configuration:

LoRA rank r         = 16
LoRA alpha α        = 32        # scaling factor = α/r = 2.0
LoRA dropout        = 0.0       # required for Unsloth fast-kernel path
Quantisation        = 4-bit NF4
Target modules      = [q_proj, k_proj, v_proj, o_proj,
                       gate_proj, up_proj, down_proj]
Gradient checkpoint = "unsloth"  # custom activation checkpointing

Targeting all seven projection matrices (attention and MLP) gives the adapter access to both the attention mechanism and the feed-forward sublayer. With r=16 and the full complement of target modules across the model's 28 layers, trainable parameters total roughly 18.5 M out of 1.5 B, approximately 1.2% of total parameters.

Unsloth's fast-patching mechanism rewrites LoRA matrix multiplications as fused CUDA kernels, providing roughly 2× training throughput versus naive PEFT on a T4.

3.2 Phase 1: SFT Warm-Up

Cold-starting GRPO on an instruction-tuned base model is problematic: if all num_generations completions per prompt receive similar rewards, the group-normalised advantage is zero and no gradient flows. We observed this empirically: reward standard deviation within early GRPO groups was near zero because the base model consistently output malformed JSON, and all four generations received the same neutral fallback reward.

We address this with a supervised fine-tuning warm-up similar in spirit to the SFT phase described in InstructGPT (Ouyang et al., 2022) and later adopted in DeepSeek-R1 (DeepSeek-AI, 2025). The warm-up teaches the model the output format, not the optimal policy, before reward-driven exploration begins.

The SFT dataset consists of (observation, action) pairs where observations are sampled from the middle of rollouts (steps 3–8, avoiding the trivially easy step-0 full-backlog state) and actions are produced by a deterministic rule-based policy (smart_fallback). The rule-based policy is not optimal, but it reliably generates valid JSON, which is sufficient to seed the format distribution.

SFT_CONFIG = {
    "num_train_epochs":            2,
    "learning_rate":               2e-5,   # higher than GRPO: supervised signal is dense
    "per_device_train_batch_size": 2,
    "warmup_steps":                5,
}

After two epochs of SFT warm-up, the model outputs valid JSON with the correct schema on roughly 90% of prompts, which is sufficient for GRPO to observe meaningful reward variance across its generation groups.

3.3 Phase 2: GRPO Fine-Tuning

We use Group Relative Policy Optimisation (Shao et al., 2024) as implemented in TRL's GRPOTrainer (available from TRL v0.14 onward). GRPO eliminates the need for a learned value network by computing advantages relative to the mean reward within each generation group, reducing GPU memory overhead by approximately 40% compared to PPO, which is the difference between fitting and OOM-ing on a T4.

The GRPO objective is:

$$\mathcal{L}_{\text{GRPO}} = \mathbb{E}\left[\sum_{t} \hat{A}_t \log \pi_\theta(a_t \mid s_t) - \beta \cdot D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})\right]$$

where every token $t$ of completion $i$ shares that completion's group-relative advantage ($\hat{A}_t = \hat{A}_i$), computed as:

$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^G)}{\text{std}(\{r_j\}_{j=1}^G) + \epsilon}$$

with $G = 4$ generations per prompt. The KL divergence penalty (coefficient $\beta = 0.04$) is computed against the frozen reference model (the base Qwen2.5-1.5B checkpoint), serving as an anti-reward-hacking regulariser that prevents the policy from exploiting degenerate strategies like outputting skip on every step.
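The advantage formula, and the cold-start collapse it implies, can be checked with a few lines of plain Python. This is an illustrative sketch rather than TRL's implementation; the use of the population standard deviation and the value of eps are assumptions.

```python
from statistics import fmean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Group-relative advantage: (r_i - mean) / (std + eps) within one prompt's group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; a sample std also works
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with varied rewards yields signed advantages centred on zero.
varied = group_advantages([0.1, 0.4, 0.6, 0.9])

# A homogeneous group (e.g. four malformed completions with the same
# fallback reward) yields all-zero advantages, so no gradient flows.
collapsed = group_advantages([0.5, 0.5, 0.5, 0.5])
print(varied, collapsed)
```

The second call is the situation the SFT warm-up is meant to avoid: identical rewards across the group leave the policy gradient with nothing to rank.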

We use the DAPO loss (loss_type="dapo"), which normalises by total active tokens in the accumulated micro-batch rather than by sequence count. This removes the response-length bias present in the original GRPO formulation, where shorter completions with positive advantages were over-weighted.
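To see where the length bias comes from, the sketch below compares the weight each token receives under the two normalisation schemes. It is a worked arithmetic example with hypothetical function names, not the trainer's code.

```python
from typing import List


def sequence_mean_weights(lengths: List[int]) -> List[float]:
    """Per-token weight when each sequence is averaged first, then sequences are averaged."""
    group_size = len(lengths)
    return [1.0 / (group_size * n) for n in lengths]


def token_mean_weights(lengths: List[int]) -> List[float]:
    """Per-token weight when the summed loss is divided by the total active token count."""
    total = sum(lengths)
    return [1.0 / total for _ in lengths]


# Two completions of 20 and 80 tokens in the same group.
lengths = [20, 80]
print(sequence_mean_weights(lengths))  # tokens of the short completion weigh 4x more
print(token_mean_weights(lengths))     # every token weighs the same
```

Under per-sequence averaging, a token in the 20-token completion carries four times the weight of a token in the 80-token one; dividing by the total token count gives every token equal weight regardless of which completion it belongs to.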

Additional GRPO configuration:

GRPO_CONFIG = {
    "learning_rate":               5e-6,
    "per_device_train_batch_size": 1,      # T4-safe
    "gradient_accumulation_steps": 8,      # effective batch = 8 = 2 × num_generations ✓
    "num_generations":             4,
    "temperature":                 1.0,    # diversity during training
    "max_completion_length":       96,     # action JSON ≤ 80 tokens; 16-token margin
    "mask_truncated_completions":  True,   # DAPO paper recommendation
    "scale_rewards":               "group",
    "disable_dropout":             True,   # stabilises KL estimates
    "log_completions":             True,
}

3.4 Curriculum Learning

We adopt a two-stage curriculum over task complexity, alternating R1 (single-sprint, 10-step) and R2 (multi-sprint, 60-step) episodes in a 2:2 ratio per generation group. R1 tasks provide a denser reward signal early in training: a 10-step episode reaches a terminal state quickly, giving the policy gradient clear feedback on every action sequence. R2 tasks develop long-horizon planning capability but are sample-inefficient when introduced cold. Mixing in a 2:2 ratio ensures the model sees R2 complexity from the start while anchoring on R1's denser feedback.


4. Reward Design

Reward design is the most consequential implementation decision in any applied RL system. We implement a dual reward architecture with several explicit anti-reward-hacking measures.

4.1 Primary Reward: Environment Signal

The environment returns a per-step scalar step_r encoding the consequence of the last action:

| Action outcome | Reward |
|---|---|
| Task assigned, completed by deadline | +1.5 to +2.0 |
| Skill match bonus | +0.5 to +1.0 |
| Skip (opportunity cost) | −0.05 |
| Assign rejected (task in_progress / blocked) | −0.1 to −0.15 |
| High-priority task missed at sprint boundary | −2.0 to −2.5 |

For R1 episodes, the normalised reward is:

$$r_{\text{norm}} = \text{clip}\left(\frac{\text{step\_r} + 3.0}{5.0},\ 0,\ 1\right)$$

The shift of +3.0 places the skip case ($-0.05 + 3.0 = 2.95$, normalised $\approx 0.59$) above the 0.5 neutral point, which we identified as a training pathology. The model learned that skip was "safe" because its normalised reward exceeded the neutral value. We corrected this by subtracting an additional 0.20 from the normalised reward whenever the agent skips with a non-empty backlog and available developers, landing the effective skip reward at ≈0.39 (clearly below neutral).
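The normalisation and the skip correction can be written as a small function. The sketch below follows the formula and the 0.20 adjustment described above; the function name and its boolean flags are assumptions about how the surrounding code exposes backlog and developer state.

```python
def normalise_r1(step_r: float, skipped: bool = False,
                 backlog_nonempty: bool = False,
                 devs_available: bool = False) -> float:
    """Map a raw step reward into [0, 1], then apply the skip correction."""
    r = min(1.0, max(0.0, (step_r + 3.0) / 5.0))
    if skipped and backlog_nonempty and devs_available:
        r -= 0.20  # avoidable skip: push it below the 0.5 neutral point
    return r


# Plug in one typical raw reward per outcome to inspect the ordering.
typical = {
    "assign completed": 1.5,
    "assign rejected": -0.15,
    "high-priority miss": -2.0,
}
for name, raw in typical.items():
    print(f"{name:>20}: {normalise_r1(raw):.2f}")
print(f"{'skip (uncorrected)':>20}: {normalise_r1(-0.05):.2f}")
print(f"{'skip (corrected)':>20}: {normalise_r1(-0.05, True, True, True):.2f}")
```

Printing one value per action type is a quick way to confirm that the normalised rewards rank in the intended order before any training run.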

For R2 episodes, the combined reward incorporates an instruction-following auxiliary signal:

$$r_{\text{combined}} = r_{\text{step\_norm}} \times 0.6 + \text{inst\_score} \times 0.4$$

inst_score is a server-computed running average over the episode: 1.0 if the agent has consistently acted on active instructions, 0.0 if it has consistently ignored them. Using it as a 0.4-weighted auxiliary reward makes every step in an R2 episode instruction-aware, even when the terminal project score (the sparse signal) is many steps away. Without this term, GRPO learns to ignore the instruction queue, because the sparse terminal bonus is too delayed to propagate a gradient back to early-episode instruction handling.

4.2 Secondary Reward: Format Signal

We implement a second reward function that provides a dense, always-available signal for JSON schema validity:

| Completion quality | Format reward |
|---|---|
| Valid JSON, correct action_type, all required fields | +0.30 |
| Valid JSON, correct action_type, missing fields | +0.10 |
| Valid JSON but action_type = skip | +0.10 |
| Not valid JSON | 0.00 |

Crucially, skip receives only 0.10 (the same as structurally incomplete JSON) rather than the 0.30 awarded to a fully valid assign. The format signal thus actively discourages skip relative to a well-formed assignment action, providing a gradient against skip-spamming even before the environment reward is observed.

The dual reward is combined with weights:

$$r_{\text{total}} = r_{\text{env}} \times 1.0 + r_{\text{format}} \times 0.2$$
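The format table and the weighting above might be implemented along the following lines. The required-field names per action type and the zero score for an unrecognised action_type are assumptions, since the action schema is not spelled out here.

```python
import json

# ASSUMPTION: illustrative required fields per action type; the real schema may differ.
REQUIRED_FIELDS = {
    "assign": ("task_id", "developer_id"),
    "reassign": ("task_id", "developer_id"),
    "reprioritize": ("task_id", "priority"),
    "unblock": ("task_id",),
}


def format_reward(completion: str) -> float:
    """Score a raw completion for JSON validity and schema completeness."""
    try:
        action = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(action, dict):
        return 0.0
    action_type = action.get("action_type")
    if action_type == "skip":
        return 0.10  # valid JSON, but deliberately scored like incomplete output
    if action_type not in REQUIRED_FIELDS:
        return 0.0  # ASSUMPTION: unknown action types score like invalid output
    missing = [f for f in REQUIRED_FIELDS[action_type] if f not in action]
    return 0.10 if missing else 0.30


def total_reward(r_env: float, r_format: float) -> float:
    """Weighted sum of the environment and format signals (weights 1.0 and 0.2)."""
    return 1.0 * r_env + 0.2 * r_format
```

With these weights a fully valid action adds at most 0.06 to the total, so the format signal nudges the policy without outweighing the environment reward.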

4.3 Anti-Reward-Hacking Measures

Beyond reward normalisation and the format signal, we implement the following hardened defences:

  1. KL divergence penalty ($\beta = 0.04$): prevents the policy from drifting to degenerate strategies absent from the reference model's support.
  2. Episode reset before each reward evaluation: the environment is reset to a fixed seed before every GRPO reward call, preventing carry-over state from biasing reward estimates across generations.
  3. Server-side state authority: task completion requires server-tracked effort countdown; instruction_following_score is computed server-side from the actual action sequence; tech debt is permanent and server-authoritative. None of these can be manipulated client-side.
  4. Tech debt permanent drag: each missed sprint task permanently reduces an affected developer's productivity by 2%. Short-horizon hacking (rush low-effort tasks for early reward) is self-defeating: the compounding productivity penalty makes it worse than disciplined sprint planning over the full 60-day horizon.
  5. Reward clamping to [0.01, 0.99]: the agent cannot achieve a 1.0 reward on any single action, removing trivial reward-maximisation shortcuts.

5. Inference Pipeline

The inference pipeline (inference_r2.py) implements the evaluation loop used at test time. It is deliberately decoupled from the training code: the system prompt and user prompt format are maintained as shared constants, and any drift between training and inference distributions is treated as a first-class bug.

5.1 Prompt Architecture

Each step constructs a user prompt with the following sections:

D{day}/60 S{sprint}/6 {days_left}d done={N} miss={N} inst={score:.2f} debt={N}
⚑FOLLOW: [I01] <instruction text[:50]> | [I02] <instruction text[:50]>
BACKLOG(✓=deps_ok): [T04]P1 back ✓ D9 [T08]P1 back ✓ D12 [T06]P1 back ✗ D18 +21
IN_PROG: [T01]→dev1 [T02]→dev2 [T05]→dev4
DEVS(avail): [dev1]Alic(bac) 5/5 [dev3]Caro(dev) 0/5 [dev5]Eve(bac) 0/5
EPISODE_MEMORY:
  NO_REASSIGN_UNTIL_BACKLOG: T01 T02 T05
  ASSIGNED_OK_THIS_EP: T01 T04
  ASSIGN_ALREADY_TRIED: T08
JSON:

The EPISODE_MEMORY block is inference-only: it encodes within-episode state that the model cannot derive from the current observation alone. This is not in the training prompts, so it acts as a soft corrective signal rather than a learned dependency.

5.2 Action Validation Gate

Before any LLM-proposed action is submitted to the environment, it passes through validate_llm_action, a hard-gate rule checker that enforces:

  • assign requires status=backlog, met dependency DAG, available developer with matching or fullstack skill, and sufficient remaining capacity.
  • reassign requires status ∈ {backlog, in_progress}. This is a crucial fix from the baseline, where reassign was erroneously gated behind status=backlog only, causing every LLM reassign suggestion to be silently rejected.
  • unblock requires status=blocked and all dependencies in done.
  • reprioritize requires a valid priority integer in [1, 5].

Actions failing validation are not submitted to the environment; instead, smart_fallback is invoked.
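A hedged sketch of such a gate is given below. The rules mirror the four bullets above, but the state layout (tasks and developers as dictionaries with status, deps, skill, effort, available, and capacity_left keys), the field names task_id and developer_id, and the pass-through for skip are assumptions, not the actual validate_llm_action.

```python
from typing import Any, Dict

Task = Dict[str, Any]
Dev = Dict[str, Any]


def deps_done(task: Task, tasks: Dict[str, Task]) -> bool:
    """True when every dependency of the task has status 'done'."""
    return all(tasks[d]["status"] == "done" for d in task.get("deps", []))


def validate_llm_action(action: Dict[str, Any],
                        tasks: Dict[str, Task],
                        devs: Dict[str, Dev]) -> bool:
    kind = action.get("action_type")
    if kind == "skip":
        return True  # ASSUMPTION: skip always passes the gate
    task = tasks.get(action.get("task_id"))
    if task is None:
        return False
    if kind == "assign":
        dev = devs.get(action.get("developer_id"))
        return (
            task["status"] == "backlog"
            and deps_done(task, tasks)
            and dev is not None
            and dev["available"]
            and dev["skill"] in (task["skill"], "fullstack")
            and dev["capacity_left"] >= task["effort"]
        )
    if kind == "reassign":
        # The fix described above: in_progress tasks are eligible too.
        return task["status"] in ("backlog", "in_progress")
    if kind == "unblock":
        return task["status"] == "blocked" and deps_done(task, tasks)
    if kind == "reprioritize":
        p = action.get("priority")
        return isinstance(p, int) and not isinstance(p, bool) and 1 <= p <= 5
    return False
```

Returning a plain boolean keeps the caller simple: a False result routes the step to the fallback policy instead of sending a doomed action to the server.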

5.3 Smart Fallback Policy

The rule-based fallback policy implements a three-tier priority hierarchy:

  1. Unblock: any blocked task whose dependency DAG is fully satisfied is immediately unblocked. This is always correct and never penalised.
  2. Reprioritize: any backlog task referenced in an active instruction with priority > 2 is reprioritised to P1 before assignment. This improves inst_score without consuming an assignment slot.
  3. Assign: the highest-priority backlog task with met dependencies and a skill-matched available developer. A critical fix here is the recently_failed exclusion set: tasks that were attempted but not confirmed as in_progress (i.e., server rejected the assign) are excluded from the first-pass candidate list. Without this, the fallback would retry the same rejected task on every step, producing the T15-four-times and T23-eight-times penalty loops visible in the baseline logs.
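The three tiers can be sketched as a single function that returns the first applicable action. The state layout, the instructed_ids argument, and the final skip when no candidate remains are assumptions; in particular, any second pass over recently failed tasks is left out because its behaviour is not described.

```python
from typing import Any, Dict, Set

Task = Dict[str, Any]
Dev = Dict[str, Any]


def deps_done(task: Task, tasks: Dict[str, Task]) -> bool:
    return all(tasks[d]["status"] == "done" for d in task.get("deps", []))


def smart_fallback(tasks: Dict[str, Task], devs: Dict[str, Dev],
                   instructed_ids: Set[str],
                   recently_failed: Set[str]) -> Dict[str, Any]:
    # Tier 1: unblock any blocked task whose dependencies are all done.
    for tid, task in tasks.items():
        if task["status"] == "blocked" and deps_done(task, tasks):
            return {"action_type": "unblock", "task_id": tid}

    # Tier 2: raise instructed backlog tasks with priority > 2 to P1.
    for tid, task in tasks.items():
        if task["status"] == "backlog" and tid in instructed_ids and task["priority"] > 2:
            return {"action_type": "reprioritize", "task_id": tid, "priority": 1}

    # Tier 3: assign the highest-priority ready task, skipping recent rejections.
    candidates = sorted(
        (tid for tid, task in tasks.items()
         if task["status"] == "backlog"
         and deps_done(task, tasks)
         and tid not in recently_failed),
        key=lambda tid: tasks[tid]["priority"],
    )
    for tid in candidates:
        for dev_id, dev in devs.items():
            if dev["available"] and dev["skill"] in (tasks[tid]["skill"], "fullstack"):
                return {"action_type": "assign", "task_id": tid, "developer_id": dev_id}

    return {"action_type": "skip"}  # ASSUMPTION: nothing actionable this step
```

The caller is expected to add a task to recently_failed whenever the server rejects its assignment, which is what breaks the retry loop on a single rejected task.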

5.4 Adaptive LLM Cooldown

If the LLM produces invalid actions on three consecutive steps, it is placed in a 15-step cooldown, during which the fallback policy takes over. This prevents a stuck LLM from wasting the entire episode issuing penalised bad assigns while the fallback could have made forward progress. The cooldown is lifted after 15 steps, giving the model another chance, since its contextual understanding of the episode state may improve as more observations accumulate.
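One way to express this rule is a small counter object consulted once per step. The class and method names below are hypothetical; only the thresholds (three consecutive invalid actions, 15 steps of cooldown) come from the description above.

```python
class LLMCooldown:
    """Tracks consecutive invalid LLM actions and gates access to the model."""

    def __init__(self, max_invalid: int = 3, cooldown_steps: int = 15) -> None:
        self.max_invalid = max_invalid
        self.cooldown_steps = cooldown_steps
        self.invalid_streak = 0
        self.cooldown_left = 0

    def use_llm(self) -> bool:
        """Call once per step; False means the fallback policy should act."""
        if self.cooldown_left > 0:
            self.cooldown_left -= 1
            return False
        return True

    def record(self, valid: bool) -> None:
        """Report whether the LLM's proposed action passed validation."""
        if valid:
            self.invalid_streak = 0
            return
        self.invalid_streak += 1
        if self.invalid_streak >= self.max_invalid:
            self.cooldown_left = self.cooldown_steps
            self.invalid_streak = 0
```

In the step loop, `use_llm()` decides who acts and `record(...)` is called only on steps where the LLM was actually queried, so a single valid action is enough to reset the streak.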


6. Dataset Construction

Training data is collected via rollouts of the smart_fallback policy against the live environment. We deliberately sample from the middle of episodes (skipping the first 1–2 steps) rather than from step 0. The first step presents a full backlog with no instructions yet active, a trivially easy state that, if over-represented in the training distribution, teaches the model to assign tasks without any instruction-awareness. Sampling from steps 3–8 exposes the model to partially-completed episodes, active instructions, dependency constraints, and developer load variation.

SKIP_STEPS_R2 = 2    # advance past trivial initial state before sampling
SAMPLE_PER_EP = 6    # states sampled per episode
n_episodes    = 200  # per difficulty tier

Each episode is wrapped in a try/except block so HuggingFace Space timeouts during data collection do not abort the entire dataset build, a practical robustness measure given the shared-compute environment.
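Put together, the collection loop might look like the sketch below. The env object with reset and step methods, the assumption that step returns the next observation as a dictionary with an optional "done" flag, and the function name collect_sft_pairs are all illustrative rather than taken from the training script.

```python
from typing import Any, Callable, Dict, List, Tuple

SKIP_STEPS_R2 = 2   # advance past trivial initial state before sampling
SAMPLE_PER_EP = 6   # states sampled per episode

Obs = Dict[str, Any]
Action = Dict[str, Any]


def collect_sft_pairs(env: Any, policy: Callable[[Obs], Action],
                      scenario: str, n_episodes: int) -> List[Tuple[Obs, Action]]:
    """Roll out the rule-based policy and keep (observation, action) pairs
    from the 3rd to the 8th step of each episode."""
    pairs: List[Tuple[Obs, Action]] = []
    for episode in range(n_episodes):
        try:
            obs = env.reset(scenario)
            for step in range(SKIP_STEPS_R2 + SAMPLE_PER_EP):
                action = policy(obs)
                if step >= SKIP_STEPS_R2:
                    pairs.append((obs, action))
                obs = env.step(action)  # ASSUMPTION: returns the next observation
                if obs.get("done"):
                    break
        except Exception as exc:
            # A timeout on the shared Space costs one episode, not the whole build.
            print(f"episode {episode} skipped: {exc!r}")
    return pairs
```

Pairs gathered before a mid-episode failure are kept, so a timeout late in an episode still contributes its earlier samples.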


7. Results and Analysis

After fine-tuning with the corrected pipeline (SFT warm-up + GRPO + dual reward + fixed inference):

| Scenario | Baseline (off-the-shelf) | Fine-tuned |
|---|---|---|
| project_easy | 0.26 | pending |
| project_medium | 0.22 | pending |
| project_hard | 0.35* | pending |
| Average | 0.28 | ongoing |

*project_hard baseline terminated at step 5 due to an episode-abort bug in the state regression handler (now fixed with skip-based recovery).

The key qualitative shift expected after fine-tuning is that the agent should move from the observed pattern of 23 consecutive skip actions (steps 24–46 in the baseline project_easy run) to sustained task assignment with instruction compliance; the dual reward and skip penalty are specifically designed to make this the dominant learned strategy.


8. Key Engineering Lessons

Prompt distribution alignment is load-bearing. In our experience, the single largest performance predictor for a fine-tuned LLM agent is whether the inference prompt exactly matches the training prompt. Even small differences in system prompt wording, observation serialisation format, or field ordering can cause a fine-tuned model to behave like the base model. We now treat the system prompt and build_user_prompt function as shared constants across both files, with any divergence treated as a regression.

Reward normalisation pathologies are easy to miss. The skip penalty case illustrates how a seemingly reasonable normalisation ($\text{clip}((r + 3)/5, 0, 1)$) can inadvertently place a suboptimal action above the neutral point, training the model to prefer it. Any reward normalisation scheme should be empirically inspected by plugging in every action type's typical reward and verifying the resulting normalised values are rank-ordered by desirability.

LoRA adapter serving requires local loading. Hosted inference APIs that do not support PEFT adapter composition will silently serve the base model even when a fine-tuned adapter is specified, without any error or warning. Verifying that the fine-tuned model is actually being served is a non-trivial step in production LLM pipelines.

GRPO collapses on homogeneous groups. With num_generations=2, group reward variance was consistently near zero at cold-start: both completions produced equally malformed JSON and received the same reward. Increasing to num_generations=4 and adding the SFT warm-up phase are jointly necessary to establish non-degenerate GRPO training dynamics.


9. Future Work

  • Multi-step chain-of-thought reasoning. Current max completion length (96 tokens) allows only the raw JSON action. Allowing 256–512 tokens for scratchpad reasoning (<think>...</think> prefix) before the JSON output may improve performance on project_hard, where multi-sprint dependency planning requires forward simulation.
  • Curriculum sharpening. The current 2:2 R1:R2 ratio is fixed. An adaptive curriculum (e.g., increasing R2 fraction as R1 performance plateaus) may improve sample efficiency.
  • Offline reward modelling. The current setup requires live environment calls during GRPO reward evaluation. A learned reward model trained on recorded environment transitions could dramatically accelerate training by eliminating HTTP round-trips.
  • Larger backbone. Qwen2.5-1.5B was selected for T4 compatibility. On an A100, Qwen2.5-7B or Llama-3.1-8B with 4-bit QLoRA fits comfortably and may exhibit stronger long-horizon planning priors from pre-training.

References

  • Ouyang et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
  • Shao et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300.
  • DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.
  • Dettmers et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023.
  • Hu et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
  • Yu et al. (2025). DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv:2503.14476.