dsc-co-grpo-lora

LoRA adapter trained with TRL GRPO + Unsloth for openenv-dsc-co, a 30-step supply-chain planning environment verified by a deterministic PuLP/CBC min-cost-flow oracle.

Training Setup

  • Base model: unsloth/Llama-3.2-3B-Instruct-bnb-4bit
  • Method: 4-bit QLoRA, LoRA rank 32
  • Trainer: TRL GRPOTrainer
  • Samples: 2,000 prompts
  • Steps: 400
  • Generations per prompt: 8
  • Max completion length: 768 tokens
  • Runtime: 17,469.9 seconds / 4h 51m
  • Final train loss: -0.04913
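As a quick sanity check on the figures above, the reported runtime can be converted to hours and minutes and divided by the step count to get an average wall-clock cost per optimizer step. This is plain arithmetic on the numbers in the list; the variable names are illustrative only.

```python
# Figures copied from the Training Setup list.
runtime_seconds = 17_469.9
optimizer_steps = 400

# Convert the runtime to whole hours and minutes.
hours, remainder = divmod(runtime_seconds, 3600)
minutes = int(remainder // 60)

# Average wall-clock time spent per optimizer step.
seconds_per_step = runtime_seconds / optimizer_steps

print(f"{int(hours)}h {minutes}m")          # 4h 51m
print(f"{seconds_per_step:.1f} s/step")     # 43.7 s/step
```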

Reward Evidence

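| metric                | first logged step | final step | best / aggregate                    |
|-----------------------|-------------------|------------|-------------------------------------|
| combined reward       | 0.622             | 1.304      | max 1.365                           |
| cumulative env reward | 0.505             | 0.852      | last-25 mean 0.855                  |
| terminal MILP reward  | 0.052             | 0.226      | max 0.241                           |
| reward std            | 0.387             | 0.079      | frac_reward_zero_std = 0 (final)    |
| KL                    | 0.000             | 0.0077     | stable                              |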
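The reward std and frac_reward_zero_std rows matter because GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below is a simplified illustration of that group-relative normalization, not the TRL implementation: it uses the population standard deviation and an assumed epsilon, and the reward values are made up. It shows why a group whose completions all score the same contributes no learning signal.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # simplification: population std, assumed eps
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 8 completions of one prompt (illustrative values).
varied = [0.2, 0.9, 1.3, 0.5, 1.1, 0.7, 1.0, 0.4]
# A group where every completion scores the same has zero standard deviation.
uniform = [0.8] * 8

adv_varied = group_relative_advantages(varied)
adv_uniform = group_relative_advantages(uniform)

print([round(a, 2) for a in adv_varied])
print(adv_uniform)  # every advantage is zero, so the group gives no gradient
```

Under this reading, a frac_reward_zero_std of 0 at the final step suggests that every prompt group in that batch still had some spread in reward, even though the overall reward std had dropped.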

The terminal reward is emitted only after the environment reaches the 30-step horizon and invokes the MILP verifier. A non-zero terminal reward throughout the run therefore indicates that the model-generated tool actions reached verified episode completion, rather than only collecting dense shaping rewards.
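The gating logic described here can be sketched in a few lines. This is a hypothetical illustration, not the environment's code: the function name, its arguments, and the example scores are assumptions, and only the 30-step horizon comes from this card.

```python
HORIZON = 30  # episode length stated in this card


def terminal_reward(steps_completed, verifier_score):
    """Return the verifier score only for episodes that reached the horizon."""
    if steps_completed < HORIZON:
        # Episode ended early: the verifier is never invoked.
        return 0.0
    return verifier_score


# Hypothetical episodes: (steps completed, score the verifier would assign).
episodes = [(30, 0.25), (12, 0.25), (30, 0.0)]
rewards = [terminal_reward(steps, score) for steps, score in episodes]
print(rewards)  # [0.25, 0.0, 0.0]
```

The third episode shows that the implication runs one way: a non-zero terminal reward implies the horizon was reached, but a completed episode can still score zero.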

Loading

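from unsloth import FastLanguageModel

# Load the adapter repository from the Hugging Face Hub.
model, tokenizer = FastLanguageModel.from_pretrained(
    "AceofStades/dsc-co-grpo-lora",
    max_seq_length=8192,   # maximum context length in tokens
    load_in_4bit=True,     # matches the 4-bit QLoRA training setup
    fast_inference=True,   # uses vLLM for generation; requires vLLM installed
)
# Switch the model to Unsloth's inference mode.
FastLanguageModel.for_inference(model)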

Notes

The model is a planning-policy adapter for the DSC environment. It is not a general-purpose assistant model. The environment and reward pipeline are documented in the source README and linked docs.
