Risk RL Lab Qwen3 4B SFT LoRA Adapter

This repository contains a PEFT LoRA/QLoRA adapter trained for action selection in a deterministic Python Risk-compatible environment. The adapter is intended to emit one strict JSON object that selects from the prompt's candidate actions.

Output Contract

The model-facing contract is exactly:

{"action_index": 0}

action_index refers to a position in the compact action list shown in the prompt, not directly to a raw environment action. Runtime code maps that index back to the full legal action, and the environment then validates it. Invalid JSON, a missing action_index, an out-of-range index, or an illegal action must be logged, and FallbackSafeAgent must supply the replacement action.
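
The handling described above can be sketched as follows. This is a minimal illustration, not the repository's runtime code: the function names, the `is_legal` and `fallback` callables (standing in for the environment check and FallbackSafeAgent), and the choice to reject extra keys and boolean values as part of "strict" are all assumptions.

```python
import json
import logging
from typing import Any, Callable, Optional, Sequence

logger = logging.getLogger("risk_policy")


def parse_action_index(raw_text: str, num_candidates: int) -> Optional[int]:
    """Return the selected index, or None if the output breaks the contract."""
    try:
        payload = json.loads(raw_text)
    except (json.JSONDecodeError, TypeError):
        logger.warning("invalid JSON from model: %r", raw_text)
        return None
    # Assumed strictness: a JSON object with exactly one key, "action_index".
    if not isinstance(payload, dict) or set(payload) != {"action_index"}:
        logger.warning("missing action_index or unexpected keys: %r", payload)
        return None
    index = payload["action_index"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(index, bool) or not isinstance(index, int):
        logger.warning("action_index is not an integer: %r", index)
        return None
    if not 0 <= index < num_candidates:
        logger.warning("action_index out of range: %d", index)
        return None
    return index


def select_action(
    raw_text: str,
    candidates: Sequence[Any],
    is_legal: Callable[[Any], bool],
    fallback: Callable[[Sequence[Any]], Any],
) -> Any:
    """Map the model output to a full action, deferring to the fallback on any failure.

    `candidates` holds the full legal actions in the same order as the prompt list.
    """
    index = parse_action_index(raw_text, len(candidates))
    if index is not None:
        action = candidates[index]
        if is_legal(action):
            return action
        logger.warning("environment rejected action at index %d", index)
    return fallback(candidates)
```

Keeping parsing separate from the legality check means the environment remains the final authority even when the JSON itself is well formed.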

Training Summary

  • Base model: Qwen/Qwen3-4B-Instruct-2507
  • Method: Unsloth 4-bit LoRA/QLoRA SFT, adapter-only save
  • Training rows: 49,000
  • Held-out validation rows: 1,000
  • Epochs: 1.0
  • Sequence length: 2048
  • LoRA rank/alpha: 16 / 16
  • Train loss: 0.1535
  • Held-out eval loss: 0.1226
  • Hardware: NVIDIA RTX A5000

Training command:

HF_HOME=/workspace/.hf_home python -m training.train_sft_unsloth \
  --dataset data/sft/risk_sft_stratified_50k.jsonl \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --out models/adapters/qwen3_4b_risk_sft \
  --max-steps -1 \
  --num-train-epochs 1 \
  --limit-rows 0 \
  --validation-split 0.02 \
  --split-seed 3407 \
  --eval-steps 1000 \
  --logging-steps 50

Benchmark Results

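| Evaluation                      | Rows | Strict JSON | Valid Index | Teacher Match | Invalid |
|---------------------------------|------|-------------|-------------|---------------|---------|
| Base model fixed prompt set     | 100  | 0.000       | 0.000       | 0.000         | 100     |
| Adapter fixed prompt set        | 100  | 0.850       | 0.850       | 0.820         | 15      |
| Adapter held-out validation set | 1000 | 0.779       | 0.779       | 0.722         | 221     |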
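
In every row of these results, Invalid equals Rows minus the number of valid-index outputs (for example, 1000 − 779 = 221). The sketch below shows one way such columns could be computed. The metric definitions, the `Row` record, and the `score` function are assumptions made for illustration; the actual logic lives in training.benchmark_policy and may differ.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Row:
    raw_output: str      # text generated by the model
    num_candidates: int  # length of the prompt's candidate action list
    teacher_index: int   # index chosen by the teacher policy


def score(rows: List[Row]) -> Dict[str, Union[int, float]]:
    """Compute assumed versions of the benchmark columns."""
    if not rows:
        raise ValueError("need at least one row")
    strict = valid = match = 0
    for row in rows:
        try:
            payload = json.loads(row.raw_output)
        except json.JSONDecodeError:
            continue
        # Assumed "strict JSON": an object with exactly the key "action_index"
        # holding a non-boolean integer.
        if not isinstance(payload, dict) or set(payload) != {"action_index"}:
            continue
        index = payload["action_index"]
        if isinstance(index, bool) or not isinstance(index, int):
            continue
        strict += 1
        # Assumed "valid index": strict JSON whose index is in range.
        if 0 <= index < row.num_candidates:
            valid += 1
            # Assumed "teacher match": valid index equal to the teacher's choice.
            if index == row.teacher_index:
                match += 1
    n = len(rows)
    return {
        "rows": n,
        "strict_json": strict / n,
        "valid_index": valid / n,
        "teacher_match": match / n,
        "invalid": n - valid,
    }
```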

Evaluation command:

HF_HOME=/workspace/.hf_home python -m training.benchmark_policy \
  --dataset data/sft/risk_sft_stratified_50k_val.jsonl \
  --model models/adapters/qwen3_4b_risk_sft \
  --out data/prefs/benchmark_heldout.json \
  --limit 1000 \
  --seed 3407

Linked Artifacts

Limitations

This is a research adapter for a simplified Risk-compatible environment. It is not a standalone base model and it is not a guarantee of optimal play. The environment should remain authoritative and validate every selected action.
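
Because the adapter is not a standalone model, it has to be attached to the base model before use. The sketch below shows one plausible way to do that with the transformers and peft libraries. The `load_policy` helper is hypothetical, and dtype, quantization, and device placement are left at library defaults here, so they will likely need adjusting for the available hardware.

```python
BASE_MODEL_ID = "Qwen/Qwen3-4B-Instruct-2507"
ADAPTER_ID = "clarkkitchen22/qwen3-4b-risk-sft-lora"


def load_policy(base_model_id: str = BASE_MODEL_ID, adapter_id: str = ADAPTER_ID):
    """Load the base model and attach the LoRA adapter (hypothetical helper)."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    base_model = AutoModelForCausalLM.from_pretrained(base_model_id)
    # Wrap the base model with the adapter weights from the Hub or a local path.
    model = PeftModel.from_pretrained(base_model, adapter_id)
    model.eval()
    return tokenizer, model
```

Whatever the loaded model generates should still pass through the JSON contract check and the environment's own validation before an action is applied.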
