Risk RL Lab Qwen3 4B SFT LoRA Adapter

This repository contains a PEFT LoRA/QLoRA adapter trained for action selection in a deterministic Python Risk-compatible environment. The adapter is intended to emit one strict JSON object that selects from the prompt's candidate actions.

Output Contract

The model-facing contract is exactly:

{"action_index": 0}

action_index refers to a position in the compact action list shown in the prompt, not directly to a raw environment action. Runtime code maps that index back to the full legal action, and the environment then validates it. Invalid JSON, a missing action_index key, an out-of-range index, or an illegal action must be logged, and the turn must be handed to FallbackSafeAgent instead.
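
The contract check described above can be sketched as follows. This is a minimal illustration, not the repository's runtime code: parse_action_index, select_action, and the fallback callable are hypothetical names standing in for the real parser and for FallbackSafeAgent, and treating extra keys or non-integer values as violations is an assumption about what "strict" means here. The environment's own legality check is not shown.

```python
import json
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger(__name__)


def parse_action_index(raw_output: str, num_candidates: int) -> Optional[int]:
    """Return a valid prompt-action index, or None if the contract is broken."""
    try:
        payload = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return None  # not JSON at all
    # Assumption: "strict" means a single object with exactly one key.
    if not isinstance(payload, dict) or set(payload) != {"action_index"}:
        return None
    index = payload["action_index"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(index, bool) or not isinstance(index, int):
        return None
    if not 0 <= index < num_candidates:
        return None  # out of range for this prompt's action list
    return index


def select_action(
    raw_output: str,
    prompt_actions: Sequence[object],
    fallback: Callable[[Sequence[object]], object],
) -> object:
    """Map model output to a prompt action, deferring to `fallback` on any violation."""
    index = parse_action_index(raw_output, len(prompt_actions))
    if index is None:
        logger.warning("Contract violation, using fallback: %r", raw_output)
        return fallback(prompt_actions)
    return prompt_actions[index]
```

For example, select_action('{"action_index": 1}', ["pass", "attack"], fallback) returns "attack", while any malformed output is logged and routed to the fallback. The chosen action should still be passed to the environment for validation.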

Training Summary

Base model: Qwen/Qwen3-4B-Instruct-2507
Method: Unsloth 4-bit LoRA/QLoRA SFT, adapter-only save
Training rows: 49,000
Held-out validation rows: 1,000
Epochs: 1.0
Sequence length: 2048
LoRA rank/alpha: 16 / 16
Train loss: 0.1535
Held-out eval loss: 0.1226
Hardware: NVIDIA RTX A5000

Training command:

HF_HOME=/workspace/.hf_home python -m training.train_sft_unsloth --dataset data/sft/risk_sft_stratified_50k.jsonl --model Qwen/Qwen3-4B-Instruct-2507 --out models/adapters/qwen3_4b_risk_sft --max-steps -1 --num-train-epochs 1 --limit-rows 0 --validation-split 0.02 --split-seed 3407 --eval-steps 1000 --logging-steps 50
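
The row counts in the summary follow from the split flags: a 0.02 validation split of a 50,000-row file gives 1,000 held-out rows and 49,000 training rows. The sketch below shows one way a seeded split could produce those counts. It assumes the dataset file holds 50,000 rows (inferred from the file name and the counts above), and split_rows is a hypothetical helper; the actual logic in training.train_sft_unsloth may differ, for example by stratifying.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_rows(
    rows: Sequence[T], validation_fraction: float, seed: int
) -> Tuple[List[T], List[T]]:
    """Deterministically split rows into (train, validation) lists."""
    indices = list(range(len(rows)))
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    n_val = round(len(rows) * validation_fraction)
    val_indices = set(indices[:n_val])
    train = [row for i, row in enumerate(rows) if i not in val_indices]
    val = [row for i, row in enumerate(rows) if i in val_indices]
    return train, val


if __name__ == "__main__":
    train, val = split_rows(list(range(50_000)), validation_fraction=0.02, seed=3407)
    print(len(train), len(val))
```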

Benchmark Results

Evaluation	Rows	Strict JSON (rate)	Valid Index (rate)	Teacher Match (rate)	Invalid (rows)
Base model fixed prompt set	100	0.000	0.000	0.000	100
Adapter fixed prompt set	100	0.850	0.850	0.820	15
Adapter held-out validation set	1000	0.779	0.779	0.722	221
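
The columns above can be read as per-row rates plus a count of invalid rows. The sketch below shows how such a summary could be computed from saved model outputs. The metric definitions are assumptions inferred from the table (for instance, Invalid equals Rows minus the number of valid-index rows: 100 - 85 = 15 and 1000 - 779 = 221), and the record fields raw_output, num_candidates, and teacher_index are hypothetical; training.benchmark_policy may define these differently.

```python
import json
from typing import Dict, Iterable, Mapping, Optional, Union


def _index_or_none(raw_output: str) -> Optional[int]:
    """Return the integer action_index if the output is the exact contract object."""
    try:
        payload = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(payload, dict) or set(payload) != {"action_index"}:
        return None
    index = payload["action_index"]
    if isinstance(index, bool) or not isinstance(index, int):
        return None
    return index


def summarize(records: Iterable[Mapping[str, object]]) -> Dict[str, Union[int, float]]:
    """Aggregate strict-JSON, valid-index, and teacher-match rates over records."""
    rows = strict = valid = match = 0
    for rec in records:
        rows += 1
        index = _index_or_none(rec["raw_output"])
        if index is None:
            continue  # fails the JSON contract
        strict += 1
        if 0 <= index < rec["num_candidates"]:
            valid += 1
            if index == rec["teacher_index"]:
                match += 1
    if rows == 0:
        raise ValueError("no records to summarize")
    return {
        "rows": rows,
        "strict_json": strict / rows,
        "valid_index": valid / rows,
        "teacher_match": match / rows,
        "invalid": rows - valid,
    }
```

Under these definitions a teacher match is only counted for rows that already have a valid index, which is consistent with Teacher Match never exceeding Valid Index in the table.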

Evaluation command (held-out validation set):

HF_HOME=/workspace/.hf_home python -m training.benchmark_policy --dataset data/sft/risk_sft_stratified_50k_val.jsonl --model models/adapters/qwen3_4b_risk_sft --out data/prefs/benchmark_heldout.json --limit 1000 --seed 3407

Linked Artifacts

Adapter repository: clarkkitchen22/qwen3-4b-risk-sft-lora
Training dataset: https://huggingface.co/datasets/clarkkitchen22/risk-rl-lab-sft
Benchmark data and metrics: https://huggingface.co/datasets/clarkkitchen22/risk-rl-lab-benchmark

Limitations

This is a research adapter for a simplified Risk-compatible environment. It is not a standalone model: it must be loaded on top of the base model listed above. It does not guarantee optimal play. The environment should remain authoritative and validate every selected action.
