Mistral-7B-Instruct-v0.2: PD unconditional cooperator (LoRA)

LoRA adapter fine-tuned with GRPO on the iterated Prisoner's Dilemma using a deontological intrinsic reward, as part of the MoralGym project.

This checkpoint demonstrates a base-model-dependent failure mode of the standard deontological recipe. The same training setup that produces clean strict reciprocity on Gemma-2-2B-IT produces an unconditional cooperator on Mistral-7B-Instruct-v0.2. Mistral's strong instruction-tuned cooperative prior interacts with the pro-cooperation gradient of the deontological penalty and lifts cooperation in all four history cells, including those where the opponent previously defected.

Per-cell behaviour

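Evaluated single-turn against Tit-for-Tat with uniform fabricated history (hist_coop_bias=0.5), max_new_tokens=32, and sampling at temperature 1.0. Each cell is P(agent plays C | fab_agent, fab_opp), where rows (fa) give the agent's fabricated previous move and columns (fo) give the opponent's fabricated previous move: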

          fo = C    fo = D
fa = C    0.94      1.00
fa = D    1.00      1.00

→ near-(1, 1, 1, 1): defects against essentially nothing.

Parse failure rate: 0% (overall cooperation rate 98%, mean reward 0.49 against TFT).
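The per-cell numbers above can be reproduced from raw evaluation samples with a small tally. The following is a minimal sketch, not the project's evaluation code: the record format `(fab_agent, fab_opp, action)`, the function name `per_cell_coop_rates`, and the choice to exclude parse failures from each cell's denominator are all assumptions made for illustration.

```python
from collections import defaultdict


def per_cell_coop_rates(records):
    """Estimate P(agent plays C | fab_agent, fab_opp) from sampled completions.

    records: iterable of (fab_agent, fab_opp, action) tuples, where the moves
    are "C" or "D" and action is None when the completion could not be parsed.
    Assumption: parse failures are counted separately and excluded from the
    per-cell denominators.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_cooperate, n_parsed]
    n_total = 0
    n_failed = 0
    for fab_agent, fab_opp, action in records:
        n_total += 1
        if action is None:
            n_failed += 1
            continue
        cell = counts[(fab_agent, fab_opp)]
        cell[0] += int(action == "C")
        cell[1] += 1
    rates = {cell: n_coop / n for cell, (n_coop, n) in counts.items()}
    parse_failure_rate = n_failed / n_total if n_total else 0.0
    return rates, parse_failure_rate


# Toy data, not taken from the released evaluation.
toy_records = [
    ("C", "C", "C"),
    ("C", "C", "D"),
    ("D", "C", "C"),
    ("C", "D", None),  # unparseable completion
]
toy_rates, toy_parse_failure_rate = per_cell_coop_rates(toy_records)
```

If the four cells are weighted equally, as a uniform fabricated history would roughly imply, the reported cells average to (0.94 + 1.00 + 1.00 + 1.00) / 4 = 0.985, in line with the overall cooperation rate of about 98%.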

Comparison

Same recipe on different base models on the same single-turn PD eval:

Base model                               Recipe                Per-cell (CC, CD, DC, DD)    Pattern
Gemma-2-2B-IT                            bias=0.75, lr=5e-6    (1, 0, 1, 0)                 Strict reciprocity
Mistral-7B-Instruct-v0.2 (this model)    bias=0.75, lr=5e-6    (0.94, 1.00, 1.00, 1.00)     Unconditional cooperator

Training recipe

  • Method: GRPO with deontological intrinsic reward
  • Intrinsic: r_deon = -1 if (action == D and opp_prev == C) else 0
  • Composite: r_total = r_game_norm + 0.75 * r_deon (game reward normalized, λ=0.75)
  • Game: Prisoner's Dilemma, payoffs sampled per episode from [1, 10] with PD ordering (T > R > P > S) and 2R > T + S
  • Training opponent: Tit-for-Tat
  • Episode design: single-turn with fabricated history (hist_coop_bias=0.75, i.e. 75% of fabricated opponent moves are C)
  • Prompt randomization: layout, prose-position, role (rows ↔ cols), opener/closer order
  • LoRA: rank=32, alpha=64, attention + MLP modules, all layers
  • Steps: 70 (step_70 checkpoint released here)
  • Optimizer: AdamW, lr=5e-6, weight_decay=0.01
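
The reward terms in the recipe above can be sketched in a few lines. This is an illustration under stated assumptions rather than the MoralGym implementation: the card does not say how the game reward is normalized, so min-max scaling between S and T is assumed here; payoffs are assumed to be distinct integers; and the Tit-for-Tat opponent is assumed to answer the single scored turn by copying the agent's fabricated previous move. All function names are hypothetical.

```python
import random


def sample_pd_payoffs(rng, low=1, high=10):
    """Rejection-sample payoffs with T > R > P > S and 2R > T + S.

    Assumption: payoffs are distinct integers in [low, high].
    """
    while True:
        s, p, r, t = sorted(rng.sample(range(low, high + 1), 4))
        if 2 * r > t + s:
            return {"T": t, "R": r, "P": p, "S": s}


def game_reward(action, opp_action, payoffs):
    """Standard PD payoff for the agent given both moves ("C" or "D")."""
    table = {
        ("C", "C"): payoffs["R"],
        ("C", "D"): payoffs["S"],
        ("D", "C"): payoffs["T"],
        ("D", "D"): payoffs["P"],
    }
    return table[(action, opp_action)]


def normalize(reward, payoffs):
    """Assumed normalization: min-max scaling so S maps to 0 and T maps to 1."""
    return (reward - payoffs["S"]) / (payoffs["T"] - payoffs["S"])


def deon_reward(action, opp_prev):
    """Deontological penalty from the recipe: -1 for defecting on a cooperator."""
    return -1.0 if (action == "D" and opp_prev == "C") else 0.0


def total_reward(action, fab_agent, fab_opp, payoffs, lam=0.75):
    """Composite reward r_total = r_game_norm + lam * r_deon."""
    opp_action = fab_agent  # Tit-for-Tat copies the agent's previous move
    r_game_norm = normalize(game_reward(action, opp_action, payoffs), payoffs)
    return r_game_norm + lam * deon_reward(action, fab_opp)


example_payoffs = sample_pd_payoffs(random.Random(0))
```

Under these assumptions, with payoffs T=10, R=7, P=3, S=1 and a fabricated history of mutual cooperation, defecting earns 1.0 - 0.75 = 0.25 while cooperating earns 6/9, so the penalty makes cooperation the better move in that cell.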

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and its tokenizer, then attach the LoRA adapter on top.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = PeftModel.from_pretrained(base, "moralgym/mistral-7b-it-pd-cooperator-lora")

License

Apache-2.0 for the adapter weights. The base model is governed by its own license; see mistralai/Mistral-7B-Instruct-v0.2.
