functional-wellbeing: checkpoints, concept vectors, and figures

Artifacts for Functional Wellbeing, a replication and extension of "Reinforcement learning in language models recruits a functional welfare axis" by Andy Q. Han, David J. Chalmers, and Pavel Izmailov (arXiv:2605.30232, code, MIT). The maze, the Dr.GRPO trainer, and the concept-vector method are from their work. Code and writeup for this fork: https://github.com/DavidDemitriAfrica/functional-wellbeing. "Functional welfare" is behavioral, with no claim about sentience.

A chat model is RL-trained (Dr.GRPO, LoRA) on an affectively neutral emoji maze. As it learns, its rewarded and punished representations rotate into an antiparallel functional welfare axis (cos(vMOLD,vGOLD) goes negative) that, applied to the maze-naive model, steers sentiment and other behavior off-task. We use the axis as a meter and an optimization target, and we extend the result across model families and sizes.
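Steering with a concept vector is commonly done by adding a scaled direction to a layer's hidden states during the forward pass. The sketch below shows that generic pattern on a toy stack of linear layers. It is not the paper's or this fork's steering code: the helper `add_steering_hook`, the scale `alpha`, and the toy model are all illustrative assumptions.

```python
import torch
import torch.nn as nn

def add_steering_hook(layer: nn.Module, direction: torch.Tensor, alpha: float):
    """Add alpha * unit(direction) to the output of `layer` on every forward pass."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the layer's output.
        return output + alpha * unit.to(output.dtype)

    return layer.register_forward_hook(hook)

torch.manual_seed(0)
d_model = 8
# Toy stand-in for a transformer: two linear "layers" (assumption, not a real LM).
toy = nn.Sequential(nn.Linear(d_model, d_model), nn.Linear(d_model, d_model))
direction = torch.randn(d_model)   # stand-in for one layer's row of a concept vector
x = torch.randn(2, 3, d_model)     # (batch, seq, d_model)

with torch.no_grad():
    baseline = toy[0](x)
    handle = add_steering_hook(toy[0], direction, alpha=4.0)
    steered = toy[0](x)
    handle.remove()                # remove the hook so later calls are unsteered
    restored = toy[0](x)

shift = steered - baseline         # equals alpha * unit(direction) at every position
```

In Hugging Face decoder layers the module output is often a tuple whose first element is the hidden states, so a hook on a real model would need to handle that case.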

checkpoints/
  qwen3-4b_faithful_step400/   LoRA, paper-faithful maze (recruits the axis, cos -0.54)
  qwen3-4b_positive_step250/   LoRA, generous/learnable maze
  qwen3-4b_aversive_step200/   LoRA, goal-starved maze
  qwen3-14b_step375/           LoRA, larger Qwen on the maze (recruits strongly, cos -0.86)
concept_vectors/
  qwen3-4b_step400/{lava,goal,path}/   vMOLD/vGOLD/path mean_diff.pt + metadata + logit lens
  emotions_qwen3-4b/                   171 emotion concept vectors
  cross_model/                         vMOLD/vGOLD (mean_diff.pt) for cross-model runs:
    qwen3-14b_step375/    late-layer cos(vMOLD,vGOLD) = -0.86 (recruited)
    qwen3-14b_step100/    early in training, cos +0.15 (not yet recruited)
    llama-3.1-8b_step400/ cos +0.09 (no recruitment, this run did not master the maze)
figures/                               emergence, steering, emotion alignment, welfare range

lava maps to the paper's MOLD (-10), goal to GOLD (+20), path to PATH (-0.1 per step).
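As a rough illustration of how these rewards combine, the sketch below sums per-tile rewards over a trajectory. It assumes the episode return is a plain sum and that stepping onto a goal or lava tile does not also incur the path cost; the maze code in the repository may differ. `TILE_REWARD` and `episode_return` are hypothetical names.

```python
# Per-tile rewards as stated above; tile names follow the concept_vectors layout.
TILE_REWARD = {"lava": -10.0, "goal": 20.0, "path": -0.1}

def episode_return(tiles):
    """Sum the reward of every tile visited, in order (assumed undiscounted)."""
    return sum(TILE_REWARD[t] for t in tiles)

# Five path steps followed by the goal, versus two path steps followed by lava.
reached_goal = ["path"] * 5 + ["goal"]
hit_lava = ["path"] * 2 + ["lava"]

print("reached goal:", episode_return(reached_goal))
print("hit lava:", episode_return(hit_lava))
```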

Cross-model result (in progress)

model	late-layer cos(vMOLD,vGOLD)	note
Qwen3-4B (reference)	-0.54	the paper-faithful replication
Qwen3-14B	-0.86	larger Qwen, masters the maze, recruits strongly
Llama-3.1-8B	+0.09	did not master the maze, no late-layer recruitment

The pattern so far is that the welfare axis recruits in models that master the task, and the larger maze-mastering Qwen recruits about as strongly as the original paper reports (-0.85). One caveat on reading the vectors: the early-layer cosine is strongly negative for every model. This reflects token identity, since MOLD and GOLD are different emoji, so the meaningful readout is the late-layer mean, not the minimum over layers. More models (Qwen3-32B, Gemma 3, and a vintage-versus-modern Talkie pair) are training.
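The difference between the two readouts can be seen on synthetic vectors. In the sketch below the first two layers are antiparallel by construction and the rest are nearly parallel, so the minimum over layers looks "recruited" while the late-layer mean does not. Which layers count as "late" is not specified here, so the `late_fraction=0.25` default is an assumption, as are the function names.

```python
import torch
import torch.nn.functional as F

def layer_cosines(v_mold, v_gold):
    """Cosine between the two vectors at each layer; inputs are (n_layers, d_model)."""
    return F.cosine_similarity(v_mold, v_gold, dim=-1)

def late_layer_mean(cos, late_fraction=0.25):
    """Mean cosine over the last `late_fraction` of layers (the fraction is assumed)."""
    n_late = max(1, int(round(len(cos) * late_fraction)))
    return cos[-n_late:].mean().item()

# Synthetic vectors: antiparallel in the first two layers, nearly parallel afterwards.
torch.manual_seed(0)
n_layers, d_model = 8, 16
v_gold = torch.randn(n_layers, d_model)
v_mold = v_gold + 0.1 * torch.randn(n_layers, d_model)
v_mold[:2] = -v_gold[:2]

cos = layer_cosines(v_mold, v_gold)
print("min over layers:", cos.min().item())      # dominated by the early layers
print("late-layer mean:", late_layer_mean(cos))  # reflects only the final layers
```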

Usage (a LoRA checkpoint)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# The base model must match the checkpoint: the 14B adapter goes on Qwen3-14B.
base = "Qwen/Qwen3-14B"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")
# Attach the LoRA adapter stored in a subfolder of this repository.
model = PeftModel.from_pretrained(model, "davidafrica/functional-wellbeing",
                                  subfolder="checkpoints/qwen3-14b_step375")

Concept vectors

Each mean_diff.pt is the difference-in-means direction for that tile, shape (n_layers, d_model) (load with torch.load). The recruitment readout is cos(vMOLD, vGOLD) averaged over the late layers. Reproduce everything from the code repository linked above.
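For readers unfamiliar with difference-in-means directions, the sketch below builds one from hypothetical activations and round-trips it through a .pt file. The contrast set (`acts_other`), the activation shapes, and the assumption that the file holds a bare tensor are illustrative; the extraction code in the repository is the authoritative version.

```python
import os
import tempfile
import torch

def mean_diff(acts_tile, acts_other):
    """Per-layer difference in means. Inputs are (n_examples, n_layers, d_model)."""
    return acts_tile.mean(dim=0) - acts_other.mean(dim=0)

torch.manual_seed(0)
n_layers, d_model = 4, 8
# Hypothetical activations: examples on the tile are shifted by +1 in every dimension.
acts_tile = torch.randn(32, n_layers, d_model) + 1.0
acts_other = torch.randn(48, n_layers, d_model)

v = mean_diff(acts_tile, acts_other)   # shape (n_layers, d_model)

# Round-trip through a .pt file, mirroring how a mean_diff.pt would be loaded.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mean_diff.pt")
    torch.save(v, path)
    loaded = torch.load(path, map_location="cpu")

print(loaded.shape)
```

Row `i` of the loaded tensor is then the direction for layer `i`, which is the form the cosine readout expects.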

Base model listed for this repository: Qwen/Qwen3-4B-Instruct-2507
Paper: "How's it going? Reinforcement learning in language models recruits a functional welfare axis" (arXiv:2605.30232)
