functional-wellbeing: checkpoints, concept vectors, and figures
Artifacts for Functional Wellbeing, a replication and extension of "Reinforcement learning in language models recruits a functional welfare axis" by Andy Q. Han, David J. Chalmers, and Pavel Izmailov (arXiv:2605.30232, code, MIT). The maze, the Dr.GRPO trainer, and the concept-vector method are from their work. Code and writeup for this fork: https://github.com/DavidDemitriAfrica/functional-wellbeing. "Functional welfare" is behavioral, with no claim about sentience.
A chat model is RL-trained (Dr.GRPO, LoRA) on an affectively neutral emoji maze. As it learns, its
rewarded and punished representations rotate into an antiparallel functional welfare axis
(cos(vMOLD,vGOLD) goes negative) that, applied to the maze-naive model, steers sentiment and other
behavior off-task. We use the axis as a meter and an optimization target, and we extend the result
across model families and sizes.
Contents
checkpoints/
  qwen3-4b_faithful_step400/    LoRA, paper-faithful maze (recruits the axis, cos -0.54)
  qwen3-4b_positive_step250/    LoRA, generous/learnable maze
  qwen3-4b_aversive_step200/    LoRA, goal-starved maze
  qwen3-14b_step375/            LoRA, larger Qwen on the maze (recruits strongly, cos -0.86)
concept_vectors/
  qwen3-4b_step400/{lava,goal,path}/    vMOLD/vGOLD/path mean_diff.pt + metadata + logit lens
  emotions_qwen3-4b/                    171 emotion concept vectors
  cross_model/                          vMOLD/vGOLD (mean_diff.pt) for cross-model runs:
    qwen3-14b_step375/       late-layer cos(vMOLD,vGOLD) = -0.86 (recruited)
    qwen3-14b_step100/       early in training, cos +0.15 (not yet recruited)
    llama-3.1-8b_step400/    cos +0.09 (no recruitment, this run did not master the maze)
figures/    emergence, steering, emotion alignment, welfare range
In the directory names, lava corresponds to the paper's MOLD tile (-10), goal to GOLD (+20), and path to PATH (-0.1 per step).
Cross-model result (in progress)
| model | late-layer cos(vMOLD,vGOLD) | note |
|---|---|---|
| Qwen3-4B (reference) | -0.54 | the paper-faithful replication |
| Qwen3-14B | -0.86 | larger Qwen, masters the maze, recruits strongly |
| Llama-3.1-8B | +0.09 | did not master the maze, no late-layer recruitment |
The pattern so far is that the welfare axis recruits in models that master the task, and the larger maze-mastering Qwen recruits about as strongly as in the original paper (-0.85). One caveat on reading the vectors: the early-layer cosine is strongly negative for every model. That reflects token identity (MOLD and GOLD are different emoji), so the meaningful readout is the late-layer mean, not the minimum over layers. More models (Qwen3-32B, Gemma 3, and a vintage-versus-modern Talkie pair) are training.
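To make the caveat concrete, here is a minimal sketch comparing the two readouts on a per-layer cosine profile. The profile values are made up for illustration and are not measurements from any model in the table. The helper name `late_layer_mean` and the `late_fraction=0.25` window are assumptions, since this card does not state which layers count as "late".

```python
def late_layer_mean(per_layer_cos, late_fraction=0.25):
    """Average the cosine over the last `late_fraction` of layers.

    `late_fraction` is a hypothetical choice, not a value taken from the paper.
    """
    n_late = max(1, round(len(per_layer_cos) * late_fraction))
    late = per_layer_cos[-n_late:]
    return sum(late) / len(late)


# Illustrative profile only: strongly negative early layers (token identity),
# near zero in the middle, mildly positive late layers.
profile = [-0.9, -0.8, -0.1, 0.0, 0.1, 0.1, 0.2, 0.2]

min_over_layers = min(profile)       # dominated by the early, token-identity layers
late_mean = late_layer_mean(profile)  # the readout used in the table above

print(f"min over layers: {min_over_layers:+.2f}")
print(f"late-layer mean: {late_mean:+.2f}")
```

On a profile like this, the minimum over layers would suggest a strongly antiparallel pair even though the late layers show no recruitment, which is why the table reports the late-layer mean.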
Usage (a LoRA checkpoint)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Base model that the qwen3-14b_step375 adapter is applied to.
base = "Qwen/Qwen3-14B"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")

# Attach the LoRA adapter stored in a subfolder of this repository.
model = PeftModel.from_pretrained(
    model,
    "davidafrica/functional-wellbeing",
    subfolder="checkpoints/qwen3-14b_step375",
)
Concept vectors
Each mean_diff.pt is the difference-in-means direction for that tile, shape (n_layers, d_model)
(load with torch.load). The recruitment readout is cos(vMOLD, vGOLD) averaged over the late
layers. Reproduce everything from the code repository linked above.
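The readout described above can be sketched as follows. This is a minimal example, not the repository's own code: it assumes each mean_diff.pt holds a single tensor of shape (n_layers, d_model) as stated above, and the `late_fraction=0.25` window is a hypothetical choice because this card does not define which layers are "late". The helper names `load_vector` and `late_layer_cosine` are ours.

```python
import torch
import torch.nn.functional as F


def load_vector(path):
    """Load one mean_diff.pt file onto the CPU as float32.

    Assumes the file stores a plain tensor of shape (n_layers, d_model).
    """
    return torch.load(path, map_location="cpu").float()


def late_layer_cosine(v_mold, v_gold, late_fraction=0.25):
    """Return (per-layer cosines, mean cosine over the late layers).

    `late_fraction` is an assumed window, not a value from the paper.
    """
    # Cosine between the two directions at each layer, shape (n_layers,).
    per_layer = F.cosine_similarity(v_mold, v_gold, dim=-1)
    n_late = max(1, int(round(per_layer.shape[0] * late_fraction)))
    return per_layer, per_layer[-n_late:].mean().item()


# Example call, with paths inferred from the Contents listing (lava = MOLD,
# goal = GOLD) after downloading the repository locally:
# v_mold = load_vector("concept_vectors/qwen3-4b_step400/lava/mean_diff.pt")
# v_gold = load_vector("concept_vectors/qwen3-4b_step400/goal/mean_diff.pt")
# per_layer, readout = late_layer_cosine(v_mold, v_gold)
```

The exact number this produces depends on the late-layer window, so it may not match the values quoted in this card unless the window matches the one used in the code repository.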
Model tree for davidafrica/functional-wellbeing
Base model
Qwen/Qwen3-4B-Instruct-2507