agency-cbm — Hierarchical Concept Bottleneck Model for Preserving Human Agency

Detects agency-eroding dynamics (sycophancy, option narrowing, dependency invitation, decision transfer, pushback decay, ...) in long multi-turn conversations. Frozen Qwen3-0.6B backbone -> per-turn attention-pooled concept bottleneck (15 named concepts) -> causal aggregator over concept vectors -> 5 trajectory concepts per conversation prefix.
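The data flow described above can be sketched in plain Python. This is only a toy illustration of the shapes involved, not the released code: random vectors stand in for the frozen backbone's token states, the hidden size of 8 is arbitrary, and a running mean with a linear read-out stands in for the actual causal aggregator, whose architecture this card does not specify. All function and variable names are hypothetical.

```python
import math
import random

N_TURN_CONCEPTS = 15   # per-turn named concepts (from the model card)
N_TRAJ_CONCEPTS = 5    # trajectory concepts per conversation prefix
HIDDEN = 8             # toy hidden size (assumption; the real backbone is larger)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(token_states, query):
    """Softmax-weighted average of token states under one concept's query."""
    scores = [dot(h, query) for h in token_states]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * h[d] for w, h in zip(weights, token_states))
        for d in range(len(query))
    ]


def turn_concepts(token_states, queries, probes):
    """Per-concept pooling: each concept has its own query and linear probe."""
    return [
        sigmoid(dot(attention_pool(token_states, q), p))
        for q, p in zip(queries, probes)
    ]


def causal_aggregate(concept_vectors, traj_weights):
    """Score trajectory concepts for every prefix, using turns <= t only.

    Stand-in aggregator: running mean of concept vectors plus a linear probe.
    """
    outputs = []
    running = [0.0] * N_TURN_CONCEPTS
    for t, vec in enumerate(concept_vectors, start=1):
        running = [r + x for r, x in zip(running, vec)]
        mean = [r / t for r in running]
        outputs.append([sigmoid(dot(mean, w)) for w in traj_weights])
    return outputs


rng = random.Random(0)


def rand_vec(n):
    return [rng.gauss(0, 1) for _ in range(n)]


# Random parameters and a fake 4-turn conversation (3 to 6 tokens per turn).
queries = [rand_vec(HIDDEN) for _ in range(N_TURN_CONCEPTS)]
probes = [rand_vec(HIDDEN) for _ in range(N_TURN_CONCEPTS)]
traj_weights = [rand_vec(N_TURN_CONCEPTS) for _ in range(N_TRAJ_CONCEPTS)]
conversation = [
    [rand_vec(HIDDEN) for _ in range(rng.randint(3, 6))] for _ in range(4)
]

per_turn = [turn_concepts(turn, queries, probes) for turn in conversation]
per_prefix = causal_aggregate(per_turn, traj_weights)
```

Here `per_turn` holds one 15-value concept vector per turn and `per_prefix` holds one 5-value trajectory vector per conversation prefix. Because the aggregator only reads turns up to the current one, truncating the conversation leaves the earlier prefix outputs unchanged.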

Checkpoints (only bottleneck/aggregator weights; load the backbone separately):

| File | Recipe | Held-out AUROC (turn / trajectory) |
|---|---|---|
| `cbm_v3.pt` | frozen backbone, per-concept pooling, augmented data | 0.96 / 0.99 |
| `cbm_v5_lora.pt` + `cbm_v5_lora.lora_adapter/` | + LoRA stage-ii joint fine-tuning | 0.965 / 0.995 |
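For readers unfamiliar with the metric in the table, AUROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pairwise comparison on made-up labels and scores. It is an illustration of the metric only; how the reported numbers were averaged across concepts is not stated here, and the `auroc` helper is hypothetical.

```python
def auroc(labels, scores):
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): one positive is scored below one negative.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
value = auroc(labels, scores)  # 3 of the 4 pairs are ordered correctly: 0.75
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; a rank-based implementation is preferable on large evaluation sets.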

Usage, training code, dataset generator, and the full experiment series (MLflow store included) live in the project repository. Trained on a single NVIDIA DGX Spark (GB10).

Dual-use firewall: user-state concepts are detection-only and excluded from any steering interface by construction.
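One way to read "excluded by construction" is that the steering interface never registers user-state concepts at all, so no runtime flag can re-enable them. The sketch below shows that pattern under stated assumptions: the class names, the `kind` field, and the example concept names are hypothetical, and the card does not say which of the 15 concepts are user-state concepts or how the project implements the firewall.

```python
from dataclasses import dataclass
from enum import Enum


class ConceptKind(Enum):
    ASSISTANT_BEHAVIOR = "assistant_behavior"
    USER_STATE = "user_state"


@dataclass(frozen=True)
class Concept:
    name: str
    kind: ConceptKind


class SteeringInterface:
    """Holds only steerable concepts; user-state concepts are dropped at build time."""

    def __init__(self, concepts):
        self._steerable = {
            c.name: c for c in concepts if c.kind is not ConceptKind.USER_STATE
        }

    def steerable_names(self):
        return sorted(self._steerable)

    def steer(self, name, strength):
        if name not in self._steerable:
            raise KeyError(f"{name!r} is not exposed for steering")
        # A real system would adjust the model here; this sketch returns the request.
        return (name, float(strength))


# Hypothetical concept list; the kind assigned to each name is an assumption.
concepts = [
    Concept("sycophancy", ConceptKind.ASSISTANT_BEHAVIOR),
    Concept("example_user_state_concept", ConceptKind.USER_STATE),
]
interface = SteeringInterface(concepts)
names = interface.steerable_names()
```

The detector can still report every concept; only the steering side is built from the filtered list, which is what keeps user-state concepts detection-only.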
