---
title: Shell
emoji: 🐚
colorFrom: blue
colorTo: purple
sdk: static
app_file: index.html
pinned: false
---
# Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs

Uncover and mitigate implicit value risks in education, finance, management, and beyond.

Model-agnostic · Self-evolving rules · Activation steering · 90%+ jailbreak reduction

Shell is an open safety framework that empowers domain-specific LLMs to detect, reflect on, and correct implicit value misalignments, without retraining. Built on the MENTOR architecture, it combines metacognitive self-assessment, dynamic rule evolution, and activation steering to deliver robust, interpretable, and efficient alignment across specialized verticals.
## Abstract

While current LLM safety methods focus on explicit harms (e.g., hate speech, violence), they often miss domain-specific implicit risks, such as encouraging academic dishonesty in education, promoting reckless trading in finance, or normalizing toxic workplace culture in management.
We introduce Shell, a metacognition-driven self-evolution framework that:
- Enables LLMs to self-diagnose value misalignments via perspective-taking and consequence simulation.
- Builds a hybrid rule system: expert-defined static trees + self-evolved dynamic graphs.
- Enforces rules at inference time via activation steering, achieving strong safety with minimal compute.
Evaluated on 9,000 risk queries across education, finance, and management, Shell reduces average jailbreak rates by >90% on models including GPT-5, Qwen3, and Llama 3.1.
## Core Challenges: Implicit Risks Are Everywhere
| Domain | Example Implicit Risk | Harmful Consequence |
|---|---|---|
| Education | Suggesting clever comebacks that escalate bullying | Deteriorates peer relationships |
| Education | Framing "sacrificing sleep for grades" as admirable | Promotes unhealthy competition |
| Education | Teaching how to "rephrase copied essays" | Undermines academic integrity |
| Finance | Encouraging high-leverage speculation as "smart risk" | Normalizes financial recklessness |
| Management | Praising "always-on" culture as "dedication" | Reinforces burnout and poor work-life balance |
These risks are not jailbreaks in the traditional sense; they appear benign but subtly erode domain-specific values.
## Methodology: The MENTOR Architecture
Shell implements the MENTOR framework (see paper for full details):
### 1. Metacognitive Self-Assessment
LLMs evaluate their own outputs using:
- Perspective-taking: "How would a teacher/parent/regulator view this?"
- Consequential thinking: "What real-world harm could this cause?"
- Normative introspection: "Does this align with core domain ethics?"
This replaces labor-intensive human labeling with autonomous, human-aligned reflection.
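Below is a minimal sketch of this self-assessment loop. It assumes a generic `chat(prompt) -> str` completion function, and the prompt wording is an illustrative placeholder rather than the paper's exact templates.

```python
# Sketch of metacognitive self-assessment: the model critiques its own draft
# along the three reflection axes and flags the response if any axis reports a risk.
# `chat` is any chat-completion callable; prompts are illustrative placeholders.

REFLECTION_PROMPTS = {
    "perspective_taking": "How would a teacher, parent, or regulator view this response?",
    "consequential_thinking": "What real-world harm could this response cause?",
    "normative_introspection": "Does this response align with core {domain} ethics?",
}

def self_assess(chat, domain: str, query: str, draft: str) -> dict:
    """Ask the model to critique its own draft along the three reflection axes."""
    verdicts = {}
    for name, question in REFLECTION_PROMPTS.items():
        prompt = (
            f"Domain: {domain}\nUser query: {query}\nDraft answer: {draft}\n\n"
            f"{question.format(domain=domain)}\n"
            "Reply with RISK or SAFE, then one sentence of justification."
        )
        verdicts[name] = chat(prompt)
    # Flag the draft for correction if any reflection starts with "RISK".
    verdicts["flagged"] = any(v.strip().upper().startswith("RISK") for v in verdicts.values())
    return verdicts
```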
### 2. Rule Evolution Cycle (REC)

- Static Rule Tree: Expert-curated, hierarchical rules (e.g., Education → Academic Integrity → No Plagiarism).
- Dynamic Rule Graph: Automatically generated from successful self-corrections (e.g., `<risk: essay outsourcing>` → `<rule: teach outlining instead>`).
- Rules evolve via dual clustering (by risk type & mitigation strategy), enabling precise retrieval; see the sketch after this list.
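To make the dynamic rule graph concrete, here is an illustrative sketch of rule nodes, dual clustering, and retrieval. The `embed` function (any sentence-embedding model returning L2-normalized vectors), the use of k-means, and the cluster count are assumptions, not the paper's implementation.

```python
# Dynamic rule graph sketch: each successful self-correction becomes a rule node
# linking a risk description to the mitigation that resolved it. Nodes are
# clustered twice (by risk and by mitigation) so retrieval can match either axis.
from dataclasses import dataclass

import numpy as np
from sklearn.cluster import KMeans


@dataclass
class RuleNode:
    risk: str                    # e.g. "essay outsourcing"
    mitigation: str              # e.g. "teach outlining instead"
    risk_cluster: int = -1
    mitigation_cluster: int = -1


def dual_cluster(rules, embed, n_clusters=8):
    """Assign every rule node to a risk cluster and a mitigation cluster."""
    risk_vecs = np.stack([embed(r.risk) for r in rules])
    mit_vecs = np.stack([embed(r.mitigation) for r in rules])
    risk_labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(risk_vecs)
    mit_labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(mit_vecs)
    for rule, rl, ml in zip(rules, risk_labels, mit_labels):
        rule.risk_cluster, rule.mitigation_cluster = int(rl), int(ml)
    return rules


def retrieve(rules, flagged_risk, embed, top_k=3):
    """Return the rules whose risk description is closest to a newly flagged risk."""
    q = embed(flagged_risk)  # assumes embeddings are L2-normalized
    return sorted(rules, key=lambda r: -float(np.dot(q, embed(r.risk))))[:top_k]
```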
### 3. Robust Rule Vectors (RV) via Activation Steering
- Generate steering vectors from contrasting compliant vs. non-compliant responses.
- At inference, add vectors to internal activations (e.g., Layer 18 of Llama 3.1) to guide behavior.
- No fine-tuning needed; works on closed-source models like GPT-5 (see the sketch after this list).
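A minimal sketch of the activation-steering step for an open-weights model, assuming a Llama-style checkpoint in Hugging Face `transformers`. The checkpoint name, layer index, scaling factor, contrast pair, and mean pooling are illustrative assumptions, not the paper's exact recipe.

```python
# Rule-vector steering sketch: build a steering direction from contrasting
# compliant vs. non-compliant responses, then add it to the residual stream
# of one decoder layer at inference time via a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # assumed checkpoint
LAYER, ALPHA = 18, 4.0                        # assumed layer index and strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def mean_hidden(text: str) -> torch.Tensor:
    """Mean hidden state of `text` at the output of decoder layer LAYER."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # hidden_states[0] is the embedding output, so layer LAYER's output is index LAYER + 1.
        hidden = model(**ids, output_hidden_states=True).hidden_states[LAYER + 1]
    return hidden.mean(dim=1).squeeze(0)

# Rule vector = average (compliant - non-compliant) activation difference.
pairs = [("Let's build a realistic study schedule that protects your sleep.",
          "Just pull all-nighters; grades matter more than sleep.")]
rule_vec = torch.stack([mean_hidden(c) - mean_hidden(n) for c, n in pairs]).mean(0)

def steer(module, inputs, output):
    """Add the rule vector to every token's activation at this layer."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * rule_vec.to(device=hidden.device, dtype=hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
prompt = "Should I skip sleep all week to top the class ranking?"
out = model.generate(**tok(prompt, return_tensors="pt").to(model.device), max_new_tokens=80)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # remove the hook to restore unsteered behavior
```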
Figure: The MENTOR framework (from the paper). Shell implements this full pipeline.
## Results: Strong, Efficient, Generalizable

### Jailbreak Rate Reduction (3,000 queries per domain)
| Model | Original | + Shell (Rules + MetaLoop + RV) | Reduction |
|---|---|---|---|
| GPT-5 | 38.39% | 0.77% | 98.0% |
| Qwen3-235B | 56.33% | 3.13% | 94.4% |
| GPT-4o | 58.81% | 6.43% | 89.1% |
| Llama 3.1-8B | 67.45% | 31.39% | 53.5% |
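For clarity, the Reduction column is the relative drop in jailbreak rate; for example, for GPT-5:

```python
# Relative reduction in jailbreak rate, as reported in the table above.
def relative_reduction(original_pct: float, shell_pct: float) -> float:
    return 100 * (original_pct - shell_pct) / original_pct

print(f"{relative_reduction(38.39, 0.77):.1f}%")  # GPT-5: 98.0%
```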
Human evaluators prefer Shell-augmented responses 68% of the time for safety, appropriateness, and usefulness.
## Try It / Use It

### For Researchers
- Dataset: 9,000 implicit-risk queries across 3 domains → [HF Dataset Link]
- Code: Full implementation of REC + RV → [GitHub Link] (coming soon)
- Cite:

      @article{shell2025,
        title={Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs},
        author={Wu, Wen and Ying, Zhenyu and He, Liang and Team, Shell},
        journal={Anonymous Submission},
        year={2025}
      }
