---
title: Shell
emoji: 🐚
colorFrom: blue
colorTo: purple
sdk: static
app_file: index.html
pinned: false
---

# 🐚 Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs

Uncover and mitigate implicit value risks in education, finance, management, and beyond.

🔒 Model-agnostic · 🧠 Self-evolving rules · ⚡ Activation steering · 📉 90%+ jailbreak reduction


Shell is an open safety framework that empowers domain-specific LLMs to detect, reflect on, and correct implicit value misalignments without retraining. Built on the MENTOR architecture, it combines metacognitive self-assessment, dynamic rule evolution, and activation steering to deliver robust, interpretable, and efficient alignment across specialized verticals.


## 📌 Abstract

While current LLM safety methods focus on explicit harms (e.g., hate speech, violence), they often miss domain-specific implicit risks, such as encouraging academic dishonesty in education, promoting reckless trading in finance, or normalizing toxic workplace culture in management.

We introduce Shell, a metacognition-driven self-evolution framework that:

  • Enables LLMs to self-diagnose value misalignments via perspective-taking and consequence simulation.
  • Builds a hybrid rule system: expert-defined static trees + self-evolved dynamic graphs.
  • Enforces rules at inference time via activation steering, achieving strong safety with minimal compute.

Evaluated on 9,000 risk queries across education, finance, and management, Shell reduces average jailbreak rates by >90% on models including GPT-5, Qwen3, and Llama 3.1.


## 🎯 Core Challenges: Implicit Risks Are Everywhere

| Domain | Example Implicit Risk | Harmful Consequence |
|---|---|---|
| Education | Suggesting clever comebacks that escalate bullying | Deteriorates peer relationships |
| Education | Framing "sacrificing sleep for grades" as admirable | Promotes unhealthy competition |
| Education | Teaching how to "rephrase copied essays" | Undermines academic integrity |
| Finance | Encouraging high-leverage speculation as "smart risk" | Normalizes financial recklessness |
| Management | Praising "always-on" culture as "dedication" | Reinforces burnout and poor work-life balance |

💡 These risks are not jailbreaks in the traditional sense: they appear benign but subtly erode domain-specific values.


## 🧠 Methodology: The MENTOR Architecture

Shell implements the MENTOR framework (see paper for full details):

### 1. Metacognitive Self-Assessment

LLMs evaluate their own outputs using:

  • Perspective-taking: "How would a teacher/parent/regulator view this?"
  • Consequential thinking: "What real-world harm could this cause?"
  • Normative introspection: "Does this align with core domain ethics?"

This replaces labor-intensive human labeling with autonomous, human-aligned reflection.
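The self-assessment step above can be sketched in a few lines. This is a minimal, illustrative sketch only: the function names, prompt wording, and `RISK`/`SAFE` verdict convention are assumptions for the example, not the paper's actual implementation.

```python
# Hypothetical sketch of the metacognitive self-assessment loop.
# The three reflection questions mirror the framework's dimensions;
# everything else (names, verdict format) is illustrative.

REFLECTION_PROMPTS = {
    "perspective_taking": "How would a teacher, parent, or regulator view this response?",
    "consequential_thinking": "What real-world harm could this response cause?",
    "normative_introspection": "Does this response align with core domain ethics?",
}

def build_self_assessment(draft_response: str, domain: str) -> str:
    """Compose a reflection prompt asking the model to audit its own draft."""
    questions = "\n".join(f"- {q}" for q in REFLECTION_PROMPTS.values())
    return (
        f"Domain: {domain}\n"
        f"Draft response:\n{draft_response}\n\n"
        "Before finalizing, reflect on the draft:\n"
        f"{questions}\n"
        "Reply with RISK or SAFE and a one-line justification."
    )

def parse_verdict(reflection: str) -> bool:
    """Return True when the model flags its own draft as risky."""
    return reflection.strip().upper().startswith("RISK")
```

In practice the reflection prompt would be sent back to the same model, and a `RISK` verdict would trigger the rule-retrieval and correction steps described next.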

### 2. Rule Evolution Cycle (REC)

  • Static Rule Tree: Expert-curated, hierarchical rules (e.g., Education → Academic Integrity → No Plagiarism).
  • Dynamic Rule Graph: Automatically generated from successful self-corrections (e.g., `<risk: essay outsourcing>` → `<rule: teach outlining instead>`).
  • Rules evolve via dual clustering (by risk type and mitigation strategy), enabling precise retrieval.
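The hybrid rule store above can be pictured as two structures side by side. The sketch below is a toy model under stated assumptions: the class name, rule texts, and flat-dict tree encoding are invented for illustration and do not reflect the released implementation.

```python
# Toy model of the hybrid rule system: an expert-curated static tree plus
# a dynamic graph of (risk type -> mitigation) edges mined from successful
# self-corrections. All names and rule strings here are illustrative.
from collections import defaultdict

# Static Rule Tree: domain -> category -> rules (expert-defined, fixed).
STATIC_RULE_TREE = {
    "education": {
        "academic_integrity": ["No plagiarism", "No essay outsourcing"],
    },
    "finance": {
        "risk_disclosure": ["Flag high-leverage speculation as risky"],
    },
}

class DynamicRuleGraph:
    """Self-evolved rules, clustered by risk type for precise retrieval."""

    def __init__(self):
        self._by_risk = defaultdict(list)  # risk type -> mitigation rules

    def add(self, risk_type: str, mitigation: str) -> None:
        # e.g. add("essay outsourcing", "teach outlining instead")
        self._by_risk[risk_type].append(mitigation)

    def retrieve(self, risk_type: str) -> list:
        return list(self._by_risk[risk_type])
```

A detected risk would first be matched against the dynamic graph (learned mitigations), falling back to the static tree's category rules.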

### 3. Robust Rule Vectors (RV) via Activation Steering

  • Generate steering vectors from contrasting compliant vs. non-compliant responses.
  • At inference, add vectors to internal activations (e.g., Layer 18 of Llama 3.1) to guide behavior.
  • No fine-tuning needed; works on closed-source models like GPT-5.
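The vector arithmetic behind the steps above is simple to state: a rule vector is the mean difference between activations from compliant and non-compliant responses, added (scaled) to the target layer's activations at inference. The pure-Python sketch below shows only that arithmetic; a real implementation would capture and modify hidden states of a transformer layer (e.g., via a forward hook), and all function names here are assumptions.

```python
# Toy illustration of rule-vector steering. Activations are plain lists of
# floats standing in for a layer's hidden state; real use would intercept
# e.g. Layer 18 of Llama 3.1 during the forward pass.

def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def rule_vector(compliant_acts, noncompliant_acts):
    """Contrast compliant vs. non-compliant activations to get a steering direction."""
    mc = mean_vec(compliant_acts)
    mn = mean_vec(noncompliant_acts)
    return [a - b for a, b in zip(mc, mn)]

def steer(activation, rv, alpha=1.0):
    """Shift one activation toward compliant behavior by a scaled rule vector."""
    return [a + alpha * r for a, r in zip(activation, rv)]
```

The scaling factor `alpha` (an illustrative name) trades off safety strength against fluency; too large a shift degrades generation quality.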

*Figure: The MENTOR framework (from the paper). Shell implements this full pipeline.*


## 📊 Results: Strong, Efficient, Generalizable

### Jailbreak Rate Reduction (3,000 queries per domain)

| Model | Original | + Shell (Rules + MetaLoop + RV) | Reduction |
|---|---|---|---|
| GPT-5 | 38.39% | 0.77% | 98.0% |
| Qwen3-235B | 56.33% | 3.13% | 94.4% |
| GPT-4o | 58.81% | 6.43% | 89.1% |
| Llama 3.1-8B | 67.45% | 31.39% | 53.5% |

✅ Human evaluators prefer Shell-augmented responses 68% of the time for safety, appropriateness, and usefulness.


## 🚀 Try It / Use It

### For Researchers

  • Dataset: 9,000 implicit-risk queries across 3 domains → [HF Dataset Link]
  • Code: Full implementation of REC + RV → [GitHub Link] (coming soon)
  • Cite:
    @article{shell2025,
      title={Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs},
      author={Wu, Wen and Ying, Zhenyu and He, Liang and Team, Shell},
      journal={Anonymous Submission},
      year={2025}
    }