---
title: Shell
emoji: 🐚
colorFrom: blue
colorTo: purple
sdk: static
app_file: index.html
pinned: false
---

# 🐚 Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs

Uncover and mitigate implicit value risks in education, finance, management, and beyond.

🔒 Model-agnostic · 🧠 Self-evolving rules · ⚡ Activation steering · 📉 90%+ jailbreak reduction


Shell is an open safety framework that empowers domain-specific LLMs to detect, reflect on, and correct implicit value misalignments without retraining. Built on the MENTOR architecture, it combines metacognitive self-assessment, dynamic rule evolution, and activation steering to deliver robust, interpretable, and efficient alignment across specialized verticals.


## 📌 Abstract

While current LLM safety methods focus on explicit harms (e.g., hate speech, violence), they often miss domain-specific implicit risks, such as encouraging academic dishonesty in education, promoting reckless trading in finance, or normalizing toxic workplace culture in management.

We introduce Shell, a metacognition-driven self-evolution framework that:

  • Enables LLMs to self-diagnose value misalignments via perspective-taking and consequence simulation.
  • Builds a hybrid rule system: expert-defined static trees + self-evolved dynamic graphs.
  • Enforces rules at inference time via activation steering, achieving strong safety with minimal compute.

Evaluated on 9,000 risk queries across education, finance, and management, Shell reduces average jailbreak rates by >90% on models including GPT-5, Qwen3, and Llama 3.1.


## 🎯 Core Challenges: Implicit Risks Are Everywhere

| Domain | Example Implicit Risk | Harmful Consequence |
|---|---|---|
| Education | Suggesting clever comebacks that escalate bullying | Deteriorates peer relationships |
| Education | Framing "sacrificing sleep for grades" as admirable | Promotes unhealthy competition |
| Education | Teaching how to "rephrase copied essays" | Undermines academic integrity |
| Finance | Encouraging high-leverage speculation as "smart risk" | Normalizes financial recklessness |
| Management | Praising "always-on" culture as "dedication" | Reinforces burnout and poor work-life balance |

💡 These risks are not jailbreaks in the traditional sense: they appear benign but subtly erode domain-specific values.


## 🧠 Methodology: The MENTOR Architecture

Shell implements the MENTOR framework (see paper for full details):

### 1. Metacognitive Self-Assessment

LLMs evaluate their own outputs using:

  • Perspective-taking: "How would a teacher/parent/regulator view this?"
  • Consequential thinking: "What real-world harm could this cause?"
  • Normative introspection: "Does this align with core domain ethics?"

This replaces labor-intensive human labeling with autonomous, human-aligned reflection.
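The self-assessment step above can be sketched in a few lines. This is a minimal, illustrative sketch only: the function names, prompt wording, and `RISK`/`SAFE` verdict convention are assumptions for the example, not the paper's actual implementation.

```python
# Hypothetical sketch of the metacognitive self-assessment loop.
# The three reflection questions mirror the framework's dimensions;
# everything else (names, verdict format) is illustrative.

REFLECTION_PROMPTS = {
    "perspective_taking": "How would a teacher, parent, or regulator view this response?",
    "consequential_thinking": "What real-world harm could this response cause?",
    "normative_introspection": "Does this response align with core domain ethics?",
}

def build_self_assessment(draft_response: str, domain: str) -> str:
    """Compose a reflection prompt asking the model to audit its own draft."""
    questions = "\n".join(f"- {q}" for q in REFLECTION_PROMPTS.values())
    return (
        f"Domain: {domain}\n"
        f"Draft response:\n{draft_response}\n\n"
        "Before finalizing, reflect on the draft:\n"
        f"{questions}\n"
        "Reply with RISK or SAFE and a one-line justification."
    )

def parse_verdict(reflection: str) -> bool:
    """Return True when the model flags its own draft as risky."""
    return reflection.strip().upper().startswith("RISK")
```

In practice the reflection prompt would be sent back to the same model, and a `RISK` verdict would trigger the rule-retrieval and correction steps described next.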

### 2. Rule Evolution Cycle (REC)

  • Static Rule Tree: Expert-curated, hierarchical rules (e.g., Education → Academic Integrity → No Plagiarism).
  • Dynamic Rule Graph: Automatically generated from successful self-corrections (e.g., `<risk: essay outsourcing>` → `<rule: teach outlining instead>`).
  • Rules evolve via dual clustering (by risk type and mitigation strategy), enabling precise retrieval.
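The hybrid rule store above can be pictured as two structures side by side. The sketch below is a toy model under stated assumptions: the class name, rule texts, and flat-dict tree encoding are invented for illustration and do not reflect the released implementation.

```python
# Toy model of the hybrid rule system: an expert-curated static tree plus
# a dynamic graph of (risk type -> mitigation) edges mined from successful
# self-corrections. All names and rule strings here are illustrative.
from collections import defaultdict

# Static Rule Tree: domain -> category -> rules (expert-defined, fixed).
STATIC_RULE_TREE = {
    "education": {
        "academic_integrity": ["No plagiarism", "No essay outsourcing"],
    },
    "finance": {
        "risk_disclosure": ["Flag high-leverage speculation as risky"],
    },
}

class DynamicRuleGraph:
    """Self-evolved rules, clustered by risk type for precise retrieval."""

    def __init__(self):
        self._by_risk = defaultdict(list)  # risk type -> mitigation rules

    def add(self, risk_type: str, mitigation: str) -> None:
        # e.g. add("essay outsourcing", "teach outlining instead")
        self._by_risk[risk_type].append(mitigation)

    def retrieve(self, risk_type: str) -> list:
        return list(self._by_risk[risk_type])
```

A detected risk would first be matched against the dynamic graph (learned mitigations), falling back to the static tree's category rules.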

### 3. Robust Rule Vectors (RV) via Activation Steering

  • Generate steering vectors from contrasting compliant vs. non-compliant responses.
  • At inference, add vectors to internal activations (e.g., Layer 18 of Llama 3.1) to guide behavior.
  • No fine-tuning needed; works on closed-source models like GPT-5.
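The vector arithmetic behind the steps above is simple to state: a rule vector is the mean difference between activations from compliant and non-compliant responses, added (scaled) to the target layer's activations at inference. The pure-Python sketch below shows only that arithmetic; a real implementation would capture and modify hidden states of a transformer layer (e.g., via a forward hook), and all function names here are assumptions.

```python
# Toy illustration of rule-vector steering. Activations are plain lists of
# floats standing in for a layer's hidden state; real use would intercept
# e.g. Layer 18 of Llama 3.1 during the forward pass.

def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def rule_vector(compliant_acts, noncompliant_acts):
    """Contrast compliant vs. non-compliant activations to get a steering direction."""
    mc = mean_vec(compliant_acts)
    mn = mean_vec(noncompliant_acts)
    return [a - b for a, b in zip(mc, mn)]

def steer(activation, rv, alpha=1.0):
    """Shift one activation toward compliant behavior by a scaled rule vector."""
    return [a + alpha * r for a, r in zip(activation, rv)]
```

The scaling factor `alpha` (an illustrative name) trades off safety strength against fluency; too large a shift degrades generation quality.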

*Figure: The MENTOR framework (from the paper). Shell implements this full pipeline.*


## 📊 Results: Strong, Efficient, Generalizable

### Jailbreak Rate Reduction (3,000 queries per domain)

| Model | Original | + Shell (Rules + MetaLoop + RV) | Reduction |
|---|---|---|---|
| GPT-5 | 38.39% | 0.77% | 98.0% |
| Qwen3-235B | 56.33% | 3.13% | 94.4% |
| GPT-4o | 58.81% | 6.43% | 89.1% |
| Llama 3.1-8B | 67.45% | 31.39% | 53.5% |

✅ Human evaluators prefer Shell-augmented responses 68% of the time for safety, appropriateness, and usefulness.


## 🚀 Try It / Use It

### For Researchers

  • Dataset: 9,000 implicit-risk queries across 3 domains → [HF Dataset Link]
  • Code: Full implementation of REC + RV → [GitHub Link] (coming soon)
  • Cite:
    @article{shell2025,
      title={Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs},
      author={Wu, Wen and Ying, Zhenyu and He, Liang and Team, Shell},
      journal={Anonymous Submission},
      year={2025}
    }