- SecureAI-Guard: Stateful POMDP for Autonomous Digital Defense
- Overview
- Key Features
- Project Structure
- Quick Start
- API Reference
- Observation Space
- Action Space
- Task Descriptions
- Reward Design
- Grading
- Episode Termination
- Inference Script
- Docker / HuggingFace Spaces Deployment
- HuggingFace Integration
- DPO Data Flywheel
- Baseline Results
- Environment Variables
- License
- Contributing
SecureAI-Guard: Stateful POMDP for Autonomous Digital Defense
Overview
SecureAI-Guard is a production-grade reinforcement learning environment that simulates an autonomous personal security assistant protecting users across SMS, Email, and Web channels. Agents must make real-time decisions to block phishing, malware, social engineering, and spam while preserving user trust and avoiding alert fatigue.
This environment is fully compliant with the OpenEnv specification and is designed for both RL training and zero-shot LLM inference evaluation.
Key Features
| Feature | Description |
|---|---|
| Stateful POMDP | Hidden state (user trust, system fatigue) affects observations and termination |
| Adversarial Drift | L3 adversary adapts its attack tactics mid-episode based on agent behaviour |
| Dense Rewards | Multi-component reward shaped at every step; no sparse end-of-episode signal |
| Deterministic | Fully reproducible with seed control |
| OpenEnv Compliant | Full reset(), step(), state() API + valid openenv.yaml |
| HF Integration | Optional DistilBERT risk scorer with keyword fallback |
| DPO Flywheel | Preference pairs logged every step for LLM alignment |
| SOC Dashboard | Real-time Gradio monitoring interface |
Project Structure
SecureAI-Guard/
├── app.py # FastAPI environment server (port 7860)
├── ui.py # Gradio SOC dashboard (port 7861)
├── inference.py # ⭐ Required baseline inference script
├── dqn_baseline.py # Dueling DQN training script
├── openenv.yaml # OpenEnv manifest
├── requirements.txt
├── Dockerfile
├── schema/
│   └── models.py # Pydantic v2 typed models
├── env/
│   ├── core.py # Threat generation + reward logic
│   └── engine.py # reset() / step() / state() engine
├── tasks/
│   └── registry.py # Three tasks (L1, L2, L3)
├── graders/
│   └── security_grader.py # Deterministic grader, score ∈ [0.0, 1.0]
└── utils/
    └── hf_integration.py # HuggingFace risk scorer + fallback
Quick Start
Prerequisites
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
1. Start the Environment Server
python app.py
# FastAPI running at http://localhost:7860
2. Run the Baseline Inference Script
export API_BASE_URL=http://localhost:7860
export MODEL_NAME=gpt-3.5-turbo # any OpenAI-compatible model
export OPENAI_API_KEY=sk-... # optional; uses rule-based fallback if absent
export HF_TOKEN=hf_... # optional
python inference.py
3. Launch the SOC Dashboard (optional)
python ui.py
# Gradio dashboard at http://localhost:7861
4. Train the DQN Agent (optional)
python dqn_baseline.py --episodes 500 --task basic_security
API Reference
All endpoints accept and return JSON. The server runs on port 7860.
POST /reset
Reset the environment and return the first observation.
Request:
{
"task_id": "basic_security",
"seed": 42
}
Response:
{
"observation": { ... },
"state": { ... },
"task_id": "basic_security"
}
POST /step
Execute one action and advance the environment.
Request:
{
"action": {
"decision": "block",
"confidence": 0.92,
"reasoning": "High-risk phishing link detected from unknown sender."
}
}
Response:
{
"observation": { ... },
"reward": {
"value": 0.48,
"components": {
"security": 1.0,
"user_friction": 0.0,
"delay": 0.0,
"reasoning_quality": 0.6,
"total": 0.56
},
"explanation": "security=1.00, friction=0.00, delay=0.00, reasoning=0.60"
},
"done": false,
"info": { "threat_type": "phishing", "step": 3 },
"state": { ... }
}
GET /state
Return the current environment state without advancing.
GET /tasks
List all available tasks.
GET /health
Health check. Returns {"status": "healthy"}.
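The endpoints above can be driven from a short Python loop. This is a minimal sketch using only the standard library (`urllib.request`), assuming the request and response shapes shown in the examples (`observation`, `reward.value`, `done`). The helper names `post_json`, `make_step_payload`, and `run_episode` are hypothetical and are not part of the repository.

```python
import json
import urllib.request
from typing import Callable, Dict, Tuple

# A policy maps an observation dict to (decision, confidence, reasoning).
Policy = Callable[[dict], Tuple[str, float, str]]


def post_json(base_url: str, path: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST a JSON body to the environment server and decode the JSON reply."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def make_step_payload(decision: str, confidence: float, reasoning: str) -> Dict[str, dict]:
    """Wrap an action in the {"action": {...}} envelope expected by POST /step."""
    return {"action": {"decision": decision, "confidence": confidence, "reasoning": reasoning}}


def run_episode(base_url: str, task_id: str, seed: int, policy: Policy) -> float:
    """Reset, then step with `policy` until the server reports done; return the summed reward."""
    result = post_json(base_url, "/reset", {"task_id": task_id, "seed": seed})
    observation = result["observation"]
    total_reward = 0.0
    done = False
    while not done:
        decision, confidence, reasoning = policy(observation)
        result = post_json(base_url, "/step", make_step_payload(decision, confidence, reasoning))
        total_reward += result["reward"]["value"]
        observation = result["observation"]
        done = result["done"]
    return total_reward
```

With the server from step 1 running, `run_episode("http://localhost:7860", "basic_security", 42, lambda obs: ("allow", 0.5, "No risk indicators found."))` plays one episode with a trivial always-allow policy.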
Observation Space
| Field | Type | Range | Description |
|---|---|---|---|
| `event_id` | string | n/a | Unique UUID per event |
| `channel` | enum | `sms`, `email`, `web` | Message delivery channel |
| `sender` | string | n/a | Sender identifier |
| `content` | string | n/a | Raw message text |
| `timestamp` | float | Unix timestamp | Arrival time |
| `hf_risk_score` | float | [0.0, 1.0] | HuggingFace classifier risk signal |
| `user_trust` | float | [0.0, 100.0] | Running user trust level |
| `system_fatigue` | float | [0.0, 100.0] | Alert fatigue accumulator |
| `threat_history` | list | n/a | Last 5 events, for context |
| `metadata` | object | n/a | Step, difficulty, and event type |
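A numeric agent such as a DQN needs these fields as a fixed-length vector. The sketch below shows one possible encoding, assuming the field names and ranges in the table; it is not the encoding used by `dqn_baseline.py`, which is not reproduced here, and it ignores the free-text fields and `threat_history`. Values are clamped because the sample log in the Inference Script section shows `trust=101.0`, slightly above the documented range.

```python
from typing import List

CHANNELS = ("sms", "email", "web")


def _unit(value: float, scale: float) -> float:
    """Rescale to [0, 1] and clamp, in case a value drifts outside its documented range."""
    return max(0.0, min(1.0, float(value) / scale))


def encode_observation(obs: dict) -> List[float]:
    """Turn the categorical and numeric observation fields into a fixed-length vector.

    Free-text fields (sender, content) and threat_history are ignored in this sketch.
    """
    channel_one_hot = [1.0 if obs["channel"] == name else 0.0 for name in CHANNELS]
    return channel_one_hot + [
        _unit(obs["hf_risk_score"], 1.0),      # documented range [0.0, 1.0]
        _unit(obs["user_trust"], 100.0),       # documented range [0.0, 100.0]
        _unit(obs["system_fatigue"], 100.0),   # documented range [0.0, 100.0]
    ]
```

For an email with `hf_risk_score=0.8`, `user_trust=90.0`, and `system_fatigue=20.0`, this yields `[0.0, 1.0, 0.0, 0.8, 0.9, 0.2]`.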
Action Space
| Field | Type | Description |
|---|---|---|
| `decision` | enum | `allow` / `block` / `warn` / `investigate` |
| `confidence` | float in [0.0, 1.0] | Agent's confidence in its decision |
| `reasoning` | string | Human-readable explanation (required, non-empty) |
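The repository defines its typed models with Pydantic v2 in `schema/models.py`, which is not reproduced here. As a dependency-free stand-in, the constraints in the table can be enforced with a standard-library dataclass; the `Decision` and `Action` names below are illustrative, not the repository's classes.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(str, Enum):
    ALLOW = "allow"
    BLOCK = "block"
    WARN = "warn"
    INVESTIGATE = "investigate"


@dataclass(frozen=True)
class Action:
    decision: Decision
    confidence: float
    reasoning: str

    def __post_init__(self) -> None:
        # Accept plain strings such as "block"; Decision(...) raises ValueError for unknown values.
        object.__setattr__(self, "decision", Decision(self.decision))
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0.0, 1.0]")
        if not self.reasoning.strip():
            raise ValueError("reasoning must be a non-empty string")

    def to_json(self) -> dict:
        """Serialise to the shape used inside the POST /step "action" field."""
        return {
            "decision": self.decision.value,
            "confidence": self.confidence,
            "reasoning": self.reasoning,
        }
```

Validating on construction means a malformed action fails locally, before a request ever reaches the server.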
Task Descriptions
L1: Basic Security Screening (basic_security)
- Max steps: 50 | Success threshold: 0.80
- Phishing and spam only; no adversarial drift.
- Ideal entry point: clear-cut threats with a strong reward signal.
L2: Trust Management Challenge (trust_management)
- Max steps: 75 | Success threshold: 0.75
- All threat types are active. False positives incur a 1.5× trust penalty.
- Agents must learn to tolerate ambiguity without over-blocking.
L3: Advanced Adversary Challenge (adversarial_drift)
- Max steps: 100 | Success threshold: 0.70
- Adaptive attacker: after step 20, it switches tactics based on the agent's blocking rate.
- Agents that over-block phishing will instead face a surge of social-engineering attacks.
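The L3 drift rule is described only in outline. The sketch below shows one way such an adaptive attacker could work. Only the step-20 cutoff comes from the task description; the base threat mix, the 0.8 over-blocking threshold, and the 75% shift toward social engineering are hypothetical values chosen for illustration.

```python
import random
from typing import Dict

DRIFT_START_STEP = 20        # from the task description: drift begins after step 20
OVERBLOCK_THRESHOLD = 0.8    # hypothetical: phishing block rate that triggers a tactic switch
SHIFT_FRACTION = 0.75        # hypothetical: share of phishing mass moved to social engineering

# Hypothetical base mix of event types (sums to 1.0).
BASE_WEIGHTS: Dict[str, float] = {
    "safe": 0.3,
    "phishing": 0.3,
    "social_engineering": 0.15,
    "malware": 0.15,
    "spam": 0.1,
}


def threat_weights(step: int, phishing_seen: int, phishing_blocked: int) -> Dict[str, float]:
    """Return sampling weights, shifting away from phishing if the agent blocks most of it."""
    weights = dict(BASE_WEIGHTS)
    if step <= DRIFT_START_STEP or phishing_seen == 0:
        return weights
    if phishing_blocked / phishing_seen < OVERBLOCK_THRESHOLD:
        return weights
    moved = weights["phishing"] * SHIFT_FRACTION
    weights["phishing"] -= moved
    weights["social_engineering"] += moved
    return weights


def sample_threat(rng: random.Random, weights: Dict[str, float]) -> str:
    """Draw one event type; a seeded Random keeps episodes reproducible."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

Moving probability mass rather than adding new mass keeps the weights summing to 1.0, so the overall event rate stays constant while the attack mix changes.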
Reward Design
Formula
R_step = (0.5·security + 0.3·user_friction + 0.1·delay + 0.1·reasoning) × (0.7 + 0.3·confidence)
Components
| Component | Range | Calculation |
|---|---|---|
| `security` | [-1.0, +1.0] | +1.0 correct block; -1.0 missed threat; +0.5 safe allow; -0.8 false positive |
| `user_friction` | [-0.5, 0.0] | -0.2 per warning; -0.1 per investigate; -0.5 for a false-positive block |
| `delay` | [-0.1, 0.0] | -0.1 for investigate actions |
| `reasoning_quality` | [0.0, 1.0] | Keyword match against threat-specific vocabulary |
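The formula can be checked with a few lines of Python. This sketch applies only the weighting and confidence scaling documented above to component values you supply; how `env/core.py` computes each component is not reproduced, and `step_reward` is a hypothetical name.

```python
WEIGHTS = {"security": 0.5, "user_friction": 0.3, "delay": 0.1, "reasoning_quality": 0.1}


def step_reward(security: float, user_friction: float, delay: float,
                reasoning_quality: float, confidence: float) -> float:
    """Weighted component sum scaled by a confidence multiplier in [0.7, 1.0]."""
    weighted = (
        WEIGHTS["security"] * security
        + WEIGHTS["user_friction"] * user_friction
        + WEIGHTS["delay"] * delay
        + WEIGHTS["reasoning_quality"] * reasoning_quality
    )
    return weighted * (0.7 + 0.3 * confidence)


# A correct but hesitant block versus a confident false-positive block.
hesitant_correct = step_reward(1.0, 0.0, 0.0, 0.6, confidence=0.3)   # 0.56 * 0.79
confident_wrong = step_reward(-0.8, -0.5, 0.0, 0.2, confidence=0.9)  # -0.53 * 0.97
```

Here the hesitant correct block earns about 0.442, while the confident false positive costs about -0.514.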
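Why Dense?
Every step yields a non-zero reward signal, enabling stable gradient estimates for both RL and LLM policy optimisation. Partial credit comes from the confidence scaling factor: the multiplier (0.7 + 0.3·confidence) ranges from 0.7 to 1.0 and scales the whole weighted sum, so confidence amplifies both the reward for a correct decision and the penalty for a wrong one. An uncertain correct answer therefore still scores higher than a confident wrong one.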
Grading
The SecurityGrader produces a deterministic score in [0.0, 1.0]:
score = 0.40 × security_efficiency
+ 0.30 × user_retention
+ 0.20 × precision
+ 0.10 × reasoning_quality
| Metric | Formula |
|---|---|
| `security_efficiency` | blocked_threats / total_threats |
| `user_retention` | final_user_trust / 100 |
| `precision` | 1 - false_positive_rate |
| `reasoning_quality` | Mean of the per-step reasoning component across the episode |
Letter Grades
| Score | Grade |
|---|---|
| ≥ 0.90 | A+ |
| ≥ 0.80 | A |
| ≥ 0.70 | B |
| ≥ 0.60 | C |
| ≥ 0.50 | D |
| < 0.50 | F |
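The grader arithmetic and grade bands fit in two small functions. This is a sketch of the documented formula, not the repository's `SecurityGrader` class. Treating a threat-free episode as fully efficient and clamping trust to 100 (so the score stays in [0.0, 1.0]) are assumptions.

```python
from typing import Sequence

GRADE_BANDS = [(0.90, "A+"), (0.80, "A"), (0.70, "B"), (0.60, "C"), (0.50, "D")]


def episode_score(blocked_threats: int, total_threats: int, final_user_trust: float,
                  false_positive_rate: float, reasoning_scores: Sequence[float]) -> float:
    """Weighted grader score from the four documented metrics."""
    # Assumption: an episode with no threats counts as fully efficient.
    security_efficiency = blocked_threats / total_threats if total_threats else 1.0
    # Clamped so the score stays in [0.0, 1.0] even if trust ends above 100.
    user_retention = max(0.0, min(1.0, final_user_trust / 100.0))
    precision = 1.0 - false_positive_rate
    reasoning_quality = sum(reasoning_scores) / len(reasoning_scores) if reasoning_scores else 0.0
    return (0.40 * security_efficiency + 0.30 * user_retention
            + 0.20 * precision + 0.10 * reasoning_quality)


def letter_grade(score: float) -> str:
    """Map a score to the first band whose lower bound it meets."""
    for threshold, grade in GRADE_BANDS:
        if score >= threshold:
            return grade
    return "F"
```

For example, 8 of 10 threats blocked, final trust 90, a 10% false-positive rate, and reasoning scores `[0.6, 0.8]` give a score of about 0.84, which maps to grade A.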
Episode Termination
An episode ends when any of the following conditions is met:
- `user_trust ≤ 0`: the user has uninstalled the assistant after too many false positives.
- `system_fatigue ≥ 100`: the user ignores all alerts (warning overload).
- `step_count ≥ max_steps`: the episode length limit has been reached.
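The three conditions combine into a single check. In this sketch the per-task step limits come from the task descriptions; the reason labels and the order in which conditions are tested (which matters only when several hold at once) are hypothetical choices.

```python
from typing import Optional

# Per-task step limits from the task descriptions.
MAX_STEPS = {"basic_security": 50, "trust_management": 75, "adversarial_drift": 100}


def termination_reason(user_trust: float, system_fatigue: float,
                       step_count: int, task_id: str) -> Optional[str]:
    """Return why the episode ended, or None if it should continue."""
    if user_trust <= 0:
        return "trust_exhausted"    # user uninstalled the assistant
    if system_fatigue >= 100:
        return "fatigue_overload"   # user now ignores every alert
    if step_count >= MAX_STEPS[task_id]:
        return "max_steps"          # episode length limit
    return None
```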
Inference Script
inference.py is the required OpenEnv baseline script. It:
- Reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from environment variables
- Uses the OpenAI client for LLM inference (with a deterministic keyword fallback when no API key is set)
- Runs all three tasks sequentially
- Produces reproducible results via `SEED_BASE`
- Logs in the required format:
[START] task=basic_security episode=1 seed=43 model=gpt-3.5-turbo api=http://localhost:7860
[STEP] step=1 decision=block confidence=0.92 reward=0.4830 trust=101.0 fatigue=0.0 threat=phishing
[STEP] step=2 decision=allow confidence=0.88 reward=0.3150 trust=101.2 fatigue=0.0 threat=safe
...
[END] task=basic_security episode=1 steps=50 total_reward=18.4200 score=0.7841 grade=B
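Because every line is a bracketed tag followed by space-separated `key=value` pairs, results can be collected from these logs with a small parser. This sketch assumes the format shown in the sample lines, in particular that values never contain spaces; `parse_log_line` is a hypothetical helper, not part of `inference.py`.

```python
import re
from typing import Dict, Optional, Tuple, Union

Value = Union[int, float, str]
LOG_LINE = re.compile(r"^\[(START|STEP|END)\]\s+(.*)$")


def _coerce(raw: str) -> Value:
    """Convert numeric fields to int or float and leave everything else as a string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_log_line(line: str) -> Optional[Tuple[str, Dict[str, Value]]]:
    """Split a [START]/[STEP]/[END] line into its tag and key=value fields."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    tag, rest = match.groups()
    fields: Dict[str, Value] = {}
    for token in rest.split():
        if "=" in token:
            key, value = token.split("=", 1)  # split once so URLs and similar values survive
            fields[key] = _coerce(value)
    return tag, fields
```

Filtering for `END` records and reading their `score` and `grade` fields gives a per-task summary across runs.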
Docker / HuggingFace Spaces Deployment
Build and run locally
docker build -t secureai-guard .
docker run -p 7860:7860 secureai-guard
HuggingFace Spaces
- Create a new Space (Docker SDK)
- Push this repository
- The `Dockerfile` exposes port 7860, which HF Spaces maps automatically
- Set optional secrets: `HF_TOKEN`, `OPENAI_API_KEY`
Resource requirements
- CPU: 2 vCPU (no GPU required; HF model loading is optional)
- RAM: 4-8 GB (8 GB recommended with transformers loaded)
- Startup time: ~15 seconds
HuggingFace Integration
utils/hf_integration.py loads a text-classification pipeline for real-time risk scoring.
- Default model: `distilbert-base-uncased-finetuned-sst-2-english`
- Override: set the `HF_RISK_MODEL` environment variable
- Fallback: if the model is unavailable, a deterministic keyword scorer activates automatically, so the environment works fully offline
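The pipeline-with-fallback pattern can be sketched as follows. `transformers.pipeline("text-classification", model=...)` is the library's standard entry point; everything else is an assumption. The keyword list and weights are hypothetical, and because the default SST-2 checkpoint is a sentiment classifier with `POSITIVE`/`NEGATIVE` labels, treating the `NEGATIVE` probability as risk is an illustrative mapping, not necessarily the one `utils/hf_integration.py` uses.

```python
import os
from functools import lru_cache

DEFAULT_MODEL = os.environ.get("HF_RISK_MODEL", "distilbert-base-uncased-finetuned-sst-2-english")

# Hypothetical fallback vocabulary and weights; the repository's list is not shown.
RISK_KEYWORDS = {
    "password": 0.3,
    "verify": 0.2,
    "urgent": 0.2,
    "click": 0.15,
    "gift card": 0.3,
    "wire transfer": 0.3,
}


def keyword_risk_score(text: str) -> float:
    """Deterministic offline scorer: sum matched keyword weights, capped at 1.0."""
    lowered = text.lower()
    return min(1.0, sum(weight for keyword, weight in RISK_KEYWORDS.items() if keyword in lowered))


@lru_cache(maxsize=None)
def _load_classifier(model_name: str):
    """Load the pipeline once per model; return None if transformers or the model is unavailable."""
    try:
        from transformers import pipeline
        return pipeline("text-classification", model=model_name)
    except Exception:
        return None


def risk_score(text: str, model_name: str = DEFAULT_MODEL) -> float:
    """Score with the HF pipeline when available, otherwise fall back to keywords."""
    classifier = _load_classifier(model_name)
    if classifier is None:
        return keyword_risk_score(text)
    top = classifier(text)[0]
    # Assumption: for an SST-2 sentiment model, NEGATIVE probability stands in for risk.
    probability = float(top["score"])
    return probability if top["label"].upper() == "NEGATIVE" else 1.0 - probability
```

Caching the loader means the model is downloaded and initialised at most once per process, and the keyword path keeps scores reproducible when running offline.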
DPO Data Flywheel
Every step logs a PreferencePair:
- chosen_action: the action taken this step
- rejected_actions: the previous step's action
- reward_delta: improvement in reward
Retrieve via GET /preference_data. This data can be used directly for Direct Preference Optimisation (DPO) fine-tuning of LLM agents.
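Before DPO training, pairs usually need filtering and flattening into prompt/chosen/rejected records. This sketch assumes `GET /preference_data` returns a list of objects with the three fields listed above, where `rejected_actions` may be a single action or a list. The `observation` field used as the prompt is hypothetical, and keeping only pairs with a positive `reward_delta` is one possible filter rather than the project's pipeline.

```python
import json
from typing import Dict, Iterable, List


def to_dpo_records(pairs: Iterable[dict], min_delta: float = 0.0) -> List[Dict[str, str]]:
    """Flatten preference pairs into prompt/chosen/rejected string records."""
    records = []
    for pair in pairs:
        if pair.get("reward_delta", 0.0) <= min_delta:
            continue  # skip pairs where the chosen action did not improve reward
        rejected = pair["rejected_actions"]
        if isinstance(rejected, dict):
            rejected = [rejected]
        prompt = json.dumps(pair.get("observation", {}), sort_keys=True)  # hypothetical field
        chosen = json.dumps(pair["chosen_action"], sort_keys=True)
        for action in rejected:
            records.append({"prompt": prompt, "chosen": chosen,
                            "rejected": json.dumps(action, sort_keys=True)})
    return records


def write_jsonl(records: Iterable[dict], path: str) -> None:
    """Write one JSON object per line, a common layout for preference datasets."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
```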
Baseline Results
Rule-based agent (keyword heuristics, no LLM):
| Task | Avg Score | Avg Reward | Grade |
|---|---|---|---|
| basic_security | 0.74 | 14.2 | B |
| trust_management | 0.61 | 11.8 | C |
| adversarial_drift | 0.52 | 9.1 | D |
DQN agent (500 episodes training):
| Task | Avg Score | Avg Reward | Grade |
|---|---|---|---|
| basic_security | 0.83 | 18.9 | A |
| trust_management | 0.76 | 16.3 | B |
| adversarial_drift | 0.71 | 14.7 | B |
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `API_BASE_URL` | `http://localhost:7860` | Environment server URL |
| `MODEL_NAME` | `gpt-3.5-turbo` | LLM model name |
| `HF_TOKEN` | (none) | HuggingFace token |
| `OPENAI_API_KEY` | (none) | OpenAI API key |
| `OPENAI_BASE_URL` | `https://api.openai.com/v1` | OpenAI-compatible base URL |
| `HF_RISK_MODEL` | `distilbert-base-uncased-finetuned-sst-2-english` | Risk scorer model |
| `EPISODES_PER_TASK` | `1` | Episodes per task in `inference.py` |
| `SEED_BASE` | `42` | Base seed for reproducibility |
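The table maps directly onto a typed configuration object. This is a sketch, not `inference.py`'s own loader. It treats empty strings as unset, and `episode_seed` assumes seed = `SEED_BASE` + episode, which matches the sample log (`episode=1 seed=43`) under the default `SEED_BASE` but is an inference rather than documented behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: str
    model_name: str
    hf_token: Optional[str]
    openai_api_key: Optional[str]
    openai_base_url: str
    hf_risk_model: str
    episodes_per_task: int
    seed_base: int

    @property
    def use_llm(self) -> bool:
        """Without an API key, inference falls back to the rule-based policy."""
        return bool(self.openai_api_key)

    def episode_seed(self, episode: int) -> int:
        # Assumption inferred from the sample log (default SEED_BASE=42, episode=1, seed=43).
        return self.seed_base + episode


def load_config(env: Optional[Mapping[str, str]] = None) -> InferenceConfig:
    """Read the documented variables, applying the defaults from the table."""
    env = os.environ if env is None else env
    return InferenceConfig(
        api_base_url=env.get("API_BASE_URL", "http://localhost:7860"),
        model_name=env.get("MODEL_NAME", "gpt-3.5-turbo"),
        hf_token=env.get("HF_TOKEN") or None,
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        openai_base_url=env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        hf_risk_model=env.get("HF_RISK_MODEL", "distilbert-base-uncased-finetuned-sst-2-english"),
        episodes_per_task=int(env.get("EPISODES_PER_TASK", "1")),
        seed_base=int(env.get("SEED_BASE", "42")),
    )
```

Passing an explicit mapping instead of reading `os.environ` makes the loader easy to test in isolation.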
License
MIT License. See LICENSE for details.
Contributing
- Fork the repository
- Create a feature branch (`git checkout -b feature/my-feature`)
- Commit your changes
- Submit a pull request
SecureAI-Guard: Where Reinforcement Learning Meets Cybersecurity Excellence