---
title: Adaptive Alert Triage & Incident Response
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
sdk_version: latest
python_version: '3.11'
pinned: false
app_port: 7860
---
# Adaptive Alert Triage & Incident Response Environment (OpenEnv)

**Version:** 0.1.0 | **Framework:** OpenEnv | **Status:** Alpha
## Overview
An OpenEnv-compliant reinforcement learning environment that simulates real-time IT alert triage and incident response. Agents must intelligently prioritize alerts under resource constraints while preventing cascading system failures in a partially observable, dynamic environment.
## Why RL Over Rule-Based Systems?
| Challenge | Rule-Based Limitation | RL Advantage |
|---|---|---|
| Dynamic Patterns | Static thresholds fail as alert patterns evolve | Learns from feedback, adapts to changing distributions |
| Context Awareness | Cannot capture alert correlations or temporal dependencies | Discovers hidden relationships through experience |
| Resource Optimization | Fixed allocation ignores varying system states | Optimizes action selection under real-time constraints |
| False Positive Handling | Uniform treatment leads to alert fatigue | Learns nuanced confidence signals and noise patterns |
| Cascading Failures | Reactive approach misses early warning signs | Proactive detection through predictive state modeling |
## Environment Specification

### State Space (Partial Observability)

**Visible Features:**

- `alerts`: List of active alerts with:
  - `id`: Unique alert identifier
  - `visible_severity`: Noisy severity score (0.0-1.0)
  - `confidence`: Detection confidence (0.0-1.0)
  - `alert_type`: Category (CPU, MEMORY, DISK, NETWORK, APPLICATION, SECURITY)
  - `age`: Time steps since alert generation
- `system_load`: Current system resource utilization (0.0-1.0)
- `queue_length`: Number of unprocessed alerts
- `time_remaining`: Steps left in episode

**Hidden Features** (ground truth for reward computation):

- `true_severity`: Actual criticality of each alert
- `correlations`: Alert dependency graph
- `future_failures`: Predicted cascading failure probabilities
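For intuition, the visible observation could be modeled like the minimal sketch below. It uses stdlib dataclasses standing in for the Pydantic models in `src/adaptive_alert_triage/models.py`; the field names follow the list above, but the exact class shapes are an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    id: str
    visible_severity: float   # noisy severity, 0.0-1.0
    confidence: float         # detection confidence, 0.0-1.0
    alert_type: str           # CPU, MEMORY, DISK, NETWORK, APPLICATION, SECURITY
    age: int                  # time steps since the alert was generated

@dataclass
class Observation:
    alerts: list[Alert] = field(default_factory=list)
    system_load: float = 0.0  # 0.0-1.0
    queue_length: int = 0
    time_remaining: int = 0

# A single-alert observation early in an episode
obs = Observation(
    alerts=[Alert(id="A1", visible_severity=0.9, confidence=0.8,
                  alert_type="CPU", age=2)],
    system_load=0.45,
    queue_length=1,
    time_remaining=30,
)
```

The hidden features (`true_severity`, `correlations`, `future_failures`) deliberately do not appear here: the agent never observes them, so they belong to the environment's internal state only.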
### Action Space
Per alert, the agent can execute:
- INVESTIGATE: Allocate resources to diagnose (costly but resolves critical issues)
- IGNORE: Mark as noise (efficient for false positives)
- ESCALATE: Route to specialist team (high-confidence critical alerts)
- DELAY: Defer to next time step (queue management)
**Resource Constraints:** Maximum K investigations per time step (task-dependent).
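A minimal sketch of how the per-step investigation budget could be checked. The `ActionType` values come from the list above; `within_budget` is a hypothetical helper for illustration, not part of the package:

```python
from enum import Enum

class ActionType(Enum):
    INVESTIGATE = "INVESTIGATE"
    IGNORE = "IGNORE"
    ESCALATE = "ESCALATE"
    DELAY = "DELAY"

def within_budget(actions: list[ActionType], k: int) -> bool:
    """Enforce the resource constraint: at most k INVESTIGATE actions per step."""
    n_investigations = sum(1 for a in actions if a is ActionType.INVESTIGATE)
    return n_investigations <= k

# Two investigations planned this step
plan = [ActionType.INVESTIGATE, ActionType.IGNORE,
        ActionType.INVESTIGATE, ActionType.DELAY]

ok_k2 = within_budget(plan, k=2)   # fits a budget of 2
ok_k1 = within_budget(plan, k=1)   # exceeds a budget of 1
```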
### Reward Structure

```text
+10  # Critical alert correctly investigated
 +5  # Cascading failure prevented through correlation detection
 +3  # False positive correctly ignored
 -2  # Unnecessary investigation (resource waste)
 -8  # Missed critical alert
-10  # System failure due to ignored critical issue
```
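As a rough illustration, the core of this table reduces to a lookup on the action and the hidden ground truth. The `reward_for` helper below is hypothetical: it omits the +5 correlation bonus and the ESCALATE/DELAY cases, and the real shaping lives in `rewards/reward.py`:

```python
def reward_for(action: str, is_critical: bool, caused_failure: bool = False) -> int:
    """Map (action, ground truth) to a scalar reward, per the table above."""
    if action == "INVESTIGATE":
        # +10 for a genuine critical alert, -2 for wasted resources on noise
        return 10 if is_critical else -2
    if action == "IGNORE":
        if is_critical:
            # -10 if ignoring it triggered a system failure, else -8 for the miss
            return -10 if caused_failure else -8
        return 3  # false positive correctly ignored
    return 0  # ESCALATE/DELAY: handled by fuller logic in the real environment
```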
### Episode Dynamics
- Length: 20-50 time steps (task-dependent)
- Termination: Max steps reached OR failure threshold exceeded
- Alert Generation: Continuous stochastic process with temporal correlation
- Failure Mechanics: Ignored critical alerts accumulate damage, triggering cascading failures
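One plausible shape for the failure mechanics is a damage counter that grows with each ignored critical alert and ends the episode when it crosses a threshold. The threshold, rate, and `accumulate_damage` helper below are all illustrative assumptions, not the environment's actual constants:

```python
FAILURE_THRESHOLD = 1.0   # assumed; the real value is task-dependent

def accumulate_damage(damage: float, ignored_critical: int, rate: float = 0.2) -> float:
    """Each ignored critical alert adds `rate` damage in the current step."""
    return damage + rate * ignored_critical

damage = 0.0
steps_survived = 0
for ignored in [1, 0, 2, 3]:          # ignored critical alerts per step
    damage = accumulate_damage(damage, ignored)
    steps_survived += 1
    if damage >= FAILURE_THRESHOLD:
        break                          # cascading failure: episode terminates early
```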
## Tasks

### 1. Easy: Basic Alert Prioritization

- **Objective:** Correctly classify and handle alerts based on visible signals.
- **Success Criteria:** ≥70% correct action rate
- **Key Challenge:** Distinguish genuine critical alerts from noise
- **Grading:** `correct_actions / total_actions`

### 2. Medium: Resource-Constrained Triage

- **Objective:** Optimize triage under strict investigation limits.
- **Success Criteria:** ≥65% weighted efficiency score
- **Key Challenge:** Maximize critical alert resolution with limited resources
- **Grading:** `weighted_resolved_alerts * resource_efficiency`

### 3. Hard: Cascading Failure Prevention

- **Objective:** Detect correlated alerts and prevent future failures.
- **Success Criteria:** ≥60% score with stability requirements
- **Key Challenge:** Infer hidden correlations and predict failure chains
- **Grading:** `(prevented_failures - system_instability_penalty) / max_possible`
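The easy-task grade and its success check reduce to a one-liner; this sketch uses hypothetical names, with the actual grader living in `tasks/easy.py`:

```python
def grade_easy(correct_actions: int, total_actions: int) -> float:
    """Easy-task grade: fraction of correctly handled alerts."""
    return correct_actions / total_actions if total_actions else 0.0

# Example episode: 42 of 50 alerts handled correctly
score = grade_easy(correct_actions=42, total_actions=50)
passed = score >= 0.70   # the ≥70% success criterion for the easy task
```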
## Installation

### Local Setup

```bash
# Clone repository
git clone https://github.com/scalar/adaptive-alert-triage.git
cd adaptive-alert-triage

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install package in editable mode
pip install -e .
```
### Docker Setup

```bash
# Build Docker image
docker build -t adaptive-alert-triage:latest .

# Run validation
docker run --rm adaptive-alert-triage:latest

# Run evaluation with OpenAI API key
docker run --rm -e OPENAI_API_KEY=your_key adaptive-alert-triage:latest python evaluation/evaluate.py
```
## Usage

### Quick Start

```python
from adaptive_alert_triage.env import AdaptiveAlertTriageEnv
from adaptive_alert_triage.models import Action

# Initialize environment with easy task
env = AdaptiveAlertTriageEnv(task_id="easy")

# Reset environment
observation = env.reset()

# Run episode
done = False
total_reward = 0
while not done:
    # Example: investigate the first alert
    action = Action(
        alert_id=observation.alerts[0].id,
        action_type="INVESTIGATE",
    )
    observation, reward, done, info = env.step(action)
    total_reward += reward.value

print(f"Episode reward: {total_reward}")
print(f"Task score: {info['task_score']}")
```
### Running Baseline Agents

```bash
# Rule-based baseline
python agents/baseline.py --task easy

# OpenAI inference baseline (requires OPENAI_API_KEY)
export OPENAI_API_KEY=your_key_here
python agents/inference.py --task medium
```
## Evaluation

```bash
# Run all baselines on all tasks
python evaluation/evaluate.py

# Generate comparison plots
python evaluation/plots.py
```
## Testing

```bash
# Run all tests
pytest tests/

# Run with coverage
pytest --cov=src/adaptive_alert_triage tests/

# Run specific test file
pytest tests/test_env.py -v
```
## Docker + RL Server

The environment includes a production-ready FastAPI server for remote RL training.

### Architecture

```text
External World (Datadog/Kafka) ──POST /ingest/alerts──> Docker (FastAPI Server)
                                                              │
                                                              │  Internal: AdaptiveAlertTriageEnv
                                                              │  (real + synthetic alerts)
                                                              │
External RL Trainer (SB3) ──/env/reset──────────────────────> │ <──/env/step(action)── Obs/Reward/Done
                                                              │
                                                              ▼
                                          RL beats baselines! (0.61 → 0.82+)
```
### Quick Start

```bash
# 1. Build and run the persistent RL server
docker compose up --build -d

# 2. Verify server health
curl http://localhost:8000/health

# 3. Send real alerts (simulate Datadog webhook)
bash scripts/demo_webhook.sh

# 4. Train external RL agent
pip install stable-baselines3
python train_external.py

# 5. View metrics
curl http://localhost:8000/metrics
```
### API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check (env_ready, queue_size) |
| `/metrics` | GET | RL score vs baseline comparison |
| `/ingest/alerts` | POST | Webhook receiver for Datadog/Kafka |
| `/env/reset/{task_id}` | POST | Initialize episode (easy/medium/hard) |
| `/env/step` | POST | Take RL action, receive obs/reward/done |
| `/env/state` | GET | Debug: current episode state |
| `/tasks` | GET | List available tasks |
| `/ws/train` | WS | Real-time streaming RL loop |
### WebSocket Training

```python
import asyncio
import json

import websockets

async def train():
    async with websockets.connect("ws://localhost:8000/ws/train") as ws:
        # Reset the environment for the hard task
        await ws.send(json.dumps({"type": "reset", "task_id": "hard"}))
        obs = await ws.recv()

        # Step loop
        while True:
            await ws.send(json.dumps({
                "type": "step",
                "action": {"alert_id": "A1", "action_type": "INVESTIGATE"},
            }))
            result = await ws.recv()
            if json.loads(result)["done"]:
                break

asyncio.run(train())
```
## Project Structure

```text
adaptive_alert_triage_openenv/
├── README.md                      # This file
├── pyproject.toml                 # Project metadata and dependencies
├── openenv.yaml                   # OpenEnv specification
├── Dockerfile                     # Container build instructions
├── requirements.txt               # Python dependencies
│
├── src/adaptive_alert_triage/     # Core environment implementation
│   ├── __init__.py
│   ├── env.py                     # Main Gym environment
│   ├── models.py                  # Pydantic Observation/Action/Reward models
│   └── utils.py                   # Helper functions
│
├── tasks/                         # Task definitions and graders
│   ├── easy.py                    # Basic prioritization
│   ├── medium.py                  # Resource-constrained triage
│   └── hard.py                    # Cascading failure prevention
│
├── rewards/                       # Reward shaping logic
│   └── reward.py
│
├── agents/                        # Baseline and example agents
│   ├── baseline.py                # Rule-based threshold agent
│   └── inference.py               # OpenAI API baseline
│
├── tests/                         # Unit and integration tests
│   ├── test_env.py
│   ├── test_tasks.py
│   └── test_rewards.py
│
├── evaluation/                    # Performance analysis
│   ├── evaluate.py                # Run benchmarks
│   └── plots.py                   # Generate comparison charts
│
└── docker/                        # Docker utilities
    └── entrypoint.sh              # Container startup script
```
## OpenEnv Compliance

This environment adheres to the OpenEnv specification:

- ✅ Pydantic models for Observation, Action, and Reward
- ✅ OpenEnv-compatible API (`reset()`, `step()`, `state()`)
- ✅ Task-based evaluation with graders
- ✅ Reproducible seeding
- ✅ Docker containerization
- ✅ `openenv.yaml` metadata
## Contributing

Contributions are welcome! Please follow:

- Black code formatting (`black .`)
- Type hints for all functions
- Docstrings in Google style
- Unit tests for new features
## License
MIT License - see LICENSE file for details.