⚙️ Agentic Reliability Framework

Adaptive anomaly detection + policy-driven self-healing for AI systems
Minimal, fast, and production-focused.

Python 3.10+ | Status: MVP | License: MIT

🧠 Agentic Reliability Framework

Autonomous Reliability Engineering for Production AI Systems

Transform reactive monitoring into proactive, self-healing reliability. The Agentic Reliability Framework (ARF) is a production-grade, multi-agent system that detects, diagnoses, and predicts incidents, then resolves them automatically through healing policies, with a median (p50) end-to-end processing latency of about 100 ms.

⭐ Key Features

  • Real-time anomaly detection across latency, errors, throughput & resources
  • Root-cause analysis with evidence correlation
  • Predictive forecasting (15-minute lookahead)
  • Automated healing policies (restart, rollback, scale, circuit break)
  • Incident memory with FAISS for semantic recall
  • Security hardened (known Gradio and Requests CVEs patched, listed under Security Hardening)
  • Thread-safe, async, process-pooled architecture
  • About 100 ms end-to-end latency (p50)

🔐 Security Hardening (v2.0)

| CVE | Severity (CVSS score) | Component | Status |
| --- | --- | --- | --- |
| CVE-2025-23042 | 9.1 | Gradio Path Traversal | ✅ Patched |
| CVE-2025-48889 | 7.5 | Gradio SVG DoS | ✅ Patched |
| CVE-2025-5320 | 6.5 | Gradio File Override | ✅ Patched |
| CVE-2023-32681 | 6.1 | Requests Credential Leak | ✅ Patched |
| CVE-2024-47081 | 5.3 | Requests .netrc Leak | ✅ Patched |

Additional Hardening

  • SHA-256 hashing everywhere (no MD5)
  • Pydantic v2 input validation
  • Rate limiting (60 req/min/user)
  • Atomic operations w/ thread-safe FAISS single-writer pattern
  • Lock-free reads for high throughput
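
A per-user limit such as 60 requests per minute can be enforced with a sliding window. The sketch below is illustrative only: `SlidingWindowRateLimiter` and its method names are hypothetical and are not taken from the ARF codebase, and the class is not thread-safe on its own.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowRateLimiter:
    """Allow at most `limit` requests per `window_seconds` for each user (hypothetical helper)."""

    def __init__(self, limit: int = 60, window_seconds: float = 60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # user_id -> timestamps of accepted requests

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[user_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: reject without recording the attempt
        hits.append(now)
        return True
```

Passing `now` explicitly keeps the limiter deterministic in tests; in a threaded server, calls to `allow` would need to be wrapped in a lock.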

⚡ Lock-Free Reads for High Throughput

By restructuring the internal memory stores around lock-free, single-writer / multi-reader semantics, the framework delivers deterministic concurrency without blocking. This removes tail-latency spikes and keeps event flows smooth even under burst load.
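
One way to get single-writer / multi-reader behavior in Python is to publish immutable snapshots: a single writer thread drains a queue and rebinds a reference, and readers just read that reference. The sketch below assumes CPython, where rebinding an attribute is atomic; `SnapshotStore` is a hypothetical name and is not ARF's actual memory store.

```python
import queue
import threading


class SnapshotStore:
    """Single-writer / multi-reader store built on immutable snapshots (illustrative)."""

    def __init__(self):
        self._snapshot = ()              # readers only ever see a complete tuple
        self._pending = queue.Queue()    # all writes are funneled through this queue
        self._writer = threading.Thread(target=self._run, daemon=True)
        self._writer.start()

    def submit(self, item: str) -> None:
        """Enqueue a write; only the writer thread mutates state."""
        self._pending.put(item)

    def read(self) -> tuple:
        """Return the current snapshot without taking any lock."""
        return self._snapshot

    def flush(self) -> None:
        """Block until every queued write has been applied."""
        self._pending.join()

    def _run(self) -> None:
        while True:
            item = self._pending.get()
            # Build a new tuple, then publish it with a single reference rebind.
            self._snapshot = self._snapshot + (item,)
            self._pending.task_done()
```

Readers never block, and a snapshot they already hold never changes underneath them. The trade-off is an O(n) copy per write, which is only reasonable when reads far outnumber writes.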

Performance Impact

| Metric | Before | After | Δ |
| --- | --- | --- | --- |
| Event Processing (p50) | ~350ms | ~100ms | ⚡ 71% faster |
| Event Processing (p99) | ~800ms | ~250ms | ⚡ 69% faster |
| Agent Orchestration | Sequential | Parallel | 3× throughput |
| Memory Behavior | Growing | Stable / Bounded | 0 leaks |

🧩 Architecture Overview

System Flow

Your Production System
(APIs, Databases, Microservices)
           ↓
  Agentic Reliability Core
  Detect → Diagnose → Predict
           ↓
        Agents:
  🕵️ Detective Agent – Anomaly detection
  🔍 Diagnostician Agent – Root cause analysis
  🔮 Predictive Agent – Forecasting / risk estimation
           ↓
    Policy Engine (Auto-Healing)
           ↓
    Healing Actions:
    • Restart
    • Scale
    • Rollback
    • Circuit-break

🏗️ Core Framework Components

Web Framework & UI

  • Gradio 5.50+ - High-performance async web framework serving both API layer and interactive observability dashboard (localhost:7860)
  • Python 3.10+ - Core implementation with asynchronous, thread-safe architecture

AI/ML Stack

  • FAISS-CPU 1.13.0 - Facebook AI Similarity Search for persistent incident memory and vector operations
  • SentenceTransformers 5.1.1 - Neural embedding framework using MiniLM models from Hugging Face Hub for semantic analysis
  • NumPy 1.26.4 - Numerical computing foundation for vector operations and data processing
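
To make the idea of semantic recall concrete, here is a standard-library stand-in for the vector search step. It is not FAISS and not ARF's code: it does a brute-force cosine-similarity scan, which is the kind of exact search a flat inner-product index performs over normalized vectors. `IncidentMemory` is a hypothetical class, and the three-dimensional vectors in the usage note are toy values rather than real MiniLM embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class IncidentMemory:
    """Brute-force nearest-neighbor store (illustrative stand-in for a vector index)."""

    def __init__(self):
        self._vectors = []
        self._payloads = []

    def add(self, vector, payload):
        self._vectors.append(normalize(vector))
        self._payloads.append(payload)

    def search(self, query, k=3):
        """Return the k most similar incidents as (score, payload), best first."""
        q = normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), p)
            for v, p in zip(self._vectors, self._payloads)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

For example, after adding `[1, 0, 0]` as "db pool exhaustion" and `[0, 1, 0]` as "dependency timeout", a query of `[1, 0.05, 0]` ranks the database incident first.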

Data & HTTP Layer

  • Pydantic 2.11+ - Type-safe data modeling with frozen models for immutability and runtime validation
  • Requests 2.32.5 - HTTP client library for external API communication (security patched)

Reliability & Resilience

  • CircuitBreaker 2.0+ - Circuit breaker pattern implementation for fault tolerance and cascading failure prevention
  • AtomicWrites 1.4.1 - Atomic file operations ensuring data consistency and durability
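
The circuit breaker pattern itself is small enough to sketch by hand. The class below is a hypothetical, minimal version written for illustration; it is not the API of the CircuitBreaker package listed above. After `failure_threshold` consecutive failures it rejects calls until `recovery_timeout` seconds pass, then lets one trial call through.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock          # injectable so tests need not sleep
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half_open"       # timeout elapsed: allow a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()   # open (or re-open) the circuit
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Rejecting calls while the circuit is open is what stops a failing dependency from dragging down its callers.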

🎯 Architecture Pattern

ARF implements a Multi-Agent Orchestration Pattern with three specialized agents:

  • Detective Agent - Anomaly detection
  • Diagnostician Agent - Root cause analysis
  • Predictive Agent - Future risk forecasting

All agents run in parallel (not sequential) for 3× throughput improvement.
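
A minimal sketch of parallel agent execution, assuming the agents are I/O-bound coroutines: `asyncio.gather` starts all three before any of them finishes. The function names and the `asyncio.sleep` calls are placeholders for real analysis work, not ARF's implementation; CPU-bound agents would need a process pool instead.

```python
import asyncio

trace = []  # records start/end order so the concurrency is observable


async def run_agent(name, delay, event):
    trace.append(f"start:{name}")
    await asyncio.sleep(delay)  # placeholder for real analysis work
    trace.append(f"end:{name}")
    return name, {"service": event["service"], "done": True}


async def analyze(event):
    """Run the three agents concurrently and collect their results by name."""
    results = await asyncio.gather(
        run_agent("detective", 0.03, event),
        run_agent("diagnostician", 0.02, event),
        run_agent("predictive", 0.01, event),
    )
    return dict(results)  # gather preserves the order of its arguments
```

Calling `asyncio.run(analyze({"service": "checkout"}))` leaves three `start:` entries in `trace` before the first `end:` entry, which shows that the agents overlap rather than run one after another.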

⚡ Performance Features

  • Native async handlers (no sync-to-async wrapper overhead)
  • Thread-safe single-writer/multi-reader pattern for FAISS
  • RLock-protected policy evaluation
  • Queue-based writes to prevent race conditions
  • About 100 ms p50 latency at 100+ events/second

The framework combines Gradio for the web/UI layer, FAISS for vector memory, and SentenceTransformers for semantic analysis, all orchestrated through a custom multi-agent Python architecture designed for production reliability.

🧪 The Three Agents

🕵️ Detective Agent (Anomaly Detection)

Combines real-time vector embeddings with adaptive thresholds to surface deviations before they cascade.

  • Adaptive multi-metric scoring
  • CPU/mem resource anomaly detection
  • Latency & error spike detection
  • Confidence scoring (0–1)
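
One common way to build an adaptive threshold with a 0–1 confidence score is a rolling z-score. The sketch below is an assumption about how such scoring could work, not the Detective Agent's actual algorithm; `AdaptiveThreshold`, `z_limit`, and `min_samples` are hypothetical names.

```python
import math
from collections import deque


class AdaptiveThreshold:
    """Score a metric against a rolling baseline of its own recent values."""

    def __init__(self, window=60, z_limit=3.0, min_samples=10):
        self.values = deque(maxlen=window)  # rolling baseline
        self.z_limit = z_limit
        self.min_samples = min_samples

    def score(self, value: float) -> float:
        """Return a confidence in [0, 1] that `value` is anomalous, then record it."""
        confidence = 0.0
        if len(self.values) >= self.min_samples:
            mean = sum(self.values) / len(self.values)
            variance = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = math.sqrt(variance) or 1e-9   # avoid division by zero on flat data
            z = abs(value - mean) / std
            # A z-score equal to z_limit maps to 0.5; twice z_limit or more maps to 1.0.
            confidence = min(z / (2 * self.z_limit), 1.0)
        self.values.append(value)
        return confidence

    def is_anomalous(self, value: float) -> bool:
        return self.score(value) >= 0.5
```

Because the baseline is the metric's own recent history, the threshold follows gradual drift while still flagging sudden spikes. Note that flagged values also enter the window in this sketch, so a sustained spike eventually becomes the new baseline.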

🔍 Diagnostician Agent (Root Cause Analysis)

Identifies patterns such as:

  • DB connection pool exhaustion
  • Dependency timeouts
  • Resource saturation
  • App-layer regressions
  • Misconfigurations

🔮 Predictive Agent (Forecasting)

  • 15-minute risk projection
  • Trend analysis
  • Time-to-failure estimates
  • Risk levels: low → critical
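
A simple way to produce a 15-minute projection and a time-to-failure estimate is a least-squares line through recent samples. This sketch is illustrative, not the Predictive Agent's actual model. The function names are hypothetical, and the intermediate risk levels ("medium", "high") and their cut-offs are assumptions, since the README only names the end points.

```python
def linear_fit(points):
    """Ordinary least squares over (t_seconds, value) pairs; returns (slope, intercept)."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    denom = sum((t - mean_t) ** 2 for t, _ in points)
    if denom == 0:
        return 0.0, mean_v
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / denom
    return slope, mean_v - slope * mean_t


def forecast(points, threshold, horizon_s=900):
    """Project the trend `horizon_s` seconds ahead and estimate time to `threshold`."""
    slope, intercept = linear_fit(points)
    now = points[-1][0]
    current = slope * now + intercept
    projected = slope * (now + horizon_s) + intercept
    if current >= threshold:
        ttf = 0.0                        # already at or past the threshold
    elif slope > 0:
        ttf = (threshold - current) / slope
    else:
        ttf = None                       # flat or improving: no failure predicted
    # Assumed mapping from time-to-failure to a risk label.
    if ttf is None or ttf > horizon_s:
        risk = "low"
    elif ttf == 0.0:
        risk = "critical"
    elif ttf <= horizon_s / 3:
        risk = "high"
    else:
        risk = "medium"
    return {"projected": projected, "time_to_failure_s": ttf, "risk": risk}
```

With samples rising by 6 units per minute from 50 and a threshold of 90, the fitted slope is 0.1 per second, so a current value of 68 gives an estimated time to failure of 220 seconds.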

🚀 Quick Start

1. Clone

git clone https://github.com/petterjuan/agentic-reliability-framework.git
cd agentic-reliability-framework

2. Create environment

python3.10 -m venv venv
source venv/bin/activate     # Windows: venv\Scripts\activate

3. Install

pip install -r requirements.txt

4. Start

python app.py

UI: http://localhost:7860

🛠 Configuration

Create .env:

HF_TOKEN=your_token
DATA_DIR=./data
INDEX_FILE=data/incident_vectors.index
LOG_LEVEL=INFO
HOST=0.0.0.0
PORT=7860

Note: HF_TOKEN is optional and used for downloading SentenceTransformer models from Hugging Face Hub.
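
The variables above can be read into a typed settings object as sketched below. `Settings` and `load_settings` are hypothetical names; how ARF itself loads configuration is not shown in this README. The sketch reads from the process environment, so the `.env` file must already have been exported or loaded by whatever tool you use.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    data_dir: str
    index_file: str
    log_level: str
    host: str
    port: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, falling back to the README defaults."""
    env = os.environ if env is None else env
    return Settings(
        hf_token=env.get("HF_TOKEN") or None,   # optional: empty string counts as unset
        data_dir=env.get("DATA_DIR", "./data"),
        index_file=env.get("INDEX_FILE", "data/incident_vectors.index"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "7860")),      # raises ValueError on a non-numeric port
    )
```

Accepting a mapping argument makes the loader easy to test without touching the real environment.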

🧩 Custom Healing Policies

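# HealingPolicy, PolicyCondition, and HealingAction are defined by the framework;
# import them from the module that declares them in your checkout.
custom = HealingPolicy(
    name="custom_latency",
    conditions=[PolicyCondition("latency_p99", "gt", 200)],  # fire when latency_p99 exceeds 200
    actions=[HealingAction.RESTART_CONTAINER, HealingAction.ALERT_TEAM],
    priority=1,
    cool_down_seconds=300,       # wait at least 5 minutes between executions
    max_executions_per_hour=5,   # hard cap on how often the policy may run
)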
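
The `cool_down_seconds` and `max_executions_per_hour` fields suggest two guards on every policy execution. The sketch below shows one plausible way to enforce them; `ExecutionGuard` is a hypothetical helper, not the Policy Engine's actual code.

```python
import time
from collections import deque


class ExecutionGuard:
    """Enforce a cool-down and an hourly cap on policy executions (illustrative)."""

    def __init__(self, cool_down_seconds, max_executions_per_hour, clock=time.monotonic):
        self.cool_down_seconds = cool_down_seconds
        self.max_executions_per_hour = max_executions_per_hour
        self._clock = clock          # injectable so tests need not sleep
        self._history = deque()      # timestamps of executions in the last hour

    def try_execute(self) -> bool:
        """Return True and record the run if the policy may execute now."""
        now = self._clock()
        # Forget executions that are more than an hour old.
        while self._history and now - self._history[0] >= 3600:
            self._history.popleft()
        if self._history and now - self._history[-1] < self.cool_down_seconds:
            return False             # still cooling down from the last run
        if len(self._history) >= self.max_executions_per_hour:
            return False             # hourly budget exhausted
        self._history.append(now)
        return True
```

With the values from the policy above, a restart at t=0 blocks another one until t=300, and a sixth attempt within the same hour is refused even if the cool-down has elapsed.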

🐳 Docker Deployment

A Dockerfile and a docker-compose.yml are included.

docker-compose up -d

📈 Performance Benchmarks

On Intel i7, 16GB RAM:

| Component | p50 | p99 |
| --- | --- | --- |
| Total End-to-End | ~100ms | ~250ms |
| Policy Engine | 19ms | 38ms |
| Vector Encoding | 15ms | 30ms |

Stable memory: ~250MB
Throughput: 100+ events/sec

🧪 Testing

Production Dependencies

pip install -r requirements.txt

Development Dependencies

pip install pytest pytest-asyncio pytest-cov pytest-mock black ruff mypy

Run Tests

pytest tests/ -v --cov

Coverage: 87%

Includes:

  • Unit tests
  • Thread-safety tests
  • Stress tests
  • Integration tests

Code Quality

# Format code
black .

# Lint code
ruff check .

# Type checking
mypy app.py

🗺 Roadmap

v2.1

  • Distributed FAISS
  • Prometheus / Grafana
  • Slack & PagerDuty integration
  • Custom alerting DSL

v3.0

  • Reinforcement learning for policy optimization
  • LSTM forecasting
  • Dependency graph neural networks

🤝 Contributing

Pull requests welcome.

Please run tests before submitting.

📬 Contact

Author: Juan Petter (LGCY Labs)

⭐ Support

If this project helps you:

  • ⭐ Star the repo
  • 🔄 Share with your network
  • 🐛 Report issues
  • 💡 Suggest features

Built with ❤️ for production reliability