⚙️ Agentic Reliability Framework

Adaptive anomaly detection + policy-driven self-healing for AI systems
Minimal, fast, and production-focused.

🧠 Agentic Reliability Framework

Autonomous Reliability Engineering for Production AI Systems

Transform reactive monitoring into proactive, self-healing reliability. The Agentic Reliability Framework (ARF) is a production-grade, multi-agent system that detects, diagnoses, predicts, and resolves incidents automatically, with a reported median (p50) end-to-end latency of about 100ms.

⭐ Key Features

Real-time anomaly detection across latency, errors, throughput & resources
Root-cause analysis with evidence correlation
Predictive forecasting (15-minute lookahead)
Automated healing policies (restart, rollback, scale, circuit break)
Incident memory with FAISS for semantic recall
Security hardened (the known Gradio and Requests CVEs listed under Security Hardening are patched)
Thread-safe, async, process-pooled architecture
~100ms end-to-end latency (p50)

🔐 Security Hardening (v2.0)

| CVE | Severity (CVSS) | Component | Status |
| --- | --- | --- | --- |
| CVE-2025-23042 | 9.1 | Gradio Path Traversal | ✅ Patched |
| CVE-2025-48889 | 7.5 | Gradio SVG DoS | ✅ Patched |
| CVE-2025-5320 | 6.5 | Gradio File Override | ✅ Patched |
| CVE-2023-32681 | 6.1 | Requests Credential Leak | ✅ Patched |
| CVE-2024-47081 | 5.3 | Requests .netrc Leak | ✅ Patched |

Additional Hardening

SHA-256 hashing everywhere (no MD5)
Pydantic v2 input validation
Rate limiting (60 req/min/user)
Atomic operations w/ thread-safe FAISS single-writer pattern
Lock-free reads for high throughput
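
The 60 req/min/user limit listed above could be enforced with a sliding window per user. The sketch below is only an illustration using the standard library; the class name `SlidingWindowRateLimiter` and its parameters are assumptions and are not taken from the ARF codebase.

```python
import time
from collections import defaultdict, deque
from threading import Lock


class SlidingWindowRateLimiter:
    """Allow at most `limit` requests per `window_seconds` for each user."""

    def __init__(self, limit=60, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # user_id -> timestamps of accepted requests
        self._lock = Lock()

    def allow(self, user_id):
        """Return True and record the request if the user is under the limit."""
        now = self._clock()
        with self._lock:
            hits = self._hits[user_id]
            # Drop timestamps that have fallen out of the window.
            while hits and now - hits[0] >= self.window_seconds:
                hits.popleft()
            if len(hits) >= self.limit:
                return False
            hits.append(now)
            return True
```

A request handler would call `limiter.allow(user_id)` first and reject the request when it returns `False`.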

⚡ Lock-Free Reads for High Throughput

By restructuring the internal memory stores around lock-free, single-writer / multi-reader semantics, the framework delivers deterministic concurrency without blocking. This removes tail-latency spikes and keeps event flows smooth even under burst load.
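
One way to picture single-writer / multi-reader semantics is a copy-on-write snapshot: readers grab the current immutable snapshot without locking, and a single writer thread applies queued updates. This is a minimal sketch under that assumption; `SnapshotStore` is a hypothetical name, and the real ARF memory store (which wraps FAISS) may be structured differently.

```python
import queue
import threading


class SnapshotStore:
    """Single-writer / multi-reader store.

    Readers never take a lock: they read whichever immutable snapshot is
    currently published. All mutations go through a queue drained by one
    writer thread, which builds a new tuple and swaps the reference.
    """

    def __init__(self):
        self._snapshot = ()            # immutable; replaced, never mutated
        self._pending = queue.Queue()  # writes are serialized through this queue
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def read(self):
        # Reading a reference is atomic in CPython, so no lock is needed here.
        return self._snapshot

    def append(self, item):
        self._pending.put(item)

    def flush(self):
        # Block until the writer has applied every queued item.
        self._pending.join()

    def _drain(self):
        while True:
            item = self._pending.get()
            self._snapshot = self._snapshot + (item,)  # copy-on-write
            self._pending.task_done()
```

Copying the tuple on every write costs O(n), so this shape suits read-heavy workloads; a reader may briefly see a snapshot that is one write behind.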

Performance Impact

| Metric | Before | After | Δ |
| --- | --- | --- | --- |
| Event Processing (p50) | ~350ms | ~100ms | ⚡ 71% faster |
| Event Processing (p99) | ~800ms | ~250ms | ⚡ 69% faster |
| Agent Orchestration | Sequential | Parallel | 3× throughput |
| Memory Behavior | Growing | Stable / Bounded | 0 leaks |

🧩 Architecture Overview

System Flow

Your Production System
(APIs, Databases, Microservices)
           ↓
  Agentic Reliability Core
  Detect → Diagnose → Predict
           ↓
        Agents:
  🕵️ Detective Agent – Anomaly detection
  🔍 Diagnostician Agent – Root cause analysis
  🔮 Predictive Agent – Forecasting / risk estimation
           ↓
    Policy Engine (Auto-Healing)
           ↓
    Healing Actions:
    • Restart
    • Scale
    • Rollback
    • Circuit-break

🏗️ Core Framework Components

Web Framework & UI

Gradio 5.50+ - Async web framework serving both the API layer and the interactive observability dashboard (localhost:7860)
Python 3.10+ - Core implementation with asynchronous, thread-safe architecture

AI/ML Stack

FAISS-CPU 1.13.0 - Facebook AI Similarity Search for persistent incident memory and vector operations
SentenceTransformers 5.1.1 - Neural embedding framework using MiniLM models from Hugging Face Hub for semantic analysis
NumPy 1.26.4 - Numerical computing foundation for vector operations and data processing
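
To illustrate what "semantic recall" over incident memory means, here is a brute-force cosine-similarity lookup in plain Python. It is a stand-in, not FAISS: ARF uses FAISS with SentenceTransformers embeddings, whereas the `IncidentMemory` class and the toy 3-dimensional vectors below are assumptions made for this example.

```python
import math


class IncidentMemory:
    """Brute-force cosine-similarity recall over stored incident vectors."""

    def __init__(self):
        self._vectors = []  # unit-length vectors
        self._notes = []    # one human-readable note per vector

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0:
            raise ValueError("zero vector cannot be normalized")
        return [x / norm for x in vec]

    def add(self, vector, note):
        self._vectors.append(self._normalize(vector))
        self._notes.append(note)

    def recall(self, query, k=3):
        """Return up to k (similarity, note) pairs, most similar first."""
        q = self._normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), note)
            for v, note in zip(self._vectors, self._notes)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


memory = IncidentMemory()
memory.add([1.0, 0.0, 0.0], "db pool exhaustion")
memory.add([0.0, 1.0, 0.0], "dependency timeout")
memory.add([0.9, 0.1, 0.0], "db slow queries")
top = memory.recall([1.0, 0.05, 0.0], k=2)
```

A vector index such as FAISS serves the same purpose but scales to far more vectors than a linear scan.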

Data & HTTP Layer

Pydantic 2.11+ - Type-safe data modeling with frozen models for immutability and runtime validation
Requests 2.32.5 - HTTP client library for external API communication (security patched)

Reliability & Resilience

CircuitBreaker 2.0+ - Circuit breaker pattern implementation for fault tolerance and cascading failure prevention
AtomicWrites 1.4.1 - Atomic file operations ensuring data consistency and durability
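
For readers unfamiliar with the circuit breaker pattern, the sketch below shows the core state machine (closed, open, half-open). It is a simplified illustration written for this README, not the API of the CircuitBreaker package; `SimpleCircuitBreaker` and `CircuitOpenError` are hypothetical names, and the class is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping calls to a flaky dependency in `breaker.call(...)` makes repeated failures fail fast instead of piling up behind timeouts.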

🎯 Architecture Pattern

ARF implements a Multi-Agent Orchestration Pattern with three specialized agents:

Detective Agent - Anomaly detection
Diagnostician Agent - Root cause analysis
Predictive Agent - Future risk forecasting

All agents run in parallel rather than sequentially, which the project reports as a 3× throughput improvement.
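
Running the three agents concurrently can be sketched with `asyncio.gather`. The agent functions here are placeholders with made-up return values, assumed only for illustration; ARF's actual agent interfaces are not shown in this README.

```python
import asyncio


async def detective(event):
    await asyncio.sleep(0.01)  # stands in for real anomaly scoring
    return {"agent": "detective", "anomaly": event["latency_p99"] > 200}


async def diagnostician(event):
    await asyncio.sleep(0.01)  # stands in for root cause analysis
    return {"agent": "diagnostician", "root_cause": "unknown"}


async def predictive(event):
    await asyncio.sleep(0.01)  # stands in for forecasting
    return {"agent": "predictive", "risk": "low"}


async def analyze(event):
    # gather() schedules all three coroutines concurrently and returns
    # their results in the order the coroutines were passed in.
    return await asyncio.gather(
        detective(event), diagnostician(event), predictive(event)
    )


results = asyncio.run(analyze({"latency_p99": 350}))
```

Concurrency of this kind only overlaps waiting time; CPU-bound agent work would need a thread or process pool to run truly in parallel.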

⚡ Performance Features

Native async handlers (no sync-to-async bridging overhead)
Thread-safe single-writer/multi-reader pattern for FAISS
RLock-protected policy evaluation
Queue-based writes to prevent race conditions
~100ms p50 latency at 100+ events/second

The framework combines Gradio for the web/UI layer, FAISS for vector memory, and SentenceTransformers for semantic analysis, all orchestrated through a custom multi-agent Python architecture designed for production reliability.

🧪 The Three Agents

🕵️ Detective Agent (Anomaly Detection)

Combines real-time vector embeddings with adaptive thresholds to surface deviations before they cascade.

Adaptive multi-metric scoring
CPU/mem resource anomaly detection
Latency & error spike detection
Confidence scoring (0–1)
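
As a rough illustration of adaptive thresholds with a 0 to 1 confidence score, the sketch below flags values that deviate from a rolling baseline using a z-score. The z-score approach, the class name `AdaptiveThreshold`, and all parameter defaults are assumptions; the README does not specify which statistic the Detective Agent uses.

```python
import math
from collections import deque


class AdaptiveThreshold:
    """Flag values that deviate from a rolling baseline (z-score based)."""

    def __init__(self, window=50, z_limit=3.0, min_samples=5):
        self.values = deque(maxlen=window)  # rolling baseline
        self.z_limit = z_limit
        self.min_samples = min_samples

    def score(self, value):
        """Return (is_anomaly, confidence in [0, 1]), then record the value."""
        is_anomaly, confidence = False, 0.0
        if len(self.values) >= self.min_samples:
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = math.sqrt(var) or 1e-9  # avoid division by zero on flat data
            z = abs(value - mean) / std
            is_anomaly = z > self.z_limit
            # Map the z-score onto [0, 1]; confidence saturates at 2 * z_limit.
            confidence = min(z / (2 * self.z_limit), 1.0)
        self.values.append(value)
        return is_anomaly, confidence
```

Because the baseline moves with recent data, the threshold adapts as normal behavior drifts. This simple version also records anomalous values, which a production detector would likely exclude from the baseline.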

🔍 Diagnostician Agent (Root Cause Analysis)

Identifies patterns such as:

DB connection pool exhaustion
Dependency timeouts
Resource saturation
App-layer regressions
Misconfigurations

🔮 Predictive Agent (Forecasting)

15-minute risk projection
Trend analysis
Time-to-failure estimates
Risk levels: low → critical
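
A 15-minute projection and a time-to-failure estimate can be approximated by fitting a straight line to recent samples and extrapolating. This is a minimal sketch assuming linear trends; the function names `fit_trend` and `forecast` are hypothetical, and the Predictive Agent's real model is not described in this README.

```python
def fit_trend(samples):
    """Least-squares line through (t_seconds, value) pairs.

    Returns (slope, intercept).
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return 0.0, mean_v  # all samples share one timestamp: no trend
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / denom
    return slope, mean_v - slope * mean_t


def forecast(samples, limit, horizon_seconds=15 * 60):
    """Return (projected value at the horizon, seconds until `limit` is reached).

    The second element is None when the fitted trend never reaches the limit.
    """
    slope, intercept = fit_trend(samples)
    last_t = samples[-1][0]
    projected = slope * (last_t + horizon_seconds) + intercept
    current = slope * last_t + intercept
    if current >= limit:
        time_to_failure = 0.0
    elif slope > 0:
        time_to_failure = (limit - current) / slope
    else:
        time_to_failure = None
    return projected, time_to_failure
```

For example, memory usage sampled once a minute could be passed as `[(0, 50.0), (60, 56.0), ...]` with `limit=90` to estimate when usage would reach 90%.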

🚀 Quick Start

1. Clone

git clone https://github.com/petterjuan/agentic-reliability-framework.git
cd agentic-reliability-framework

2. Create environment

python3.10 -m venv venv
source venv/bin/activate     # Windows: venv\Scripts\activate

3. Install

pip install -r requirements.txt

4. Start

python app.py

UI: http://localhost:7860

🛠 Configuration

Create .env:

HF_TOKEN=your_token
DATA_DIR=./data
INDEX_FILE=data/incident_vectors.index
LOG_LEVEL=INFO
HOST=0.0.0.0
PORT=7860

Note: HF_TOKEN is optional and used for downloading SentenceTransformer models from Hugging Face Hub.
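
The settings above could be read into a typed object at startup. The sketch below assumes the values are already present in the process environment (for example exported by the shell or loaded by a dotenv tool); the `Settings` class and `load_settings` function are hypothetical, and the defaults simply mirror the example values shown above.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    data_dir: str
    index_file: str
    log_level: str
    host: str
    port: int


def load_settings(env=os.environ):
    """Build Settings from a mapping of environment variables."""
    return Settings(
        hf_token=env.get("HF_TOKEN") or None,  # optional; empty string means unset
        data_dir=env.get("DATA_DIR", "./data"),
        index_file=env.get("INDEX_FILE", "data/incident_vectors.index"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "7860")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.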

🧩 Custom Healing Policies

custom = HealingPolicy(
    name="custom_latency",
    conditions=[PolicyCondition("latency_p99", "gt", 200)],
    actions=[HealingAction.RESTART_CONTAINER, HealingAction.ALERT_TEAM],
    priority=1,
    cool_down_seconds=300,
    max_executions_per_hour=5,
)
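
To make the policy fields concrete, here is a hedged sketch of how `cool_down_seconds` and `max_executions_per_hour` might be enforced. Everything below is an assumption written for illustration: the stand-in definitions of `HealingAction`, `PolicyCondition`, and `HealingPolicy`, the `PolicyEngine` class, and the choice that a lower `priority` number is evaluated first are not taken from the ARF source.

```python
import operator
import time
from dataclasses import dataclass
from enum import Enum
from threading import RLock


class HealingAction(Enum):  # hypothetical stand-in
    RESTART_CONTAINER = "restart_container"
    ALERT_TEAM = "alert_team"


@dataclass(frozen=True)
class PolicyCondition:  # hypothetical stand-in
    metric: str
    op: str  # one of "gt", "lt", "ge", "le"
    threshold: float


@dataclass
class HealingPolicy:  # hypothetical stand-in
    name: str
    conditions: list
    actions: list
    priority: int = 1
    cool_down_seconds: float = 300
    max_executions_per_hour: int = 5


_OPS = {"gt": operator.gt, "lt": operator.lt, "ge": operator.ge, "le": operator.le}


class PolicyEngine:
    def __init__(self, policies, clock=time.monotonic):
        # Assumption: a lower priority number is evaluated first.
        self._policies = sorted(policies, key=lambda p: p.priority)
        self._clock = clock
        self._history = {p.name: [] for p in policies}  # execution timestamps
        self._lock = RLock()

    def evaluate(self, metrics):
        """Return the actions of every policy that matches and is not throttled."""
        now = self._clock()
        actions = []
        with self._lock:
            for policy in self._policies:
                matched = all(
                    c.metric in metrics and _OPS[c.op](metrics[c.metric], c.threshold)
                    for c in policy.conditions
                )
                if not matched:
                    continue
                recent = [t for t in self._history[policy.name] if now - t < 3600]
                if recent and now - recent[-1] < policy.cool_down_seconds:
                    continue  # still cooling down
                if len(recent) >= policy.max_executions_per_hour:
                    continue  # hourly budget exhausted
                recent.append(now)
                self._history[policy.name] = recent
                actions.extend(policy.actions)
        return actions
```

With these stand-ins, the `custom` policy above would fire when `latency_p99` exceeds 200, then stay silent for 300 seconds and never run more than 5 times in any hour.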

🐳 Docker Deployment

A Dockerfile and docker-compose.yml are included.

docker-compose up -d

📈 Performance Benchmarks

On Intel i7, 16GB RAM:

| Component | p50 | p99 |
| --- | --- | --- |
| Total End-to-End | ~100ms | ~250ms |
| Policy Engine | 19ms | 38ms |
| Vector Encoding | 15ms | 30ms |

Stable memory: ~250MB
Throughput: 100+ events/sec

🧪 Testing

Production Dependencies

pip install -r requirements.txt

Development Dependencies

pip install pytest pytest-asyncio pytest-cov pytest-mock black ruff mypy

Run Tests

pytest tests/ -v --cov

Coverage: 87%

Includes:

Unit tests
Thread-safety tests
Stress tests
Integration tests

Code Quality

# Format code
black .

# Lint code
ruff check .

# Type checking
mypy app.py

🗺 Roadmap

v2.1

Distributed FAISS
Prometheus / Grafana
Slack & PagerDuty integration
Custom alerting DSL

v3.0

Reinforcement learning for policy optimization
LSTM forecasting
Dependency graph neural networks

🤝 Contributing

Pull requests welcome.

Please run tests before submitting.

📬 Contact

Author: Juan Petter (LGCY Labs)

⭐ Support

If this project helps you:

⭐ Star the repo
🔄 Share with your network
🐛 Report issues
💡 Suggest features

Built with ❤️ for production reliability