Spaces: A-R-F/Agentic-Reliability-Framework-v4

petter2025 committed about 1 month ago

Commit fc72d83 (verified)

1 Parent(s): df9ea5d

Update README.md

Files changed (1)

README.md +99 -443

README.md CHANGED

@@ -1,453 +1,109 @@
 ---
-title: Agentic Reliability Framework
 emoji: 🧠
 colorFrom: blue
 colorTo: purple
 sdk: gradio
-sdk_version: "4.44.1"
 app_file: app.py
 pinned: false
-license: mit
-short_description: AI-powered reliability with multi-agent anomaly detection
 ---
-🧠 Agentic Reliability Framework (v2.0)
-Production-Grade Multi-Agent AI System for Autonomous Reliability Engineering
-Transform reactive monitoring into proactive reliability with AI agents that detect, diagnose, predict, and heal production issues autonomously.
-🚀 Live Demo • 📖 Documentation • 💬 Discussions • 📅 Consultation
-✨ What's New in v2.0
-🔒 Critical Security Patches
-CVE	Severity	Component	Status
-CVE-2025-23042	CVSS 9.1	Gradio <5.50.0 (Path Traversal)	✅ Patched
-CVE-2025-48889	CVSS 7.5	Gradio (DOS via SVG)	✅ Patched
-CVE-2025-5320	CVSS 6.5	Gradio (File Override)	✅ Patched
-CVE-2023-32681	CVSS 6.1	Requests (Credential Leak)	✅ Patched
-CVE-2024-47081	CVSS 5.3	Requests (.netrc leak)	✅ Patched
-Additional Security Hardening:
-✅ SHA-256 fingerprinting (replaced insecure MD5)
-✅ Comprehensive input validation with Pydantic v2
-✅ Rate limiting: 60 req/min per user, 500 req/hour global
-✅ Thread-safe atomic operations across all components
-⚡ Performance Breakthroughs
-70% Latency Reduction:
-Metric	Before	After	Improvement
-Event Processing (p50)	~350ms	~100ms	71% faster ⚡
-Event Processing (p99)	~800ms	~250ms	69% faster ⚡
-Agent Orchestration	Sequential	Parallel	3x faster 🚀
-Memory Growth	Unbounded	Bounded	Zero leaks 💾
-Key Optimizations:
-🔄 Native async handlers (removed event loop creation overhead)
-🧵 ProcessPoolExecutor for non-blocking ML inference
-💾 LRU eviction on all unbounded data structures
-🔒 Single-writer FAISS pattern (zero corruption, atomic saves)
-🎯 Lock-free reads where possible (reduced contention)
-🧪 Enterprise-Grade Testing
-✅ 40+ unit tests (87% coverage)
-✅ Thread safety verification (race condition detection)
-✅ Concurrency stress tests (10+ threads)
-✅ Memory leak detection (bounded growth verified)
-✅ Integration tests (end-to-end validation)
-✅ Performance benchmarks (latency tracking)
-🎯 Core Capabilities
-Three Specialized AI Agents Working in Concert:
-┌─────────────────────────────────────────────────────────────┐
-│                    Your Production System                    │
-│              (APIs, Databases, Microservices)                │
-└────────────────────────┬────────────────────────────────────┘
-                         │ Telemetry Stream
-                         ▼
-         ┌───────────────────────────────────┐
-         │   Agentic Reliability Framework   │
-         └───────────────────────────────────┘
-                         │
-              ┌──────────┼──────────┐
-              ▼          ▼          ▼
-        ┌─────────┐ ┌─────────┐ ┌─────────┐
-        │🕵️ Agent │ │🔍 Agent │ │🔮 Agent │
-        │Detective│ │ Diagnos-│ │Predict- │
-        │         │ │ tician  │ │ive      │
-        │Anomaly  │ │Root     │ │Future   │
-        │Detection│ │Cause    │ │Risk     │
-        └────┬────┘ └────┬────┘ └────┬────┘
-             │           │           │
-             └───────────┼───────────┘
-                         ▼
-              ┌──────────────────┐
-              │  Policy Engine   │
-              │  (Auto-Healing)  │
-              └──────────────────┘
-                         ▼
-              ┌──────────────────┐
-              │  Healing Actions │
-              │ • Restart        │
-              │ • Scale Out      │
-              │ • Rollback       │
-              │ • Circuit Break  │
-              └──────────────────┘
-🕵️ Detective Agent - Anomaly Detection
-Adaptive multi-dimensional scoring with 95%+ accuracy
-Real-time latency spike detection (adaptive thresholds)
-Error rate anomaly classification
-Resource exhaustion monitoring (CPU/Memory)
-Throughput degradation analysis
-Confidence scoring for all detections
-Example Output:
-Anomaly Detected
-Yes
-Confidence
-0.95
-Affected Metrics
-latency, error_rate, cpu
-Severity
-CRITICAL
-🔍 Diagnostician Agent - Root Cause Analysis
-Pattern-based intelligent diagnosis
-Identifies root causes through evidence correlation:
-🗄️ Database connection failures
-🔥 Resource exhaustion patterns
-🐛 Application bugs (error spike without latency)
-🌐 External dependency failures
-⚙️ Configuration issues
-Example Output:
-Root Causes
-Item 1
-Type
-Database Connection Pool Exhausted
-Confidence
-0.85
-Evidence
-high_latency, timeout_errors
-Recommendation
-Scale connection pool or add circuit breaker
-🔮 Predictive Agent - Time-Series Forecasting
-Lightweight statistical forecasting with 15-minute lookahead
-Predicts future system state using:
-Linear regression for trending metrics
-Exponential smoothing for volatile metrics
-Time-to-failure estimates
-Risk level classification
-Example Output:
-Forecasts
-Item 1
-Metric
-latency
-Predicted Value
-815.6
-Confidence
-0.82
-Trend
-increasing
-Time To Critical
-12 minutes
-Risk Level
-critical
-🚀 Quick Start
-Prerequisites
-Python 3.10+
-4GB RAM minimum (8GB recommended)
-2 CPU cores minimum (4 cores recommended)
-Installation
-# 1. Clone the repository
-git clone https://github.com/petterjuan/agentic-reliability-framework.git
 cd agentic-reliability-framework
-# 2. Create virtual environment
-python3.10 -m venv venv
-source venv/bin/activate  # Windows: venv\Scripts\activate
-# 3. Install dependencies
-pip install --upgrade pip
-pip install -r requirements.txt
-# 4. Verify security patches
-pip show gradio requests  # Check versions match requirements.txt
-# 5. Run tests (optional but recommended)
-pytest tests/ -v --cov
-# 6. Create data directories
-mkdir -p data logs tests
-# 7. Start the application
-python app.py
-Expected Output:
-2025-12-01 09:00:00 - INFO - Loading SentenceTransformer model...
-2025-12-01 09:00:02 - INFO - SentenceTransformer model loaded successfully
-2025-12-01 09:00:02 - INFO - Initialized ProductionFAISSIndex with 0 vectors
-2025-12-01 09:00:02 - INFO - Initialized PolicyEngine with 5 policies
-2025-12-01 09:00:02 - INFO - Launching Gradio UI on 0.0.0.0:7860...
-Running on local URL:  http://127.0.0.1:7860
-First Test Event
-Navigate to http://localhost:7860 and submit:
-Component: api-service
-Latency P99: 450 ms
-Error Rate: 0.25 (25%)
-Throughput: 800 req/s
-CPU Utilization: 0.88 (88%)
-Memory Utilization: 0.75 (75%)
-Expected Response:
-✅ Status: ANOMALY
-🎯 Confidence: 95.5%
-🔥 Severity: CRITICAL
-💰 Business Impact: $21.67 revenue loss, 5374 users affected
-🚨 Recommended Actions:
-  • Scale out resources (CPU/Memory critical)
-  • Check database connections (high latency)
-  • Consider rollback (error rate >20%)
-🔮 Predictions:
-  • Latency will reach 816ms in 12 minutes
-  • Error rate will reach 37% in 15 minutes
-  • System failure imminent without intervention
-📊 Key Features
-1️⃣ Real-Time Anomaly Detection
-Sub-100ms latency (p50) for event processing
-Multi-dimensional scoring across latency, errors, resources
-Adaptive thresholds that learn from your environment
-95%+ accuracy with confidence estimates
-2️⃣ Automated Healing Policies
-5 Built-in Policies:
-Policy	Trigger	Actions	Cooldown
-High Latency Restart	Latency >500ms	Restart + Alert	5 min
-Critical Error Rollback	Error rate >30%	Rollback + Circuit Breaker	10 min
-High Error Traffic Shift	Error rate >15%	Traffic Shift + Alert	5 min
-Resource Exhaustion Scale	CPU/Memory >90%	Scale Out	10 min
-Moderate Latency Circuit	Latency >300ms	Circuit Breaker	3 min
-Cooldown & Rate Limiting:
-Prevents action spam (e.g., restart loops)
-Per-policy, per-component cooldown tracking
-Rate limits: max 5-10 executions/hour per policy
-3️⃣ Business Impact Quantification
-Calculates real-time business metrics:
-💰 Estimated revenue loss (based on throughput drop)
-👥 Affected user count (from error rate × throughput)
-⏱️ Service degradation duration
-📉 SLO breach severity
-4️⃣ Vector-Based Incident Memory
-FAISS index stores 384-dimensional embeddings of incidents
-Semantic similarity search finds similar past issues
-Solution recommendation based on historical resolutions
-Thread-safe single-writer pattern with atomic saves
-5️⃣ Predictive Analytics
-Time-series forecasting with 15-minute lookahead
-Trend detection (increasing/decreasing/stable)
-Time-to-failure estimates
-Risk classification (low/medium/high/critical)
-🛠️ Configuration
-Environment Variables
-Create a .env file:
-# Optional: Hugging Face API token
-HF_TOKEN=your_hf_token_here
-# Data persistence
-DATA_DIR=./data
-INDEX_FILE=data/incident_vectors.index
-TEXTS_FILE=data/incident_texts.json
-# Application settings
-LOG_LEVEL=INFO
-MAX_REQUESTS_PER_MINUTE=60
-MAX_REQUESTS_PER_HOUR=500
-# Server
-HOST=0.0.0.0
-PORT=7860
-Custom Healing Policies
-Add your own policies in healing_policies.py:
-custom_policy = HealingPolicy(
-    name="custom_high_latency",
-    conditions=[
-        PolicyCondition(
-            metric="latency_p99",
-            operator="gt",
-            threshold=200.0
-        )
-    ],
-    actions=[
-        HealingAction.RESTART_CONTAINER,
-        HealingAction.ALERT_TEAM
-    ],
-    priority=1,
-    cool_down_seconds=300,
-    max_executions_per_hour=5,
-    enabled=True
-)
-🐳 Docker Deployment
-Dockerfile
-FROM python:3.10-slim
-WORKDIR /app
-# Install system dependencies
-RUN apt-get update && apt-get install -y \
-    gcc g++ && \
-    rm -rf /var/lib/apt/lists/*
-# Copy and install Python dependencies
-COPY requirements.txt .
-RUN pip install --no-cache-dir -r requirements.txt
-# Copy application
-COPY . .
-# Create directories
-RUN mkdir -p data logs
-EXPOSE 7860
-CMD ["python", "app.py"]
-Docker Compose
-version: '3.8'
-services:
-  arf:
-    build: .
-    ports:
-      - "7860:7860"
-    environment:
-      - HF_TOKEN=${HF_TOKEN}
-      - LOG_LEVEL=INFO
-    volumes:
-      - ./data:/app/data
-      - ./logs:/app/logs
-    restart: unless-stopped
-    deploy:
-      resources:
-        limits:
-          cpus: '4'
-          memory: 4G
-Run:
-docker-compose up -d
-🧪 Testing
-Run All Tests
-# Basic test run
-pytest tests/ -v
-# With coverage report
-pytest tests/ --cov --cov-report=html --cov-report=term-missing
-# Coverage summary
-# models.py                 95% coverage
-# healing_policies.py       90% coverage
-# app.py                    86% coverage
-# ──────────────────────────────────────
-# TOTAL                     87% coverage
-Test Categories
-# Unit tests
-pytest tests/test_models.py -v
-pytest tests/test_policy_engine.py -v
-# Thread safety tests
-pytest tests/test_policy_engine.py::TestThreadSafety -v
-# Integration tests
-pytest tests/test_input_validation.py -v
-📈 Performance Benchmarks
-Latency Breakdown (Intel i7, 16GB RAM)
-Component	Time (p50)	Time (p99)
-Input Validation	1.2ms	3.0ms
-Event Construction	4.8ms	10.0ms
-Detective Agent	18.3ms	35.0ms
-Diagnostician Agent	22.7ms	45.0ms
-Predictive Agent	41.2ms	85.0ms
-Policy Evaluation	19.5ms	38.0ms
-Vector Encoding	15.7ms	30.0ms
-Total	~100ms	~250ms
-Throughput
-Single instance: 100+ events/second
-With rate limiting: 60 events/minute per user
-Memory stable: ~250MB steady-state
-CPU usage: ~40-60% (4 cores)
-📚 Documentation
-📖 Technical Deep Dive - Architecture & algorithms
-🔌 API Reference - Complete API documentation
-🚀 Deployment Guide - Production deployment
-🧪 Testing Guide - Test strategy & coverage
-🤝 Contributing - How to contribute
-🗺️ Roadmap
-v2.1 (Next Release)
- Distributed FAISS index (multi-node scaling)
- Prometheus/Grafana integration
- Slack/PagerDuty notifications
- Custom alerting rules engine
-v3.0 (Future)
- Reinforcement learning for policy optimization
- LSTM-based forecasting
- Graph neural networks for dependency analysis
- Federated learning for cross-org knowledge sharing
-🤝 Contributing
-We welcome contributions! See CONTRIBUTING.md for guidelines.
-Ways to contribute:
-🐛 Report bugs or security issues
-💡 Propose new features or improvements
-📝 Improve documentation
-🧪 Add test coverage
-🔧 Submit pull requests
-📄 License
-MIT License - see LICENSE file for details.
-🙏 Acknowledgments
-Built with:
-Gradio - Web UI framework
-FAISS - Vector similarity search
-Sentence-Transformers - Semantic embeddings
-Pydantic - Data validation
-Inspired by:
-Production reliability challenges at Fortune 500 companies
-SRE best practices from Google, Netflix, Amazon
-📞 Contact & Support
-Author: Juan Petter (LGCY Labs)
-Email: petter2025us@outlook.com
-LinkedIn: linkedin.com/in/petterjuan
-Schedule Consultation: calendly.com/petter2025us/30min
-Need Help?
-🐛 Report a Bug
-💡 Request a Feature
-💬 Start a Discussion
-⭐ Show Your Support
-If this project helps you build more reliable systems, please consider:
-⭐ Starring this repository
-🐦 Sharing on social media
-📝 Writing a blog post about your experience
-💬 Contributing improvements back to the project
-📊 Project Statistics
-For utopia...For money.
-Production-grade reliability engineering meets AI automation.
-Key Improvements Made:
-✅ Better Structure - Clear sections with visual hierarchy
-✅ Security Focus - Detailed CVE table with severity scores
-✅ Performance Metrics - Before/after comparison tables
-✅ Visual Architecture - ASCII diagrams for clarity
-✅ Detailed Agent Descriptions - What each agent does with examples
-✅ Quick Start Guide - Step-by-step installation with expected outputs
-✅ Configuration Examples - .env file and custom policies
-✅ Docker Support - Complete deployment instructions
-✅ Performance Benchmarks - Real latency/throughput numbers
-✅ Testing Guide - How to run tests with coverage
-✅ Roadmap - Future plans clearly outlined
-✅ Contributing Section - Encourage community involvement
-✅ Contact Info - Multiple ways to get help

 ---
+title: ARF v4 – Reliability Lab
 emoji: 🧠
 colorFrom: blue
 colorTo: purple
 sdk: gradio
+sdk_version: 4.44.1
 app_file: app.py
 pinned: false
+license: apache-2.0
+short_description: ARF v4 – Bayesian reliability demo
 ---
+# 🧠 ARF v4 – Reliability Lab
+This Space hosts a live, interactive demo of the **Agentic Reliability Framework v4 (OSS edition)**. It showcases the core intelligence engine – a hybrid Bayesian + Hamiltonian Monte Carlo (HMC) system that evaluates infrastructure incidents and produces advisory healing recommendations.
+**All outputs are advisory only – no execution.**
+[![GitHub](https://img.shields.io/badge/GitHub-Repository-blue?logo=github)](https://github.com/petter2025us/agentic-reliability-framework)
+[![Tutorial](https://img.shields.io/badge/📘-Tutorial-green)](https://github.com/petter2025us/agentic-reliability-framework/blob/main/TUTORIAL.md)
+[![Contact](https://img.shields.io/badge/📧-Email-yellow)](mailto:petter2025us@outlook.com)
+---
+## 🚀 How It Works
+The demo uses the `EnhancedReliabilityEngine` from the ARF v4 package. When you submit telemetry (component, latency, error rate, etc.), the engine:
+1. **Runs three specialised agents** in parallel:
+   - **Detective** – anomaly detection and pattern recognition
+   - **Diagnostician** – root cause analysis
+   - **Predictive** – forecasting and trend detection
+2. **Computes a risk score** using:
+   - Online **Bayesian conjugate priors** (Beta‑Binomial) per action category
+   - Offline **Hamiltonian Monte Carlo (HMC)** with NUTS (if trained) for complex patterns
+   - A weighted blend of both for the final score
+3. **Applies deterministic policy thresholds** (DPT) to recommend: `APPROVE`, `DENY`, or `ESCALATE`.
+4. **Returns a JSON object** containing the risk score, agent insights, healing actions, and (if configured) a Claude‑generated executive summary.
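Steps 2 and 3 above can be illustrated with a minimal sketch. It is not the ARF implementation: the names `BetaPrior`, `blend_risk`, and `decide`, the 0.3 blend weight, and the 0.2 / 0.8 thresholds are all assumptions chosen for illustration, and the offline HMC score is treated as an optional number supplied from elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BetaPrior:
    """Beta(alpha, beta) belief about one action category's failure rate."""
    alpha: float = 1.0  # pseudo-count of failures (uniform prior by default)
    beta: float = 1.0   # pseudo-count of successes

    def update(self, failures: int, successes: int) -> "BetaPrior":
        # Conjugacy: a Beta prior with Binomial evidence yields a Beta posterior,
        # so the online update is just adding the observed counts.
        return BetaPrior(self.alpha + failures, self.beta + successes)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def blend_risk(online: float, offline: Optional[float],
               weight_offline: float = 0.3) -> float:
    """Weighted blend; falls back to the online score when no offline model exists."""
    if offline is None:
        return online
    return (1.0 - weight_offline) * online + weight_offline * offline


def decide(risk: float, approve_below: float = 0.2, deny_above: float = 0.8) -> str:
    """Deterministic thresholds (hypothetical values) mapped to the three outcomes."""
    if risk < approve_below:
        return "APPROVE"
    if risk > deny_above:
        return "DENY"
    return "ESCALATE"


# 2 failures and 18 successes observed for a hypothetical action category.
posterior = BetaPrior().update(failures=2, successes=18)
risk = blend_risk(posterior.mean, offline=None)
print(round(risk, 3), decide(risk))  # prints "0.136 APPROVE"
```

Keeping the thresholds in a plain function makes the final decision reproducible for a given risk score, which is the point of a deterministic policy layer sitting on top of a probabilistic estimate.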
+---
+## 🧪 Try It Yourself
+Fill in the form on the left and click **Analyze**. The output appears as formatted JSON.
+Example input:
+- **Component**: `api-service`
+- **Latency P99**: `250 ms`
+- **Error Rate**: `0.08`
+- **Throughput**: `1000 req/s`
+- **CPU Utilization**: `0.7`
+- **Memory Utilization**: `0.6`
+Expected risk score: ~0.12 (low) → `APPROVE`.
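The example input gives rates and utilisations as fractions between 0 and 1 (so `0.08` means 8%), not as percentages. The sketch below writes the same input as a Python dictionary and checks those ranges before submission. The field names and the `validate_telemetry` helper are hypothetical and are not taken from the ARF package.

```python
from typing import Dict, List

# Fields expected to be fractions in [0, 1] (hypothetical names).
RATIO_FIELDS = ("error_rate", "cpu_util", "memory_util")
# Fields expected to be non-negative numbers (hypothetical names).
NON_NEGATIVE_FIELDS = ("latency_p99_ms", "throughput_rps")


def validate_telemetry(event: Dict[str, object]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the event looks sane."""
    problems: List[str] = []
    if not str(event.get("component", "")).strip():
        problems.append("component must be a non-empty string")
    for name in RATIO_FIELDS:
        value = event.get(name)
        if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
            problems.append(f"{name} must be between 0 and 1")
    for name in NON_NEGATIVE_FIELDS:
        value = event.get(name)
        if not isinstance(value, (int, float)) or value < 0:
            problems.append(f"{name} must be a non-negative number")
    return problems


example = {
    "component": "api-service",
    "latency_p99_ms": 250.0,
    "error_rate": 0.08,
    "throughput_rps": 1000.0,
    "cpu_util": 0.7,
    "memory_util": 0.6,
}

print(validate_telemetry(example))                       # prints []
print(validate_telemetry(dict(example, error_rate=8)))   # 8 is read as a percentage mistake
```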
+---
+## 📦 How This Space Is Built
+- **Base image**: `python:3.10` (via Dockerfile)
+- **Dependencies**:
+  - `git+https://github.com/petter2025us/agentic-reliability-framework.git@v4.0.0`
+  - `gradio>=4.0.0`
+- **Source code**: a minimal `app.py` that imports and runs `EnhancedReliabilityEngine`.
+All code is open source and available in the [main repository](https://github.com/petter2025us/agentic-reliability-framework).
+---
+## 🏃 Run Locally
+You can run the exact same demo on your own machine:
+```bash
+git clone https://github.com/petter2025us/agentic-reliability-framework.git
 cd agentic-reliability-framework
+python -m venv venv
+source venv/bin/activate
+pip install -e .
+pip install gradio
+python examples/app.py   # or copy the Space's app.py
+```
+Then open `http://localhost:7860`.
+---
+## 📚 Learn More
+- 📘 [Full Tutorial](https://github.com/petter2025us/agentic-reliability-framework/blob/main/TUTORIAL.md)
+- 🐙 [GitHub Repository](https://github.com/petter2025us/agentic-reliability-framework)
+- 📖 [Contributing Guidelines](https://github.com/petter2025us/agentic-reliability-framework/blob/main/CONTRIBUTING.md)
+- 💼 [Enterprise Inquiries](mailto:petter2025us@outlook.com)
+---
+## 📬 Contact
+- **Email**: [petter2025us@outlook.com](mailto:petter2025us@outlook.com)
+- **LinkedIn**: [petterjuan](https://linkedin.com/in/petterjuan)
+- **Book a call**: [Calendly – 30 min](https://calendly.com/petter2025us/30min)
+---
+*Powered by ARF v4 – Bayesian reliability for autonomous infrastructure.*