Spaces:

A-R-F
/

Agentic-Reliability-Framework-API

Running

App Files Files Community

petter2025 commited on Nov 30, 2025

Commit

0dd1bfe

verified ·

1 Parent(s): f97adbe

Update README.md

Browse files

Files changed (1) hide show

README.md +35 -114

README.md CHANGED Viewed

@@ -10,132 +10,53 @@ pinned: false
 license: mit
 short_description: AI-powered reliability with multi-agent anomaly detection
 ---
-# 🧠 Agentic Reliability Framework
-**AI-Powered System Reliability with Multi-Agent Anomaly Detection & Auto-Healing**
-## 🚀 Live Demo
-**Try it now!** Enter system telemetry data and watch specialized AI agents analyze, diagnose, and recommend healing actions in real-time.
-## 🎯 What It Does
-This framework transforms traditional monitoring into **autonomous reliability engineering**:
-- **🤖 Multi-Agent AI Analysis**: Specialized agents work together to detect and diagnose issues
-- **🔧 Automated Healing**: Policy-based auto-remediation for common failures
-- **💰 Business Impact**: Real-time revenue and user impact calculations
-- **📚 Learning System**: FAISS-powered memory learns from every incident
-- **⚡ Production Ready**: Circuit breakers, adaptive thresholds, enterprise features
-## 🛠️ Quick Start
-### 1. Select a Service
-Choose from: `api-service`, `auth-service`, `payment-service`, `database`, `cache-service`
-### 2. Adjust Metrics
-- **Latency P99**: Alert threshold >150ms (adaptive)
-- **Error Rate**: Alert threshold >0.05 (5%)
-- **Throughput**: Current requests per second
-- **CPU/Memory**: Utilization (0.0-1.0 scale)
-### 3. Submit & Analyze
-Click **"Submit Telemetry Event"** to see AI agents in action!
-## 📊 Example Test Cases
-### 🚨 Critical Failure
-Component: api-service
-Latency: 800ms
-Error Rate: 0.25
-CPU: 0.95
-Memory: 0.90
-text
-*Expected: CRITICAL severity, circuit_breaker + scale_out actions*
-### ⚠️ Performance Issue
-Component: auth-service
-Latency: 350ms
-Error Rate: 0.08
-CPU: 0.75
-Memory: 0.65
-text
-*Expected: HIGH severity, traffic_shift action*
-### ✅ Normal Operation
-Component: payment-service
-Latency: 120ms
-Error Rate: 0.02
-CPU: 0.45
-Memory: 0.35
-text
-*Expected: NORMAL status, no actions needed*
-## 🔧 Technical Features
-### Multi-Agent Architecture
-- **🕵️ Detective Agent**: Anomaly detection & pattern recognition
-- **🔍 Diagnostician Agent**: Root cause analysis & investigation
-- **🤖 Orchestration Manager**: Coordinates all agents in parallel
-### Smart Detection
-- Adaptive thresholds that learn from your environment
-- Multi-dimensional anomaly scoring (0-100% confidence)
-- Correlation analysis across metrics
-- FAISS vector memory for incident similarity
-### Business Intelligence
-- Real-time revenue impact calculations
-- User impact estimation
-- Severity classification (LOW, MEDIUM, HIGH, CRITICAL)
-## 🎮 Try These Scenarios
-### Test 1: Resource Exhaustion
-Set CPU to 0.95 and Memory to 0.95 - watch scale_out actions trigger
-### Test 2: High Latency + Errors
-Set Latency to 500ms and Error Rate to 0.15 - see circuit breaker activation
-### Test 3: Gradual Degradation
-Start with normal values and slowly increase latency/errors to see adaptive thresholds
-## 🚨 Default Alert Thresholds
-| Metric | Warning | Critical |
-|--------|---------|----------|
-| Latency P99 | >150ms | >300ms |
-| Error Rate | >0.05 | >0.15 |
-| CPU Utilization | >0.8 | >0.9 |
-| Memory Utilization | >0.8 | >0.9 |
-## 🔮 Roadmap
-- [ ] Predictive anomaly detection
-- [ ] Multi-cloud coordination
-- [ ] Advanced root cause analysis
-- [ ] Automated runbook execution
-- [ ] Team learning and knowledge transfer
-## 💡 Why This Matters
-> "The most reliable system is the one that fixes itself before anyone notices there was a problem."
-This framework represents the evolution from **reactive monitoring** to **proactive, autonomous reliability engineering**.
-## 🛠️ Technical Stack
-- **Backend**: Python, FastAPI, Sentence Transformers
-- **AI/ML**: FAISS, Hugging Face, Custom Agents
-- **Frontend**: Gradio
-- **Storage**: FAISS vector database, JSON metadata
----
-**Built with ❤️ by [Juan Petter](https://huggingface.co/petter2025)**
-*AI Infrastructure Engineer | Building Self-Healing Agentic Systems*

 license: mit
 short_description: AI-powered reliability with multi-agent anomaly detection
 ---
+# 🧠 Agentic Reliability Framework (v2.0 - PATCHED)
+**Multi-Agent AI System for Production Reliability Monitoring**
+[![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/)
+[![Security: Patched](https://img.shields.io/badge/security-patched-green.svg)](requirements.txt)
+[![Tests: 40+](https://img.shields.io/badge/tests-40+-success.svg)](tests/)
+[![Coverage: 80%+](https://img.shields.io/badge/coverage-80%25+-brightgreen.svg)](tests/)
+## 🔒 Security Fixes Applied
+This version includes critical security patches:
+- ✅ **Gradio 5.50.0+** - Fixes CVE-2025-23042 (CVSS 9.1), CVE-2025-48889, CVE-2025-5320
+- ✅ **Requests 2.32.5+** - Fixes CVE-2023-32681 (CVSS 6.1), CVE-2024-47081
+- ✅ **SHA-256 Fingerprints** - Replaced insecure MD5 hashing
+- ✅ **Input Validation** - Comprehensive validation with type checking
+- ✅ **Rate Limiting** - 60 requests/minute per user
+## ⚡ Performance Improvements
+- 🚀 **70% Faster** - Native async handlers (removed event loop creation)
+- 🔄 **Non-blocking ML** - ProcessPoolExecutor for CPU-intensive operations
+- 💾 **Thread-Safe FAISS** - Single-writer pattern prevents data corruption
+- 🧠 **Memory Stable** - LRU eviction prevents memory leaks
+## 🧪 Testing & Quality
+- ✅ **40+ Unit Tests** - Comprehensive test coverage
+- ✅ **Thread Safety Tests** - Race condition prevention verified
+- ✅ **Concurrency Tests** - Multi-threaded execution validated
+- ✅ **Integration Tests** - End-to-end pipeline testing
+## 📦 Installation
+### Quick Start
+```bash
+# Clone repository
+git clone <your-repo-url>
+cd agentic-reliability-framework
+# Install dependencies
+pip install -r requirements.txt
+# Run tests
+pytest tests/ -v --cov
+# Start application
+python app.py