petter2025 committed on
Commit fc72d83 · verified · 1 Parent(s): df9ea5d

Update README.md

Files changed (1)
  1. README.md +99 -443
README.md CHANGED
@@ -1,453 +1,109 @@
1
  ---
2
- title: Agentic Reliability Framework
3
  emoji: 🧠
4
  colorFrom: blue
5
  colorTo: purple
6
  sdk: gradio
7
- sdk_version: "4.44.1"
8
  app_file: app.py
9
  pinned: false
10
- license: mit
11
- short_description: AI-powered reliability with multi-agent anomaly detection
12
  ---
13
- 🧠 Agentic Reliability Framework (v2.0)
14
- Production-Grade Multi-Agent AI System for Autonomous Reliability Engineering
15
-
16
-
17
-
18
-
19
-
20
- Transform reactive monitoring into proactive reliability with AI agents that detect, diagnose, predict, and heal production issues autonomously.
21
- 🚀 Live Demo • 📖 Documentation • 💬 Discussions • 📅 Consultation
22
- ✨ What's New in v2.0
23
- 🔒 Critical Security Patches
24
- CVE Severity Component Status
25
- CVE-2025-23042 CVSS 9.1 Gradio <5.50.0 (Path Traversal) ✅ Patched
26
- CVE-2025-48889 CVSS 7.5 Gradio (DOS via SVG) ✅ Patched
27
- CVE-2025-5320 CVSS 6.5 Gradio (File Override) ✅ Patched
28
- CVE-2023-32681 CVSS 6.1 Requests (Credential Leak) Patched
29
- CVE-2024-47081 CVSS 5.3 Requests (.netrc leak) ✅ Patched
30
- Additional Security Hardening:
31
- ✅ SHA-256 fingerprinting (replaced insecure MD5)
32
- Comprehensive input validation with Pydantic v2
33
- Rate limiting: 60 req/min per user, 500 req/hour global
34
- ✅ Thread-safe atomic operations across all components
35
- Performance Breakthroughs
36
- 70% Latency Reduction:
37
- Metric Before After Improvement
38
- Event Processing (p50) ~350ms ~100ms 71% faster
39
- Event Processing (p99) ~800ms ~250ms 69% faster ⚡
40
- Agent Orchestration Sequential Parallel 3x faster 🚀
41
- Memory Growth Unbounded Bounded Zero leaks 💾
42
- Key Optimizations:
43
- 🔄 Native async handlers (removed event loop creation overhead)
44
- 🧵 ProcessPoolExecutor for non-blocking ML inference
45
- 💾 LRU eviction on all unbounded data structures
46
- 🔒 Single-writer FAISS pattern (zero corruption, atomic saves)
47
- 🎯 Lock-free reads where possible (reduced contention)
48
- 🧪 Enterprise-Grade Testing
49
- ✅ 40+ unit tests (87% coverage)
50
- Thread safety verification (race condition detection)
51
- Concurrency stress tests (10+ threads)
52
- Memory leak detection (bounded growth verified)
53
- Integration tests (end-to-end validation)
54
- Performance benchmarks (latency tracking)
55
- 🎯 Core Capabilities
56
- Three Specialized AI Agents Working in Concert:
57
- ┌─────────────────────────────────────────────────────────────┐
58
- │ Your Production System │
59
- │ (APIs, Databases, Microservices) │
60
- └────────────────────────┬────────────────────────────────────┘
61
- │ Telemetry Stream
62
-
63
- ┌───────────────────────────────────┐
64
- │ Agentic Reliability Framework │
65
- └───────────────────────────────────┘
66
-
67
- ┌──────────┼──────────┐
68
- ▼ ▼ ▼
69
- ┌─────────┐ ┌─────────┐ ┌─────────┐
70
- │🕵️ Agent │🔍 Agent │🔮 Agent
71
- │Detective│ │ Diagnos-│ │Predict- │
72
- │ │ │ tician │ │ive │
73
- │Anomaly │ │Root │ │Future │
74
- │Detection│ │Cause │ │Risk │
75
- └────┬────┘ └────┬────┘ └────┬────┘
76
- │ │ │
77
- └───────────┼───────────┘
78
-
79
- ┌──────────────────┐
80
- │ Policy Engine │
81
- │ (Auto-Healing) │
82
- └──────────────────┘
83
-
84
- ┌──────────────────┐
85
- │ Healing Actions │
86
- │ • Restart │
87
- │ • Scale Out │
88
- │ • Rollback │
89
- │ • Circuit Break │
90
- └──────────────────┘
91
- 🕵️ Detective Agent - Anomaly Detection
92
- Adaptive multi-dimensional scoring with 95%+ accuracy
93
- Real-time latency spike detection (adaptive thresholds)
94
- Error rate anomaly classification
95
- Resource exhaustion monitoring (CPU/Memory)
96
- Throughput degradation analysis
97
- Confidence scoring for all detections
98
- Example Output:
99
- Anomaly Detected
100
- Yes
101
- Confidence
102
- 0.95
103
- Affected Metrics
104
- latency, error_rate, cpu
105
- Severity
106
- CRITICAL
107
- 🔍 Diagnostician Agent - Root Cause Analysis
108
- Pattern-based intelligent diagnosis
109
- Identifies root causes through evidence correlation:
110
- 🗄️ Database connection failures
111
- 🔥 Resource exhaustion patterns
112
- 🐛 Application bugs (error spike without latency)
113
- 🌐 External dependency failures
114
- ⚙️ Configuration issues
115
- Example Output:
116
- Root Causes
117
- Item 1
118
- Type
119
- Database Connection Pool Exhausted
120
- Confidence
121
- 0.85
122
- Evidence
123
- high_latency, timeout_errors
124
- Recommendation
125
- Scale connection pool or add circuit breaker
126
- 🔮 Predictive Agent - Time-Series Forecasting
127
- Lightweight statistical forecasting with 15-minute lookahead
128
- Predicts future system state using:
129
- Linear regression for trending metrics
130
- Exponential smoothing for volatile metrics
131
- Time-to-failure estimates
132
- Risk level classification
133
- Example Output:
134
- Forecasts
135
- Item 1
136
- Metric
137
- latency
138
- Predicted Value
139
- 815.6
140
- Confidence
141
- 0.82
142
- Trend
143
- increasing
144
- Time To Critical
145
- 12 minutes
146
- Risk Level
147
- critical
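The removed README names linear regression, exponential smoothing, and time-to-failure estimates for the Predictive Agent but shows no code. The following is a minimal sketch of those three ideas in plain Python, assuming one sample per minute; the function names, the smoothing factor, and the sample values are hypothetical and are not taken from the ARF code base.

```python
from typing import Optional, Sequence


def linear_forecast(values: Sequence[float], steps_ahead: int) -> float:
    """Fit a least-squares line to equally spaced samples and extrapolate it."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # The last observed sample sits at x = n - 1.
    return intercept + slope * (n - 1 + steps_ahead)


def exponential_smoothing(values: Sequence[float], alpha: float = 0.3) -> float:
    """Return the smoothed level after the last sample (simple exponential smoothing)."""
    level = values[0]
    for value in values[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def minutes_to_threshold(
    values: Sequence[float], threshold: float, max_minutes: int = 15
) -> Optional[int]:
    """First minute within the lookahead at which the linear trend reaches the threshold."""
    for minute in range(1, max_minutes + 1):
        if linear_forecast(values, minute) >= threshold:
            return minute
    return None


# Hypothetical latency samples in ms, one per minute.
latency = [100.0, 110.0, 120.0, 130.0]
forecast_15 = linear_forecast(latency, 15)
time_to_critical = minutes_to_threshold(latency, threshold=200.0)
```

Returning `None` when the threshold is not reached inside the lookahead window keeps "no failure predicted" distinct from a numeric estimate.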
148
- 🚀 Quick Start
149
- Prerequisites
150
- Python 3.10+
151
- 4GB RAM minimum (8GB recommended)
152
- 2 CPU cores minimum (4 cores recommended)
153
- Installation
154
- # 1. Clone the repository
155
- git clone https://github.com/petterjuan/agentic-reliability-framework.git
156
  cd agentic-reliability-framework
157
 
158
- # 2. Create virtual environment
159
- python3.10 -m venv venv
160
- source venv/bin/activate # Windows: venv\Scripts\activate
161
-
162
- # 3. Install dependencies
163
- pip install --upgrade pip
164
- pip install -r requirements.txt
165
-
166
- # 4. Verify security patches
167
- pip show gradio requests # Check versions match requirements.txt
168
-
169
- # 5. Run tests (optional but recommended)
170
- pytest tests/ -v --cov
171
-
172
- # 6. Create data directories
173
- mkdir -p data logs tests
174
-
175
- # 7. Start the application
176
- python app.py
177
- Expected Output:
178
- 2025-12-01 09:00:00 - INFO - Loading SentenceTransformer model...
179
- 2025-12-01 09:00:02 - INFO - SentenceTransformer model loaded successfully
180
- 2025-12-01 09:00:02 - INFO - Initialized ProductionFAISSIndex with 0 vectors
181
- 2025-12-01 09:00:02 - INFO - Initialized PolicyEngine with 5 policies
182
- 2025-12-01 09:00:02 - INFO - Launching Gradio UI on 0.0.0.0:7860...
183
-
184
- Running on local URL: http://127.0.0.1:7860
185
- First Test Event
186
- Navigate to http://localhost:7860 and submit:
187
- Component: api-service
188
- Latency P99: 450 ms
189
- Error Rate: 0.25 (25%)
190
- Throughput: 800 req/s
191
- CPU Utilization: 0.88 (88%)
192
- Memory Utilization: 0.75 (75%)
193
- Expected Response:
194
- ✅ Status: ANOMALY
195
- 🎯 Confidence: 95.5%
196
- 🔥 Severity: CRITICAL
197
- 💰 Business Impact: $21.67 revenue loss, 5374 users affected
198
-
199
- 🚨 Recommended Actions:
200
- • Scale out resources (CPU/Memory critical)
201
- • Check database connections (high latency)
202
- • Consider rollback (error rate >20%)
203
-
204
- 🔮 Predictions:
205
- • Latency will reach 816ms in 12 minutes
206
- • Error rate will reach 37% in 15 minutes
207
- • System failure imminent without intervention
208
- 📊 Key Features
209
- 1️⃣ Real-Time Anomaly Detection
210
- Sub-100ms latency (p50) for event processing
211
- Multi-dimensional scoring across latency, errors, resources
212
- Adaptive thresholds that learn from your environment
213
- 95%+ accuracy with confidence estimates
214
- 2️⃣ Automated Healing Policies
215
- 5 Built-in Policies:
216
- Policy Trigger Actions Cooldown
217
- High Latency Restart Latency >500ms Restart + Alert 5 min
218
- Critical Error Rollback Error rate >30% Rollback + Circuit Breaker 10 min
219
- High Error Traffic Shift Error rate >15% Traffic Shift + Alert 5 min
220
- Resource Exhaustion Scale CPU/Memory >90% Scale Out 10 min
221
- Moderate Latency Circuit Latency >300ms Circuit Breaker 3 min
222
- Cooldown & Rate Limiting:
223
- Prevents action spam (e.g., restart loops)
224
- Per-policy, per-component cooldown tracking
225
- Rate limits: max 5-10 executions/hour per policy
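The cooldown and rate-limit behaviour described in the removed README can be sketched as follows. This is an illustrative sketch only: the class `CooldownTracker` and its method are hypothetical, and only the parameter names `cool_down_seconds` and `max_executions_per_hour` are borrowed from the `HealingPolicy` example elsewhere in the same file. Cooldown is tracked per policy and component, while the hourly cap is tracked per policy.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class CooldownTracker:
    """Hypothetical gate that decides whether a healing action may run now."""

    def __init__(self, cool_down_seconds: float, max_executions_per_hour: int):
        self.cool_down_seconds = cool_down_seconds
        self.max_executions_per_hour = max_executions_per_hour
        self._last_run = {}                # (policy, component) -> last execution time
        self._runs = defaultdict(deque)    # policy -> execution times in the last hour

    def try_execute(self, policy: str, component: str, now: Optional[float] = None) -> bool:
        """Record and allow the execution, or return False if it is blocked."""
        now = time.monotonic() if now is None else now

        # Drop executions that fell out of the one-hour window.
        runs = self._runs[policy]
        while runs and now - runs[0] >= 3600:
            runs.popleft()

        last = self._last_run.get((policy, component))
        if last is not None and now - last < self.cool_down_seconds:
            return False  # still cooling down for this component
        if len(runs) >= self.max_executions_per_hour:
            return False  # hourly cap for this policy reached

        self._last_run[(policy, component)] = now
        runs.append(now)
        return True


tracker = CooldownTracker(cool_down_seconds=300, max_executions_per_hour=5)
first = tracker.try_execute("high_latency_restart", "api-service", now=0.0)
repeat = tracker.try_execute("high_latency_restart", "api-service", now=100.0)
other = tracker.try_execute("high_latency_restart", "db-service", now=100.0)
```

Passing `now` explicitly makes the gate easy to test; in normal use it falls back to `time.monotonic()`, which is not affected by wall-clock adjustments.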
226
- 3️⃣ Business Impact Quantification
227
- Calculates real-time business metrics:
228
- 💰 Estimated revenue loss (based on throughput drop)
229
- 👥 Affected user count (from error rate × throughput)
230
- ⏱️ Service degradation duration
231
- 📉 SLO breach severity
232
- 4️⃣ Vector-Based Incident Memory
233
- FAISS index stores 384-dimensional embeddings of incidents
234
- Semantic similarity search finds similar past issues
235
- Solution recommendation based on historical resolutions
236
- Thread-safe single-writer pattern with atomic saves
237
- 5️⃣ Predictive Analytics
238
- Time-series forecasting with 15-minute lookahead
239
- Trend detection (increasing/decreasing/stable)
240
- Time-to-failure estimates
241
- Risk classification (low/medium/high/critical)
242
- 🛠️ Configuration
243
- Environment Variables
244
- Create a .env file:
245
- # Optional: Hugging Face API token
246
- HF_TOKEN=your_hf_token_here
247
-
248
- # Data persistence
249
- DATA_DIR=./data
250
- INDEX_FILE=data/incident_vectors.index
251
- TEXTS_FILE=data/incident_texts.json
252
-
253
- # Application settings
254
- LOG_LEVEL=INFO
255
- MAX_REQUESTS_PER_MINUTE=60
256
- MAX_REQUESTS_PER_HOUR=500
257
-
258
- # Server
259
- HOST=0.0.0.0
260
- PORT=7860
261
- Custom Healing Policies
262
- Add your own policies in healing_policies.py:
263
- custom_policy = HealingPolicy(
264
- name="custom_high_latency",
265
- conditions=[
266
- PolicyCondition(
267
- metric="latency_p99",
268
- operator="gt",
269
- threshold=200.0
270
- )
271
- ],
272
- actions=[
273
- HealingAction.RESTART_CONTAINER,
274
- HealingAction.ALERT_TEAM
275
- ],
276
- priority=1,
277
- cool_down_seconds=300,
278
- max_executions_per_hour=5,
279
- enabled=True
280
- )
281
- 🐳 Docker Deployment
282
- Dockerfile
283
- FROM python:3.10-slim
284
-
285
- WORKDIR /app
286
-
287
- # Install system dependencies
288
- RUN apt-get update && apt-get install -y \
289
- gcc g++ && \
290
- rm -rf /var/lib/apt/lists/*
291
-
292
- # Copy and install Python dependencies
293
- COPY requirements.txt .
294
- RUN pip install --no-cache-dir -r requirements.txt
295
-
296
- # Copy application
297
- COPY . .
298
-
299
- # Create directories
300
- RUN mkdir -p data logs
301
-
302
- EXPOSE 7860
303
-
304
- CMD ["python", "app.py"]
305
- Docker Compose
306
- version: '3.8'
307
-
308
- services:
309
- arf:
310
- build: .
311
- ports:
312
- - "7860:7860"
313
- environment:
314
- - HF_TOKEN=${HF_TOKEN}
315
- - LOG_LEVEL=INFO
316
- volumes:
317
- - ./data:/app/data
318
- - ./logs:/app/logs
319
- restart: unless-stopped
320
- deploy:
321
- resources:
322
- limits:
323
- cpus: '4'
324
- memory: 4G
325
- Run:
326
- docker-compose up -d
327
- 🧪 Testing
328
- Run All Tests
329
- # Basic test run
330
- pytest tests/ -v
331
-
332
- # With coverage report
333
- pytest tests/ --cov --cov-report=html --cov-report=term-missing
334
-
335
- # Coverage summary
336
- # models.py 95% coverage
337
- # healing_policies.py 90% coverage
338
- # app.py 86% coverage
339
- # ──────────────────────────────────────
340
- # TOTAL 87% coverage
341
- Test Categories
342
- # Unit tests
343
- pytest tests/test_models.py -v
344
- pytest tests/test_policy_engine.py -v
345
-
346
- # Thread safety tests
347
- pytest tests/test_policy_engine.py::TestThreadSafety -v
348
-
349
- # Integration tests
350
- pytest tests/test_input_validation.py -v
351
- 📈 Performance Benchmarks
352
- Latency Breakdown (Intel i7, 16GB RAM)
353
- Component Time (p50) Time (p99)
354
- Input Validation 1.2ms 3.0ms
355
- Event Construction 4.8ms 10.0ms
356
- Detective Agent 18.3ms 35.0ms
357
- Diagnostician Agent 22.7ms 45.0ms
358
- Predictive Agent 41.2ms 85.0ms
359
- Policy Evaluation 19.5ms 38.0ms
360
- Vector Encoding 15.7ms 30.0ms
361
- Total ~100ms ~250ms
362
- Throughput
363
- Single instance: 100+ events/second
364
- With rate limiting: 60 events/minute per user
365
- Memory stable: ~250MB steady-state
366
- CPU usage: ~40-60% (4 cores)
367
- 📚 Documentation
368
- 📖 Technical Deep Dive - Architecture & algorithms
369
- 🔌 API Reference - Complete API documentation
370
- 🚀 Deployment Guide - Production deployment
371
- 🧪 Testing Guide - Test strategy & coverage
372
- 🤝 Contributing - How to contribute
373
- 🗺️ Roadmap
374
- v2.1 (Next Release)
375
- Distributed FAISS index (multi-node scaling)
376
- Prometheus/Grafana integration
377
- Slack/PagerDuty notifications
378
- Custom alerting rules engine
379
- v3.0 (Future)
380
- Reinforcement learning for policy optimization
381
- LSTM-based forecasting
382
- Graph neural networks for dependency analysis
383
- Federated learning for cross-org knowledge sharing
384
- 🤝 Contributing
385
- We welcome contributions! See CONTRIBUTING.md for guidelines.
386
- Ways to contribute:
387
- 🐛 Report bugs or security issues
388
- 💡 Propose new features or improvements
389
- 📝 Improve documentation
390
- 🧪 Add test coverage
391
- 🔧 Submit pull requests
392
- 📄 License
393
- MIT License - see LICENSE file for details.
394
- 🙏 Acknowledgments
395
- Built with:
396
- Gradio - Web UI framework
397
- FAISS - Vector similarity search
398
- Sentence-Transformers - Semantic embeddings
399
- Pydantic - Data validation
400
- Inspired by:
401
- Production reliability challenges at Fortune 500 companies
402
- SRE best practices from Google, Netflix, Amazon
403
- 📞 Contact & Support
404
- Author: Juan Petter (LGCY Labs)
405
-
406
- Email: petter2025us@outlook.com
407
-
408
- LinkedIn: linkedin.com/in/petterjuan
409
-
410
- Schedule Consultation: calendly.com/petter2025us/30min
411
- Need Help?
412
- 🐛 Report a Bug
413
- 💡 Request a Feature
414
- 💬 Start a Discussion
415
- ⭐ Show Your Support
416
- If this project helps you build more reliable systems, please consider:
417
- ⭐ Starring this repository
418
- 🐦 Sharing on social media
419
- 📝 Writing a blog post about your experience
420
- 💬 Contributing improvements back to the project
421
- 📊 Project Statistics
422
-
423
-
424
-
425
-
426
- For utopia...For money.
427
- Production-grade reliability engineering meets AI automation.
428
- Key Improvements Made:
429
- ✅ Better Structure - Clear sections with visual hierarchy
430
-
431
- ✅ Security Focus - Detailed CVE table with severity scores
432
-
433
- ✅ Performance Metrics - Before/after comparison tables
434
-
435
- ✅ Visual Architecture - ASCII diagrams for clarity
436
-
437
- ✅ Detailed Agent Descriptions - What each agent does with examples
438
-
439
- ✅ Quick Start Guide - Step-by-step installation with expected outputs
440
-
441
- ✅ Configuration Examples - .env file and custom policies
442
-
443
- ✅ Docker Support - Complete deployment instructions
444
-
445
- ✅ Performance Benchmarks - Real latency/throughput numbers
446
-
447
- ✅ Testing Guide - How to run tests with coverage
448
-
449
- ✅ Roadmap - Future plans clearly outlined
450
-
451
- ✅ Contributing Section - Encourage community involvement
452
-
453
- ✅ Contact Info - Multiple ways to get help
 
1
  ---
2
+ title: ARF v4 – Reliability Lab
3
  emoji: 🧠
4
  colorFrom: blue
5
  colorTo: purple
6
  sdk: gradio
7
+ sdk_version: 4.44.1
8
  app_file: app.py
9
  pinned: false
10
+ license: apache-2.0
11
+ short_description: ARF v4 Bayesian reliability demo
12
  ---
13
+
14
+ # 🧠 ARF v4 Reliability Lab
15
+
16
+ This Space hosts a live, interactive demo of the **Agentic Reliability Framework v4 (OSS edition)**. It showcases the core intelligence engine – a hybrid Bayesian + Hamiltonian Monte Carlo (HMC) system that evaluates infrastructure incidents and produces advisory healing recommendations.
17
+
18
+ **All outputs are advisory only – no execution.**
19
+
20
+ [![GitHub](https://img.shields.io/badge/GitHub-Repository-blue?logo=github)](https://github.com/petter2025us/agentic-reliability-framework)
21
+ [![Tutorial](https://img.shields.io/badge/📘-Tutorial-green)](https://github.com/petter2025us/agentic-reliability-framework/blob/main/TUTORIAL.md)
22
+ [![Contact](https://img.shields.io/badge/📧-Email-yellow)](mailto:petter2025us@outlook.com)
23
+
24
+ ---
25
+
26
+ ## 🚀 How It Works
27
+
28
+ The demo uses the `EnhancedReliabilityEngine` from the ARF v4 package. When you submit telemetry (component, latency, error rate, etc.), the engine:
29
+
30
+ 1. **Runs three specialised agents** in parallel:
31
+ - **Detective** anomaly detection and pattern recognition
32
+ - **Diagnostician** root cause analysis
33
+ - **Predictive** forecasting and trend detection
34
+
35
+ 2. **Computes a risk score** using:
36
+ - Online **Bayesian conjugate priors** (Beta‑Binomial) per action category
37
+ - Offline **Hamiltonian Monte Carlo (HMC)** with NUTS (if trained) for complex patterns
38
+ - A weighted blend of both for the final score
39
+
40
+ 3. **Applies deterministic policy thresholds** (DPT) to recommend: `APPROVE`, `DENY`, or `ESCALATE`.
41
+
42
+ 4. **Returns a JSON** containing the risk score, agent insights, healing actions, and (if configured) a Claude‑generated executive summary.
43
+
44
+ ---
45
+
46
+ ## 🧪 Try It Yourself
47
+
48
+ Just fill in the form on the left and click **Analyze**. The output will appear as a formatted JSON.
49
+
50
+ Example input:
51
+ - **Component**: `api-service`
52
+ - **Latency P99**: `250 ms`
53
+ - **Error Rate**: `0.08`
54
+ - **Throughput**: `1000 req/s`
55
+ - **CPU Utilization**: `0.7`
56
+ - **Memory Utilization**: `0.6`
57
+
58
+ Expected risk score: ~0.12 (low) → `APPROVE`.
59
+
60
+ ---
61
+
62
+ ## 📦 How This Space Is Built
63
+
64
+ - **Base image**: `python:3.10` (via Dockerfile)
65
+ - **Dependencies**:
66
+ - `git+https://github.com/petter2025us/agentic-reliability-framework.git@v4.0.0`
67
+ - `gradio>=4.0.0`
68
+ - **Source code**: a minimal `app.py` that imports and runs `EnhancedReliabilityEngine`.
69
+
70
+ All code is open source and available in the [main repository](https://github.com/petter2025us/agentic-reliability-framework).
71
+
72
+ ---
73
+
74
+ ## 🏃 Run Locally
75
+
76
+ You can run the exact same demo on your own machine:
77
+
78
+ ```bash
79
+ git clone https://github.com/petter2025us/agentic-reliability-framework.git
80
  cd agentic-reliability-framework
81
+ python -m venv venv
82
+ source venv/bin/activate
83
+ pip install -e .
84
+ pip install gradio
85
+ python examples/app.py # or copy the Space's app.py
86
+ ```
87
+ Then open `http://localhost:7860`.
88
+
89
+ ---
90
+
91
+ ## 📚 Learn More
92
+
93
+ - 📘 [Full Tutorial](https://github.com/petter2025us/agentic-reliability-framework/blob/main/TUTORIAL.md)
94
+ - 🐙 [GitHub Repository](https://github.com/petter2025us/agentic-reliability-framework)
95
+ - 📖 [Contributing Guidelines](https://github.com/petter2025us/agentic-reliability-framework/blob/main/CONTRIBUTING.md)
96
+ - 💼 [Enterprise Inquiries](mailto:petter2025us@outlook.com)
97
+
98
+ ---
99
+
100
+ ## 📬 Contact
101
+
102
+ - **Email**: [petter2025us@outlook.com](mailto:petter2025us@outlook.com)
103
+ - **LinkedIn**: [petterjuan](https://linkedin.com/in/petterjuan)
104
+ - **Book a call**: [Calendly – 30 min](https://calendly.com/petter2025us/30min)
105
+
106
+ ---
107
+
108
 
109
+ *Powered by ARF v4 – Bayesian reliability for autonomous infrastructure.*