---
title: matrix-ai
emoji: 🧠
colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false
---
# matrix-ai
matrix-ai is the AI planning microservice for the Matrix EcoSystem. It generates short, low-risk, auditable remediation plans from compact health context provided by Matrix Guardian, and also exposes a lightweight RAG Q&A over MatrixHub documents.
It is optimized for Hugging Face Spaces / Inference Endpoints, but also runs locally and in containers.
## Endpoints
- `POST /v1/plan` - internal API for Matrix Guardian; returns a safe JSON plan.
- `POST /v1/chat` - Q&A (RAG-assisted) over MatrixHub content; returns a single answer.
- `GET /v1/chat/stream` - SSE token stream for interactive chat (production-hardened).
- `POST /v1/chat/stream` - same as `GET`, but with JSON payloads.
The service emphasizes safety, performance, and auditability:
- Strict, schema-validated JSON plans (bounded steps, risk label, rationale)
- PII redaction before calling upstream model endpoints
- Multi-provider LLM cascade: GROQ → Gemini → HF Router (Zephyr → Mistral) with automatic failover
- Production-safe SSE streaming & middleware (no body buffering, trace IDs, CORS, gzip)
- Exponential backoff, short timeouts, and structured JSON logs
- Per-IP rate limiting; optional `ADMIN_TOKEN` for private deployments
- RAG with SentenceTransformers (optional CrossEncoder re-ranker) over `data/kb.jsonl`
- ETag & response caching for non-mutating reads (where applicable)
Last Updated: 2025-10-01 (UTC)
## Architecture (at a glance)
```mermaid
flowchart LR
  subgraph Client [Matrix Operators / Observers]
  end

  Client -->|monitor| HubAPI[Matrix-Hub API]
  Guardian[Matrix-Guardian<br/>control plane] -->|/v1/plan| AI[matrix-ai<br/>FastAPI service]
  Guardian -->|/status, /apps, ...| HubAPI
  HubAPI <-->|SQL| DB[MatrixDB<br/>Postgres]

  subgraph LLM [LLM Providers fallback cascade]
    GROQ[Groq<br/>llama-3.1-8b-instant]
    GEM[Google Gemini<br/>gemini-2.5-flash]
    HF[Hugging Face Router<br/>Zephyr → Mistral]
  end

  AI -->|primary| GROQ
  AI -->|fallback| GEM
  AI -->|final| HF

  classDef svc fill:#0ea5e9,stroke:#0b4,stroke-width:1,color:#fff
  classDef db fill:#f59e0b,stroke:#0b4,stroke-width:1,color:#fff
  class Guardian,AI,HubAPI svc
  class DB db
```
### Sequence: POST /v1/plan (planning)
```mermaid
sequenceDiagram
  participant G as Matrix-Guardian
  participant A as matrix-ai
  participant P as Provider Cascade
  G->>A: POST /v1/plan { context, constraints }
  A->>A: redact PII, validate payload (schema)
  A->>P: generate plan (timeouts, retries)
  alt Provider available
    P-->>A: model output text
  else Provider unavailable/limited
    P-->>A: fallback to next provider
  end
  A->>A: parse → strict JSON plan (safe defaults if needed)
  A-->>G: 200 { plan_id, steps[], risk, explanation }
```
### Sequence: GET/POST /v1/chat/stream (SSE chat)
```mermaid
sequenceDiagram
  participant C as Client (UI)
  participant A as matrix-ai (SSE-safe middleware)
  participant P as Provider Cascade
  C->>A: GET /v1/chat/stream?query=...
  A->>P: chat(messages, stream=True)
  loop token chunks
    P-->>A: delta (text)
    A-->>C: SSE data: {"delta": "..."}
  end
  A-->>C: SSE data: [DONE]
```
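For reference, the stream above can be consumed with a plain SSE reader. The sketch below assumes the service runs locally on port 7860 and uses `httpx` (not a stated dependency of this repo); the `delta`/`[DONE]` framing follows the diagram.

```python
# Minimal SSE consumer for GET /v1/chat/stream, matching the wire format above.
# Assumes a local server on port 7860; adjust base_url for your deployment.
import json
import httpx

def stream_chat(query: str, base_url: str = "http://localhost:7860") -> str:
    answer = []
    with httpx.stream(
        "GET",
        f"{base_url}/v1/chat/stream",
        params={"query": query},
        timeout=httpx.Timeout(60.0, connect=5.0),
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith("data:"):
                continue  # skip keep-alives / comments
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":
                break
            delta = json.loads(payload).get("delta", "")
            print(delta, end="", flush=True)
            answer.append(delta)
    return "".join(answer)

if __name__ == "__main__":
    stream_chat("What is MatrixHub?")
```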
## Quick Start (Local Development)
```bash
# 1) Create venv
python3 -m venv .venv
source .venv/bin/activate

# 2) Install deps
pip install -r requirements.txt

# 3) Configure env (local only; use Space Secrets in prod)
cp configs/.env.example configs/.env
# Edit configs/.env with your keys (do NOT commit):
# GROQ_API_KEY=...
# GOOGLE_API_KEY=...
# HF_TOKEN=...

# 4) Run
uvicorn app.main:app --host 0.0.0.0 --port 7860
```
OpenAPI docs: http://localhost:7860/docs
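A quick way to confirm the server is up is to read the generated OpenAPI spec. This is a minimal sketch, assuming `httpx` is installed and the dev server from step 4 is running:

```python
# Local smoke test: confirm the service is up and list its routes
# from the FastAPI-generated OpenAPI document.
import httpx

resp = httpx.get("http://localhost:7860/openapi.json", timeout=5.0)
resp.raise_for_status()
spec = resp.json()
print(spec["info"]["title"], spec["info"].get("version", ""))
for path, methods in sorted(spec["paths"].items()):
    print(f"{', '.join(m.upper() for m in methods)}  {path}")
```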
## Provider Cascade (GROQ → Gemini → HF Router)
matrix-ai uses a production-ready multi-provider orchestrator:

- Groq (`llama-3.1-8b-instant`) - free, fast, great latency
- Gemini (`gemini-2.5-flash`) - free tier
- HF Router - `HuggingFaceH4/zephyr-7b-beta` → `mistralai/Mistral-7B-Instruct-v0.2`

Order is configurable via `provider_order`. Providers are skipped automatically if misconfigured or if quotas/credits are exceeded.

Streaming: Groq streams true tokens; Gemini/HF may yield one chunk (normalized to SSE).
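Conceptually, the orchestrator walks the configured order and falls through on errors or missing credentials. The sketch below is illustrative only; the `providers` mapping of callables is a hypothetical stand-in for the real provider clients, not the service's actual code.

```python
# Illustrative failover loop over an ordered provider cascade.
# Each callable in `providers` is expected to raise on quota/credit/timeout errors.
import logging
import time
from typing import Callable, Dict, Optional, Sequence

log = logging.getLogger("cascade")

def generate_with_cascade(
    prompt: str,
    providers: Dict[str, Callable[[str], str]],
    order: Sequence[str] = ("groq", "gemini", "router"),
) -> str:
    """Try each configured provider in order; skip missing ones, fall through on errors."""
    last_err: Optional[Exception] = None
    for name in order:
        call = providers.get(name)
        if call is None:
            log.info("Provider %r not configured; skipping", name)
            continue
        started = time.monotonic()
        try:
            text = call(prompt)
            log.info("Provider %r succeeded in %.2fs", name, time.monotonic() - started)
            return text
        except Exception as exc:  # quota exhausted, timeout, 5xx, ...
            last_err = exc
            log.warning("Provider %r failed (%s); trying next provider", name, exc)
    raise RuntimeError("all providers in the cascade failed") from last_err
```

Skipping unconfigured providers up front keeps a missing API key from counting as a failure, which matches the "skipped automatically if misconfigured" behavior described above.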
## Configuration
All options can be set via environment variables (Space Secrets in HF), `.env` for local use, and/or `configs/settings.yaml`.
### `configs/settings.yaml` (excerpt)
```yaml
model:
  # HF router defaults (used at the last step)
  name: "HuggingFaceH4/zephyr-7b-beta"
  fallback: "mistralai/Mistral-7B-Instruct-v0.2"
  provider: "featherless-ai"
  max_new_tokens: 256
  temperature: 0.2

  # Provider-specific defaults (free-tier friendly)
  groq_model: "llama-3.1-8b-instant"
  gemini_model: "gemini-2.5-flash"

  # Try providers in this order
  provider_order:
    - groq
    - gemini
    - router

  # Switch to the multi-provider path
  chat_backend: "multi"
  chat_stream: true

limits:
  rate_per_min: 60
  cache_size: 256

rag:
  index_dataset: ""
  top_k: 4

matrixhub:
  base_url: "https://api.matrixhub.io"

security:
  admin_token: ""
```
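One plausible way these YAML defaults combine with the environment variables below (YAML as baseline, env/Space Secrets overriding) is sketched here; the actual loader in `app/` may differ, and the env-to-key mapping shown is an assumption for illustration.

```python
# Illustrative settings loader: YAML defaults overridden by environment variables.
# Requires PyYAML; the env var names mirror the table below.
import os
import yaml

def load_settings(path: str = "configs/settings.yaml") -> dict:
    with open(path, "r", encoding="utf-8") as fh:
        cfg = yaml.safe_load(fh) or {}

    model = cfg.setdefault("model", {})
    # Environment (e.g. Space Secrets) wins over YAML defaults.
    if os.getenv("PROVIDER_ORDER"):
        model["provider_order"] = os.environ["PROVIDER_ORDER"].split(",")
    if os.getenv("GROQ_MODEL"):
        model["groq_model"] = os.environ["GROQ_MODEL"]
    if os.getenv("CHAT_STREAM"):
        model["chat_stream"] = os.environ["CHAT_STREAM"].lower() == "true"
    return cfg

if __name__ == "__main__":
    print(load_settings()["model"]["provider_order"])
```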
## Environment variables
| Variable | Default | Purpose |
|---|---|---|
| `GROQ_API_KEY` | (none) | API key for Groq (primary) |
| `GOOGLE_API_KEY` | (none) | API key for Gemini |
| `HF_TOKEN` | (none) | Token for Hugging Face Inference Router |
| `GROQ_MODEL` | `llama-3.1-8b-instant` | Override Groq model |
| `GEMINI_MODEL` | `gemini-2.5-flash` | Override Gemini model |
| `MODEL_NAME` | `HuggingFaceH4/zephyr-7b-beta` | HF Router primary model |
| `MODEL_FALLBACK` | `mistralai/Mistral-7B-Instruct-v0.2` | HF Router fallback |
| `MODEL_PROVIDER` | `featherless-ai` | HF provider tag (`model:provider`) |
| `PROVIDER_ORDER` | `groq,gemini,router` | Comma-separated cascade order |
| `CHAT_STREAM` | `true` | Enable streaming where available |
| `RATE_LIMITS` | `60` | Per-IP requests/min (middleware) |
| `ADMIN_TOKEN` | (none) | Gate `/v1/plan` & `/v1/chat*` (Bearer) |
| `RAG_KB_PATH` | `data/kb.jsonl` | Path to KB (if present) |
| `RAG_RERANK` | `true` | Enable CrossEncoder re-ranker (GPU-aware) |
| `LOG_LEVEL` | `INFO` | Structured JSON logs level |
> **Never commit real API keys.** Use Space Secrets / Vault in production.
## API
### POST /v1/plan
Description: Generate a short, low-risk remediation plan from a compact app health context.
**Headers**

```
Content-Type: application/json
Authorization: Bearer <ADMIN_TOKEN>   # required if ADMIN_TOKEN is set
```
**Request (example)**

```json
{
  "context": {
    "entity_uid": "matrix-ai",
    "health": {"score": 0.64, "status": "degraded", "last_checked": "2025-10-01T00:00:00Z"},
    "recent_checks": [
      {"check": "http", "result": "fail", "latency_ms": 900, "ts": "2025-10-01T00:00:00Z"}
    ]
  },
  "constraints": {"max_steps": 3, "risk": "low"}
}
```
**Response (example)**

```json
{
  "plan_id": "pln_01J9YX2H6ZP9R2K9THT2J9F7G4",
  "risk": "low",
  "steps": [
    {"action": "reprobe", "target": "https://service/health", "retries": 2},
    {"action": "pin_lkg", "entity_uid": "matrix-ai"}
  ],
  "explanation": "Transient HTTP failures observed; re-probe and pin to last-known-good if still failing."
}
```
**Status codes**

- `200` - plan generated
- `400` - invalid payload (schema)
- `401`/`403` - missing/invalid bearer (only if `ADMIN_TOKEN` is configured)
- `429` - rate limited
- `502` - upstream model error after retries
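Putting the pieces together, a small caller can send the request above and map the listed status codes. The base URL and `ADMIN_TOKEN` handling are deployment-specific, and the trimmed `context` payload is only illustrative.

```python
# Minimal client for POST /v1/plan using the request/response shapes shown above.
# The Authorization header is only needed when ADMIN_TOKEN is configured server-side.
import os
import httpx

BASE_URL = os.getenv("MATRIX_AI_URL", "http://localhost:7860")
ADMIN_TOKEN = os.getenv("ADMIN_TOKEN", "")

def request_plan(context: dict, constraints: dict) -> dict:
    headers = {}
    if ADMIN_TOKEN:
        headers["Authorization"] = f"Bearer {ADMIN_TOKEN}"
    resp = httpx.post(
        f"{BASE_URL}/v1/plan",
        json={"context": context, "constraints": constraints},
        headers=headers,
        timeout=30.0,
    )
    if resp.status_code == 429:
        raise RuntimeError("rate limited; retry later")
    if resp.status_code == 502:
        raise RuntimeError("upstream model error after retries")
    resp.raise_for_status()  # surfaces 400 / 401 / 403
    return resp.json()

if __name__ == "__main__":
    plan = request_plan(
        context={"entity_uid": "matrix-ai", "health": {"score": 0.64, "status": "degraded"}},
        constraints={"max_steps": 3, "risk": "low"},
    )
    print(plan["plan_id"], plan["risk"])
```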
### POST /v1/chat
Given a query about MatrixHub, returns an answer with citations if a local KB is configured at `RAG_KB_PATH`. Uses the same provider cascade.
### GET /v1/chat/stream & POST /v1/chat/stream

Server-Sent Events (SSE) streaming of token deltas. Production-safe middleware ensures no body buffering and proper headers (`Cache-Control: no-cache`, `X-Trace-Id`, `X-Process-Time-Ms`, `Server-Timing`).
## Safety & Reliability
- PII redaction - tokens/emails removed from prompts as a pre-filter
- Strict schema - JSON plan parsing with safe defaults; rejects unsafe shapes
- Time-boxed - short timeouts and bounded retries to providers
- Rate-limited - per-IP fixed window (configurable)
- Structured logs - JSON logs with `trace_id` for correlation
- SSE-safe middleware - never consumes streaming bodies; avoids Starlette "No response returned" pitfalls
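To make the pre-filter idea concrete, a regex-based scrubber along these lines could mask emails and API-key-shaped tokens before a prompt goes upstream. This illustrates the approach only; the service's actual redaction rules may differ.

```python
# Illustrative PII pre-filter: masks emails and bearer-style tokens in prompts.
# Patterns here are examples, not the service's actual redaction rules.
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TOKEN_RE = re.compile(r"\b(?:sk|hf|gsk|ghp)_[A-Za-z0-9]{16,}\b")

def redact(text: str) -> str:
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    text = TOKEN_RE.sub("[REDACTED_TOKEN]", text)
    return text

print(redact("Contact ops@example.com, key gsk_abcdefghijklmnop1234"))
```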
## RAG (Optional)
- Embeddings: `sentence-transformers/all-MiniLM-L6-v2` (GPU-aware)
- Re-ranking: optional `cross-encoder/ms-marco-MiniLM-L-2-v2` (GPU-aware)
- KB: `data/kb.jsonl` (one JSON per line: `{ "text": "...", "source": "..." }`)
- Tunable: `rag.top_k`, `RAG_RERANK`, `RAG_KB_PATH`
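The sketch below shows how a KB in this format can be queried with the embedding model listed above (re-ranking omitted); it illustrates the retrieval idea only and is not the service's own RAG pipeline.

```python
# Illustrative retrieval over data/kb.jsonl with the embedding model named above.
import json
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

with open("data/kb.jsonl", "r", encoding="utf-8") as fh:
    docs = [json.loads(line) for line in fh if line.strip()]

corpus_emb = model.encode([d["text"] for d in docs], convert_to_tensor=True)

def retrieve(query: str, top_k: int = 4):
    query_emb = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=top_k)[0]
    return [(docs[h["corpus_id"]]["source"], h["score"]) for h in hits]

print(retrieve("How do I install an app from MatrixHub?"))
```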
## Deployments
### Hugging Face Spaces (recommended for demo)
1. Push repo to a new Space (FastAPI).
2. Settings → Secrets:
   - `GROQ_API_KEY`, `GOOGLE_API_KEY`, `HF_TOKEN` (as needed by the cascade)
   - `ADMIN_TOKEN` (optional; gates `/v1/plan` & `/v1/chat*`)
3. Choose hardware (CPU is fine; GPU improves RAG throughput and the cross-encoder).
4. The Space runs `uvicorn` and exposes all endpoints.
### Containers / Cloud
- Use a minimal Python base; install with `pip install -r requirements.txt`.
- Expose port `7860` (configurable).
- Set secrets via your orchestrator (Kubernetes Secrets, ECS, etc.).
- Scale with multiple Uvicorn workers; put the service behind an HTTP proxy that supports streaming (e.g., nginx with `proxy_buffering off` for SSE).
## Observability
- Trace IDs (`X-Trace-Id`) attached per request and logged
- Timing headers: `X-Process-Time-Ms`, `Server-Timing`
- Provider selection logs (e.g., `Provider 'groq' succeeded in 0.82s`)
- Metrics endpoints can be added behind an auth wall (Prometheus-friendly)
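As a rough picture of what the trace and timing headers imply, a FastAPI middleware along these lines could attach them. matrix-ai ships its own SSE-safe middleware, so treat this purely as an illustration.

```python
# Rough sketch of trace/timing middleware of the kind described above.
# Illustrative only; not the middleware shipped with matrix-ai.
import time
import uuid
from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def add_trace_headers(request: Request, call_next):
    trace_id = request.headers.get("X-Trace-Id", uuid.uuid4().hex)
    started = time.perf_counter()
    response = await call_next(request)  # does not buffer (possibly streaming) bodies
    elapsed_ms = (time.perf_counter() - started) * 1000
    response.headers["X-Trace-Id"] = trace_id
    response.headers["X-Process-Time-Ms"] = f"{elapsed_ms:.1f}"
    response.headers["Server-Timing"] = f"app;dur={elapsed_ms:.1f}"
    return response
```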
## Development Notes
- Keep `/v1/plan` internal behind a network boundary or `ADMIN_TOKEN`.
- Validate payloads rigorously (Pydantic) and write contract tests for the plan schema.
- If you switch models, re-run golden tests to guard against plan drift.
- Avoid logging sensitive data; logs are structured JSON only.
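A contract test in the spirit of these notes might pin the plan shape with Pydantic. The field set and risk values below are inferred from the response example in this README, not taken from the service's source.

```python
# Contract-test sketch for the plan schema, inferred from the response example above.
# The real Pydantic models live in the service; this only illustrates the idea.
from typing import List, Literal
from pydantic import BaseModel, Field

class PlanStep(BaseModel):
    action: str
    # Remaining step fields (target, retries, entity_uid, ...) vary by action.
    model_config = {"extra": "allow"}

class Plan(BaseModel):
    plan_id: str
    risk: Literal["low", "medium", "high"]  # assumed label set; README only shows "low"
    steps: List[PlanStep] = Field(max_length=3)
    explanation: str

def test_plan_example_validates():
    Plan.model_validate({
        "plan_id": "pln_01J9YX2H6ZP9R2K9THT2J9F7G4",
        "risk": "low",
        "steps": [{"action": "reprobe", "target": "https://service/health", "retries": 2}],
        "explanation": "Transient HTTP failures observed; re-probe.",
    })
```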
## License
Apache-2.0
> **Tip:** The cascade order is controlled by `provider_order` (`groq,gemini,router`). If Groq is rate-limited or missing, the service automatically falls back to Gemini, then the Hugging Face Router (Zephyr → Mistral). Streaming works out of the box and is middleware-safe.