---
title: matrix-ai
emoji: 🧠
colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false
---

matrix-ai

matrix-ai is the AI planning microservice for the Matrix EcoSystem. It generates short, low-risk, auditable remediation plans from compact health context provided by Matrix Guardian, and also exposes a lightweight RAG Q&A over MatrixHub documents.

It is optimized for Hugging Face Spaces / Inference Endpoints, but also runs locally and in containers.

Endpoints

  • POST /v1/plan – internal API for Matrix Guardian: returns a safe JSON plan.
  • POST /v1/chat – Q&A (RAG-assisted) over MatrixHub content; returns a single answer.
  • GET /v1/chat/stream – SSE token stream for interactive chat (production-hardened).
  • POST /v1/chat/stream – same as GET but with JSON payloads.

The service emphasizes safety, performance, and auditability:

  • Strict, schema-validated JSON plans (bounded steps, risk label, rationale)
  • PII redaction before calling upstream model endpoints
  • Multi-provider LLM cascade: GROQ → Gemini → HF Router (Zephyr → Mistral) with automatic failover
  • Production-safe SSE streaming & middleware (no body buffering, trace IDs, CORS, gzip)
  • Exponential backoff, short timeouts, and structured JSON logs
  • Per-IP rate limiting; optional ADMIN_TOKEN for private deployments
  • RAG with SentenceTransformers (optional CrossEncoder re-ranker) over data/kb.jsonl
  • ETag & response caching for non-mutating reads (where applicable)

Last Updated: 2025-10-01 (UTC)


Architecture (at a glance)

flowchart LR
    subgraph Client [Matrix Operators / Observers]
    end

    Client -->|monitor| HubAPI[Matrix-Hub API]
    Guardian[Matrix-Guardian<br/>control plane] -->|/v1/plan| AI[matrix-ai<br/>FastAPI service]
    Guardian -->|/status,/apps,...| HubAPI
    HubAPI <-->|SQL| DB[MatrixDB<br/>Postgres]

    subgraph LLM [LLM Providers fallback cascade]
        GROQ[Groq<br/>llama-3.1-8b-instant]
        GEM[Google Gemini<br/>gemini-2.5-flash]
        HF[Hugging Face Router<br/>Zephyr → Mistral]
    end

    AI -->|primary| GROQ
    AI -->|fallback| GEM
    AI -->|final| HF

    classDef svc fill:#0ea5e9,stroke:#0b4,stroke-width:1,color:#fff
    classDef db fill:#f59e0b,stroke:#0b4,stroke-width:1,color:#fff
    class Guardian,AI,HubAPI svc
    class DB db

Sequence: POST /v1/plan (planning)

sequenceDiagram
    participant G as Matrix-Guardian
    participant A as matrix-ai
    participant P as Provider Cascade

    G->>A: POST /v1/plan { context, constraints }
    A->>A: redact PII, validate payload (schema)
    A->>P: generate plan (timeouts, retries)
    alt Provider available
        P-->>A: model output text
    else Provider unavailable/limited
        P-->>A: fallback to next provider
    end
    A->>A: parse → strict JSON plan (safe defaults if needed)
    A-->>G: 200 { plan_id, steps[], risk, explanation }

Sequence: GET/POST /v1/chat/stream (SSE chat)

sequenceDiagram
  participant C as Client (UI)
  participant A as matrix-ai (SSE-safe middleware)
  participant P as Provider Cascade

  C->>A: GET /v1/chat/stream?query=...
  A->>P: chat(messages, stream=True)
  loop token chunks
    P-->>A: delta (text)
    A-->>C: SSE data: {"delta": "..."}
  end
  A-->>C: SSE data: [DONE]


Quick Start (Local Development)

# 1) Create venv
python3 -m venv .venv
source .venv/bin/activate

# 2) Install deps
pip install -r requirements.txt

# 3) Configure env (local only; use Space Secrets in prod)
cp configs/.env.example configs/.env
# Edit configs/.env with your keys (do NOT commit):
# GROQ_API_KEY=...
# GOOGLE_API_KEY=...
# HF_TOKEN=...

# 4) Run
uvicorn app.main:app --host 0.0.0.0 --port 7860

OpenAPI docs: http://localhost:7860/docs
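
A quick way to confirm the service is up is to call the chat endpoint from Python. This is a minimal sketch: it assumes the server is reachable on localhost:7860 and that /v1/chat accepts a {"query": ...} body (the field name here is an assumption; check /docs for the exact request schema).

# smoke_test.py - minimal local check (the "query" field name is an assumption; see /docs)
import requests

resp = requests.post(
    "http://localhost:7860/v1/chat",
    json={"query": "What is MatrixHub?"},
    # headers={"Authorization": "Bearer <ADMIN_TOKEN>"},  # only if ADMIN_TOKEN is set
    timeout=30,
)
print(resp.status_code)
print(resp.json())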


Provider Cascade (GROQ → Gemini → HF Router)

matrix-ai uses a production-ready multi-provider orchestrator:

  1. Groq (llama-3.1-8b-instant) – free tier, very low latency
  2. Gemini (gemini-2.5-flash) – free tier
  3. HF Router – HuggingFaceH4/zephyr-7b-beta → mistralai/Mistral-7B-Instruct-v0.2

Order is configurable via provider_order. Providers are skipped automatically if misconfigured or if quotas/credits are exceeded.

Streaming: Groq streams true tokens; Gemini/HF may yield one chunk (normalized to SSE).
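
Conceptually, the failover works like a loop over provider_order that returns the first successful completion. The sketch below is illustrative only and is not the service's actual orchestrator; the client callables for Groq/Gemini/HF are stand-ins you would replace with real client calls.

# cascade_sketch.py - illustrative failover loop; the client callables are hypothetical stand-ins
from typing import Callable, Dict, List, Optional

def generate_with_cascade(
    prompt: str,
    provider_order: List[str],
    clients: Dict[str, Callable[[str], str]],
) -> str:
    last_error: Optional[Exception] = None
    for name in provider_order:
        client = clients.get(name)
        if client is None:           # provider not configured -> skip it
            continue
        try:
            return client(prompt)    # first provider that answers wins
        except Exception as exc:     # timeouts, quota/credit errors fall through to the next
            last_error = exc
    raise RuntimeError(f"all providers failed: {last_error}")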


Configuration

All options can be set via environment variables (Space Secrets in HF), .env for local use, and/or configs/settings.yaml.

configs/settings.yaml (excerpt)

model:
  # HF router defaults (used at the last step)
  name: "HuggingFaceH4/zephyr-7b-beta"
  fallback: "mistralai/Mistral-7B-Instruct-v0.2"
  provider: "featherless-ai"
  max_new_tokens: 256
  temperature: 0.2

  # Provider-specific defaults (free-tier friendly)
  groq_model: "llama-3.1-8b-instant"
  gemini_model: "gemini-2.5-flash"

# Try providers in this order
provider_order:
  - groq
  - gemini
  - router

# Switch to the multi-provider path
chat_backend: "multi"
chat_stream: true

limits:
  rate_per_min: 60
  cache_size: 256

rag:
  index_dataset: ""
  top_k: 4

matrixhub:
  base_url: "https://api.matrixhub.io"

security:
  admin_token: ""

Environment variables

| Variable | Default | Purpose |
|----------|---------|---------|
| GROQ_API_KEY | — | API key for Groq (primary provider) |
| GOOGLE_API_KEY | — | API key for Gemini |
| HF_TOKEN | — | Token for the Hugging Face Inference Router |
| GROQ_MODEL | llama-3.1-8b-instant | Override the Groq model |
| GEMINI_MODEL | gemini-2.5-flash | Override the Gemini model |
| MODEL_NAME | HuggingFaceH4/zephyr-7b-beta | HF Router primary model |
| MODEL_FALLBACK | mistralai/Mistral-7B-Instruct-v0.2 | HF Router fallback model |
| MODEL_PROVIDER | featherless-ai | HF provider tag (model:provider) |
| PROVIDER_ORDER | groq,gemini,router | Comma-separated cascade order |
| CHAT_STREAM | true | Enable streaming where available |
| RATE_LIMITS | 60 | Per-IP requests per minute (middleware) |
| ADMIN_TOKEN | — | Gate /v1/plan & /v1/chat* (Bearer auth) |
| RAG_KB_PATH | data/kb.jsonl | Path to the KB file (if present) |
| RAG_RERANK | true | Enable the CrossEncoder re-ranker (GPU-aware) |
| LOG_LEVEL | INFO | Level for structured JSON logs |

Never commit real API keys. Use Space Secrets / Vault in production.
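
The layering of environment variables over configs/settings.yaml can be pictured with the simplified loader below. This is only an illustration of the precedence described above (env wins over YAML), assuming PyYAML is installed; the real service may resolve settings differently.

# config_sketch.py - simplified env-over-YAML precedence (illustration, not the service's loader)
import os
import yaml  # PyYAML

def load_settings(path: str = "configs/settings.yaml") -> dict:
    with open(path, "r", encoding="utf-8") as fh:
        cfg = yaml.safe_load(fh) or {}
    cfg.setdefault("model", {})
    # Environment variables (e.g. Space Secrets) override the YAML defaults.
    cfg["model"]["name"] = os.getenv("MODEL_NAME", cfg["model"].get("name"))
    order = os.getenv("PROVIDER_ORDER", ",".join(cfg.get("provider_order", [])))
    cfg["provider_order"] = [p.strip() for p in order.split(",") if p.strip()]
    return cfg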


API

POST /v1/plan

Description: Generate a short, low-risk remediation plan from a compact app health context.

Headers

Content-Type: application/json
Authorization: Bearer <ADMIN_TOKEN>   # required if ADMIN_TOKEN set

Request (example)

{
  "context": {
    "entity_uid": "matrix-ai",
    "health": {"score": 0.64, "status": "degraded", "last_checked": "2025-10-01T00:00:00Z"},
    "recent_checks": [
      {"check": "http", "result": "fail", "latency_ms": 900, "ts": "2025-10-01T00:00:00Z"}
    ]
  },
  "constraints": {"max_steps": 3, "risk": "low"}
}

Response (example)

{
  "plan_id": "pln_01J9YX2H6ZP9R2K9THT2J9F7G4",
  "risk": "low",
  "steps": [
    {"action": "reprobe", "target": "https://service/health", "retries": 2},
    {"action": "pin_lkg", "entity_uid": "matrix-ai"}
  ],
  "explanation": "Transient HTTP failures observed; re-probe and pin to last-known-good if still failing."
}

Status codes

  • 200 – plan generated
  • 400 – invalid payload (schema)
  • 401/403 – missing/invalid bearer (only if ADMIN_TOKEN configured)
  • 429 – rate limited
  • 502 – upstream model error after retries
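
A minimal Python client for this endpoint, using the request and response shapes shown above; the base URL and token value are placeholders for your deployment.

# plan_client.py - call /v1/plan with the documented request shape
import requests

BASE_URL = "http://localhost:7860"   # placeholder; point at your deployment
ADMIN_TOKEN = "change-me"            # only required if ADMIN_TOKEN is configured

payload = {
    "context": {
        "entity_uid": "matrix-ai",
        "health": {"score": 0.64, "status": "degraded",
                   "last_checked": "2025-10-01T00:00:00Z"},
        "recent_checks": [
            {"check": "http", "result": "fail", "latency_ms": 900,
             "ts": "2025-10-01T00:00:00Z"}
        ],
    },
    "constraints": {"max_steps": 3, "risk": "low"},
}

resp = requests.post(
    f"{BASE_URL}/v1/plan",
    json=payload,
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
plan = resp.json()
print(plan["risk"], [step["action"] for step in plan["steps"]])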

POST /v1/chat

Given a query about MatrixHub, the endpoint returns a single answer, with citations when a local KB is configured at RAG_KB_PATH. It uses the same provider cascade.

GET /v1/chat/stream & POST /v1/chat/stream

Server-Sent Events (SSE) streaming of token deltas. Production-safe middleware ensures no body buffering and proper headers (Cache-Control: no-cache, X-Trace-Id, X-Process-Time-Ms, Server-Timing).
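
Consuming the stream from Python looks like the sketch below: read the SSE lines, parse each data: payload as JSON, and stop at [DONE]. It follows the delta format shown in the sequence diagram above; adjust the query text and base URL for your deployment.

# sse_client.py - consume GET /v1/chat/stream (delta format as in the sequence diagram)
import json
import requests

with requests.get(
    "http://localhost:7860/v1/chat/stream",
    params={"query": "How do I install a MatrixHub app?"},
    headers={"Accept": "text/event-stream"},
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for raw in resp.iter_lines(decode_unicode=True):
        if not raw or not raw.startswith("data:"):
            continue                       # skip blank keep-alive lines and other SSE fields
        data = raw[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)           # {"delta": "..."}
        print(chunk.get("delta", ""), end="", flush=True)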


Safety & Reliability

  • PII redaction – tokens/emails removed from prompts as a pre-filter
  • Strict schema – JSON plan parsing with safe defaults; rejects unsafe shapes
  • Time-boxed – short timeouts and bounded retries to providers
  • Rate-limited – per-IP fixed window (configurable)
  • Structured logs – JSON logs with trace_id for correlation
  • SSE-safe middleware – never consumes streaming bodies; avoids Starlette "No response returned" pitfalls
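
To make the redaction pre-filter concrete, here is a small sketch of the kind of rule it applies. The actual patterns used by matrix-ai are not published; the regexes below are illustrative assumptions only.

# redaction_sketch.py - illustrative pre-filter; patterns are assumptions, not the service's rules
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
TOKEN_RE = re.compile(r"\b(?:sk|hf|gsk)_[A-Za-z0-9]{16,}\b")  # common API-key prefixes (assumed)

def redact(text: str) -> str:
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    text = TOKEN_RE.sub("[REDACTED_TOKEN]", text)
    return text

print(redact("contact ops@example.com, key gsk_abcdefghijklmnop1234"))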

RAG (Optional)

  • Embeddings: sentence-transformers/all-MiniLM-L6-v2 (GPU-aware)
  • Re-ranking: optional cross-encoder/ms-marco-MiniLM-L-2-v2 (GPU-aware)
  • KB: data/kb.jsonl (one JSON per line: { "text": "...", "source": "..." })
  • Tunable: rag.top_k, RAG_RERANK, RAG_KB_PATH
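
A simplified retrieval pass over data/kb.jsonl might look like the sketch below. It uses the embedding model named above but is not the service's exact pipeline (no re-ranker, no caching); treat it as a starting point.

# rag_sketch.py - simplified retrieval over data/kb.jsonl (not the service's exact pipeline)
import json
from typing import List

from sentence_transformers import SentenceTransformer, util

def load_kb(path: str = "data/kb.jsonl") -> List[dict]:
    with open(path, "r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

kb = load_kb()
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_emb = model.encode([d["text"] for d in kb], convert_to_tensor=True)

def retrieve(query: str, top_k: int = 4) -> List[dict]:
    q_emb = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_emb, top_k=top_k)[0]
    return [{**kb[h["corpus_id"]], "score": float(h["score"])} for h in hits]

for hit in retrieve("How do I register an app in MatrixHub?"):
    print(round(hit["score"], 3), hit["source"])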

Deployments

Hugging Face Spaces (recommended for demo)

  1. Push the repo to a new Space (FastAPI).
  2. Settings → Secrets:
    • GROQ_API_KEY, GOOGLE_API_KEY, HF_TOKEN (as needed by the cascade)
    • ADMIN_TOKEN (optional; gates /v1/plan & /v1/chat*)
  3. Choose hardware (CPU is fine; GPU improves RAG throughput and the cross-encoder).
  4. The Space runs uvicorn and exposes all endpoints.

Containers / Cloud

  • Use a minimal Python base image and install dependencies with pip install -r requirements.txt.
  • Expose port 7860 (configurable).
  • Set secrets via your orchestrator (Kubernetes Secrets, ECS, etc.).
  • Scale with multiple Uvicorn workers; put behind an HTTP proxy that supports streaming (e.g., nginx with proxy_buffering off for SSE).
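
If you prefer a Python entrypoint over the CLI, a programmatic launch with multiple workers might look like this sketch (the worker count is an example; tune it per CPU):

# serve.py - programmatic Uvicorn launch; worker count is an example value
import uvicorn

if __name__ == "__main__":
    # With workers > 1, the app must be passed as an import string.
    uvicorn.run("app.main:app", host="0.0.0.0", port=7860, workers=2)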

Observability

  • Trace IDs (X-Trace-Id) attached per request and logged
  • Timing headers: X-Process-Time-Ms, Server-Timing
  • Provider selection logs (e.g., Provider 'groq' succeeded in 0.82s)
  • Metrics endpoints can be added behind an auth wall (Prometheus friendly)
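
The trace and timing headers can be added by a middleware along these lines. This is a simplified illustration, not the production SSE-hardened middleware shipped with matrix-ai:

# tracing_sketch.py - simplified trace/timing middleware (illustration only)
import time
import uuid

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def add_trace_headers(request: Request, call_next):
    trace_id = request.headers.get("X-Trace-Id", uuid.uuid4().hex)
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    response.headers["X-Trace-Id"] = trace_id
    response.headers["X-Process-Time-Ms"] = f"{elapsed_ms:.1f}"
    response.headers["Server-Timing"] = f"app;dur={elapsed_ms:.1f}"
    return response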


Development Notes

  • Keep /v1/plan internal behind a network boundary or ADMIN_TOKEN.
  • Validate payloads rigorously (Pydantic) and write contract tests for the plan schema.
  • If you switch models, re-run golden tests to guard against plan drift.
  • Avoid logging sensitive data; logs are structured JSON only.
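
A contract test for the plan schema could look like the sketch below. The PlanStep/Plan models are hypothetical stand-ins written from the /v1/plan example in this README; in a real test, import the service's own Pydantic models instead (assumes Pydantic v2 and pytest).

# test_plan_contract.py - plan schema contract test sketch (models are hypothetical stand-ins)
from typing import List, Literal, Optional

import pytest
from pydantic import BaseModel, ValidationError

class PlanStep(BaseModel):
    action: str
    target: Optional[str] = None
    entity_uid: Optional[str] = None
    retries: int = 0

class Plan(BaseModel):
    plan_id: str
    risk: Literal["low", "medium", "high"]
    steps: List[PlanStep]
    explanation: str

def test_valid_plan_parses():
    Plan.model_validate({
        "plan_id": "pln_x",
        "risk": "low",
        "steps": [{"action": "reprobe", "target": "https://service/health", "retries": 2}],
        "explanation": "Re-probe before escalating.",
    })

def test_unknown_risk_is_rejected():
    with pytest.raises(ValidationError):
        Plan.model_validate(
            {"plan_id": "p", "risk": "extreme", "steps": [], "explanation": ""}
        )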

License

Apache-2.0


Tip: The cascade order is controlled by provider_order (groq,gemini,router). If Groq is rate-limited or missing, the service automatically falls back to Gemini, then Hugging Face Router (Zephyr → Mistral). Streaming works out of the box and is middleware-safe.