W&B MCP Server - Architecture & Scalability Guide
Table of Contents
- Architecture Decision
- Stateless HTTP Design
- Performance & Scalability
- Load Test Results
- Deployment Recommendations
Architecture Decision
Decision: Pure Stateless HTTP Mode
The W&B MCP Server uses pure stateless HTTP mode (stateless_http=True).
This fundamental architecture decision enables:
- ✅ Universal client compatibility (OpenAI, Cursor, LeChat, Claude)
- ✅ Horizontal scaling capabilities
- ✅ Simpler operations and maintenance
- ✅ Cloud-native deployment patterns
Why Stateless?
The Model Context Protocol traditionally used stateful sessions, but this created issues:
| Client | Behavior | Problem with Stateful |
|---|---|---|
| OpenAI | Deletes the session after listing tools, then reuses its ID | "Session not found" errors |
| Cursor | Sends Bearer token with every request | Expects stateless behavior |
| Claude | Can work with either model | No issues |
The Solution
# Pure stateless operation - no session persistence
mcp = FastMCP("wandb-mcp-server", stateless_http=True)
With this approach:
- Session IDs are correlation IDs only - they match requests to responses
- No state persists between requests - each request is independent
- Authentication required per request - a Bearer token must be included (see the client sketch below)
- Any worker can handle any request - enables horizontal scaling
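For illustration, a single stateless call can look like the sketch below. It is not a transcript of any real client: the endpoint path (/mcp), the JSON-RPC payload shape, and the WANDB_API_KEY environment variable are assumptions based on the MCP Streamable HTTP transport.

# Hypothetical client call - every request is self-contained.
import os
import uuid

import httpx

MCP_URL = "https://mcp.withwandb.com/mcp"  # assumed endpoint path

payload = {"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}
headers = {
    "Authorization": f"Bearer {os.environ['WANDB_API_KEY']}",  # auth on every request
    "Mcp-Session-Id": uuid.uuid4().hex,  # correlation ID only - the server stores nothing
    "Accept": "application/json, text/event-stream",
}

response = httpx.post(MCP_URL, json=payload, headers=headers, timeout=30)
response.raise_for_status()
# Depending on transport negotiation, the body is plain JSON or a single SSE event.
print(response.text)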
Stateless HTTP Design
Architecture Overview
┌──────────────────────────────────────┐
│  MCP Clients (OpenAI/Cursor/etc)     │
│  Bearer Token with Each Request      │
└──────────────┬───────────────────────┘
               │ HTTPS
┌──────────────▼───────────────────────┐
│       Load Balancer (Optional)       │
│       Round-Robin Distribution       │
└───┬──────────┬──────────┬────────────┘
    │          │          │
┌───▼───┐  ┌───▼───┐  ┌───▼───┐
│  W1   │  │  W2   │  │  W3   │  (Multiple Workers Possible)
│       │  │       │  │       │
│ ASGI  │  │ ASGI  │  │ ASGI  │  Uvicorn/Gunicorn
└───┬───┘  └───┬───┘  └───┬───┘
    │          │          │
┌───▼──────────▼──────────▼────────────┐
│         FastAPI Application          │
│  ┌────────────────────────────┐      │
│  │ Stateless Auth Middleware  │      │
│  │ (Bearer Token Validation)  │      │
│  └────────────────────────────┘      │
│  ┌────────────────────────────┐      │
│  │ MCP Stateless Handler      │      │
│  │ (No Session Storage)       │      │
│  └────────────────────────────┘      │
└──────────────┬───────────────────────┘
               │
┌──────────────▼───────────────────────┐
│         W&B API Integration          │
└──────────────────────────────────────┘
Request Flow
1. Client sends request with Bearer token and session ID
2. Middleware validates Bearer token
3. MCP processes request (session ID used for correlation only)
4. Response sent with matching session ID
5. No state persisted - request complete
Key Implementation Details
import logging

from fastapi import Request
from fastapi.responses import JSONResponse

logger = logging.getLogger(__name__)

async def thread_safe_auth_middleware(request: Request, call_next):
    """Stateless authentication middleware."""
    # Session IDs are correlation IDs only - used to match requests to responses
    session_id = request.headers.get("Mcp-Session-Id")
    if session_id:
        logger.debug(f"Correlation ID: {session_id[:8]}...")
    # Every request must carry its own Bearer token
    authorization = request.headers.get("Authorization", "")
    if authorization.startswith("Bearer "):
        api_key = authorization[7:].strip()
        # Use the API key for this request only - no session storage or retrieval
        request.state.api_key = api_key
        return await call_next(request)
    # Reject unauthenticated requests
    return JSONResponse({"error": "Missing Bearer token"}, status_code=401)
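How the middleware is attached depends on the application layout; one possible wiring uses Starlette's BaseHTTPMiddleware (the bare FastAPI app shown here is an assumption, not the exact production setup):

# Hypothetical wiring - run every request through the stateless auth check above.
from fastapi import FastAPI
from starlette.middleware.base import BaseHTTPMiddleware

app = FastAPI()
app.add_middleware(BaseHTTPMiddleware, dispatch=thread_safe_auth_middleware)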
Performance & Scalability
Single Worker Performance
Based on testing with stateless mode:
| Metric | Local Server | Remote (HF Spaces) |
|---|---|---|
| Max Concurrent | 1000 clients | 500+ clients |
| Throughput | ~50-60 req/s | ~35 req/s |
| Latency (p50) | <500ms | <2s |
| Memory Usage | 200-500MB | 300-600MB |
Horizontal Scaling Potential
With stateless mode, the server supports true horizontal scaling:
| Workers | Max Concurrent | Total Throughput | Notes |
|---|---|---|---|
| 1 | 1000 | ~50 req/s | Current deployment |
| 2 | 2000 | ~100 req/s | Linear scaling |
| 4 | 4000 | ~200 req/s | Near-linear |
| 8 | 8000 | ~400 req/s | Some overhead |
Key Advantage: No session affinity required - any worker can handle any request!
Load Test Results
Latest Test Results (2025-09-25)
Local Server (macOS, Single Worker)
| Concurrent Clients | Success Rate | Throughput | Mean Response |
|---|---|---|---|
| 10 | 100% | 47 req/s | 89ms |
| 100 | 100% | 47 req/s | 1.2s |
| 500 | 100% | 56 req/s | 4.4s |
| 1000 | 100% | 48 req/s | 9.3s |
| 1500 | 80% | 51 req/s | 15.4s |
| 2000 | 70% | 53 req/s | 20.8s |
Breaking Point: ~1500 concurrent connections
Remote Server (mcp.withwandb.com)
| Concurrent Clients | Success Rate | Throughput | Mean Response |
|---|---|---|---|
| 10 | 100% | 10 req/s | 0.8s |
| 50 | 100% | 29 req/s | 1.2s |
| 100 | 100% | 33 req/s | 1.9s |
| 200 | 100% | 34 req/s | 3.3s |
| 500 | 100% | 35 req/s | 7.5s |
Key Finding: Remote server handles 500+ concurrent connections reliably!
Performance Sweet Spots
- Low Latency (<1s response): Use ≤50 concurrent connections
- Balanced (good throughput & latency): Use 100-200 concurrent connections
- Maximum Throughput: Use 200-300 concurrent connections
- Maximum Capacity: Up to 500 concurrent (remote) or 1000 (local)
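Results in this range can be reproduced with a small asyncio driver. The sketch below illustrates the approach only; the endpoint path, payload, and client count are placeholders, and it is not the harness that produced the numbers above.

# Concurrency probe sketch: fire N independent stateless requests in parallel.
import asyncio
import os
import time
import uuid

import httpx

MCP_URL = "https://mcp.withwandb.com/mcp"  # assumed endpoint path
PAYLOAD = {"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}

async def one_request(client: httpx.AsyncClient) -> bool:
    headers = {
        "Authorization": f"Bearer {os.environ['WANDB_API_KEY']}",
        "Mcp-Session-Id": uuid.uuid4().hex,  # correlation only
        "Accept": "application/json, text/event-stream",
    }
    try:
        resp = await client.post(MCP_URL, json=PAYLOAD, headers=headers, timeout=60)
        return resp.status_code == 200
    except httpx.HTTPError:
        return False

async def run(concurrency: int) -> None:
    limits = httpx.Limits(max_connections=concurrency)
    async with httpx.AsyncClient(limits=limits) as client:
        start = time.perf_counter()
        results = await asyncio.gather(*(one_request(client) for _ in range(concurrency)))
        elapsed = time.perf_counter() - start
    ok = sum(results)
    print(f"{concurrency} clients: {ok / concurrency:.0%} success, "
          f"{ok / elapsed:.1f} req/s, {elapsed:.1f}s total")

if __name__ == "__main__":
    asyncio.run(run(100))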
Deployment Recommendations
Current Deployment (HuggingFace Spaces)
Configuration:
- Single worker (can be increased)
- Stateless HTTP mode
- 2 vCPU, 16GB RAM
- Port 7860
Performance:
- 500+ concurrent connections
- ~35 req/s throughput
- 100% reliability up to 500 concurrent
Scaling Options
Option 1: Vertical Scaling
- Increase CPU/RAM on HuggingFace Spaces
- Can improve single-worker throughput
Option 2: Horizontal Scaling (Recommended)
# app.py - enable multiple workers (pass the app as an import string when workers > 1)
uvicorn.run("app:app", host="0.0.0.0", port=PORT, workers=4)
Option 3: Multi-Region Deployment
- Deploy to multiple regions
- Use global load balancer
- Reduce latency for users worldwide
Production Checklist
- ✅ Stateless mode enabled (stateless_http=True)
- ✅ Bearer authentication on every request
- ✅ Health check endpoint (/health)
- ✅ Monitoring for response times and errors
- ✅ Rate limiting (recommended: 100 req/s per client)
- ✅ Connection limits (recommended: 500 concurrent)
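The health check endpoint from the checklist can be a minimal unauthenticated route. The sketch below is illustrative only; the production endpoint may report more detail, and it assumes health checks bypass the Bearer-token middleware.

# Hypothetical /health route - liveness probe with no W&B calls and no per-request state.
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}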
Configuration Example
# Production configuration - pure stateless mode
import uvicorn

mcp = FastMCP("wandb-mcp-server", stateless_http=True)

# Uvicorn with multiple workers (if needed)
if __name__ == "__main__":
    uvicorn.run(
        app,                     # pass an import string ("app:app") when workers > 1
        host="0.0.0.0",
        port=7860,
        workers=1,               # increase for horizontal scaling
        limit_concurrency=1000,  # maximum simultaneous connections
        timeout_keep_alive=30,   # keep-alive timeout in seconds
    )
Security Considerations
- API Key Validation: Every request validates the Bearer token (see the sketch below)
- No Session Storage: No risk of session hijacking
- Rate Limiting: Protect against abuse
- HTTPS Only: Always use TLS in production
- Token Rotation: Encourage regular API key rotation
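One way to perform that per-request validation is to resolve the key against the W&B API itself. The sketch below assumes a recent wandb SDK in which wandb.Api accepts an api_key argument and exposes the authenticated viewer; it is illustrative, not the server's actual validation path.

# Illustrative only - check a W&B API key by resolving the authenticated viewer.
import wandb

def is_valid_api_key(api_key: str) -> bool:
    try:
        api = wandb.Api(api_key=api_key, timeout=10)  # assumes api_key kwarg support
        return api.viewer is not None
    except Exception:
        # Any auth or network failure counts as invalid for this request.
        return False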
Summary
The W&B MCP Server's stateless architecture provides:
- Universal Compatibility: Works with all MCP clients
- Excellent Performance: 500+ concurrent connections, ~35 req/s
- Horizontal Scalability: Add workers to increase capacity
- Simple Operations: No session management complexity
- Production Ready: Deployed and tested at scale
The stateless design is not a compromise - it's the optimal architecture for MCP servers in production environments.