
W&B MCP Server - Architecture & Scalability Guide

Table of Contents

  1. Architecture Decision
  2. Stateless HTTP Design
  3. Performance & Scalability
  4. Load Test Results
  5. Deployment Recommendations

Architecture Decision

Decision: Pure Stateless HTTP Mode

The W&B MCP Server uses pure stateless HTTP mode (stateless_http=True).

This fundamental architecture decision enables:

  • ✅ Universal client compatibility (OpenAI, Cursor, LeChat, Claude)
  • ✅ Horizontal scaling capabilities
  • ✅ Simpler operations and maintenance
  • ✅ Cloud-native deployment patterns

Why Stateless?

The Model Context Protocol traditionally used stateful sessions, but this created issues:

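| Client | Behavior                                            | Problem with Stateful      |
|--------|-----------------------------------------------------|----------------------------|
| OpenAI | Deletes session after listing tools, then reuses ID | Session not found errors   |
| Cursor | Sends Bearer token with every request               | Expects stateless behavior |
| Claude | Can work with either model                          | No issues                  |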

The Solution

# Pure stateless operation - no session persistence
mcp = FastMCP("wandb-mcp-server", stateless_http=True)

With this approach:

  • Session IDs are correlation IDs only - they match requests to responses
  • No state persists between requests - each request is independent
  • Authentication required per request - Bearer token must be included
  • Any worker can handle any request - enables horizontal scaling
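The properties listed above can be illustrated with a minimal sketch. This is not the server's actual handler: `IncomingRequest`, `handle`, and the placeholder key are hypothetical names used only for illustration. The point is that the response depends only on what the request carries, so two workers produce the same answer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IncomingRequest:
    """Hypothetical stand-in for one HTTP request to the MCP server."""
    session_id: Optional[str]  # correlation ID only, never a lookup key
    bearer_token: str          # must accompany every request
    method: str


def handle(worker_name: str, request: IncomingRequest) -> dict:
    """Handle one request using only the data it carries."""
    if not request.bearer_token:
        # No stored session can vouch for the caller, so reject.
        return {"session_id": request.session_id, "status": 401}
    # The session ID is echoed back for correlation; nothing is stored.
    return {
        "session_id": request.session_id,
        "status": 200,
        "method": request.method,
    }


req = IncomingRequest(session_id="abc123", bearer_token="hypothetical-key", method="tools/list")
same_answer = handle("W1", req) == handle("W2", req)
```

Because `handle` reads no shared state, `same_answer` holds regardless of which worker receives the request.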

Stateless HTTP Design

Architecture Overview

┌─────────────────────────────────────┐
│    MCP Clients (OpenAI/Cursor/etc)  │
│    Bearer Token with Each Request   │
└─────────────┬───────────────────────┘
              │ HTTPS
┌─────────────▼───────────────────────┐
│       Load Balancer (Optional)      │
│       Round-Robin Distribution      │
└──┬──────────┬──────────┬────────────┘
   │          │          │
┌──▼───┐   ┌──▼───┐   ┌──▼───┐
│ W1   │   │ W2   │   │ W3   │  (Multiple Workers Possible)
│      │   │      │   │      │
│ ASGI │   │ ASGI │   │ ASGI │  Uvicorn/Gunicorn
└──┬───┘   └──┬───┘   └──┬───┘
   │          │          │
┌──▼──────────▼──────────▼────────────┐
│         FastAPI Application         │
│  ┌─────────────────────────────┐    │
│  │  Stateless Auth Middleware  │    │
│  │  (Bearer Token Validation)  │    │
│  └─────────────────────────────┘    │
│  ┌─────────────────────────────┐    │
│  │    MCP Stateless Handler    │    │
│  │    (No Session Storage)     │    │
│  └─────────────────────────────┘    │
└─────────────┬───────────────────────┘
              │
┌─────────────▼───────────────────────┐
│         W&B API Integration         │
└─────────────────────────────────────┘

Request Flow

  1. Client sends request with Bearer token and session ID
  2. Middleware validates Bearer token
  3. MCP processes request (session ID used for correlation only)
  4. Response sent with matching session ID
  5. No state persisted - request complete
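From the client side, step 1 of the flow above might look like the following sketch. It only builds the request and does not send it. The `/mcp` path, the placeholder API key, and the session ID value are assumptions for illustration; the `Accept` header follows the MCP Streamable HTTP transport, and the body is a plain JSON-RPC 2.0 `tools/list` call.

```python
import json
import urllib.request

# Host taken from the load test section; the "/mcp" path is an assumption.
MCP_URL = "https://mcp.withwandb.com/mcp"


def build_mcp_request(api_key: str, session_id: str) -> urllib.request.Request:
    """Build (but do not send) a self-contained MCP request."""
    body = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "tools/list"}).encode()
    return urllib.request.Request(
        MCP_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # sent on every request
            "Mcp-Session-Id": session_id,          # correlation ID only
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
    )


request = build_mcp_request("hypothetical-api-key", "3f2a9c1e")
```

Every request built this way carries its own credentials, so it does not matter which worker receives it.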

Key Implementation Details

import logging

from fastapi import Request

logger = logging.getLogger(__name__)


async def thread_safe_auth_middleware(request: Request, call_next):
    """Stateless authentication middleware."""

    # Session IDs are correlation IDs only
    session_id = request.headers.get("Mcp-Session-Id")
    if session_id:
        logger.debug(f"Correlation ID: {session_id[:8]}...")

    # Every request must have Bearer token
    authorization = request.headers.get("Authorization", "")
    if authorization.startswith("Bearer "):
        api_key = authorization[7:].strip()
        # Use API key for this request only
        # No session storage or retrieval

    # Excerpt: rejection of requests without a token is not shown here.
    # Pass the request on to the next handler and return its response.
    return await call_next(request)
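The token check in the middleware can be isolated as a pure function, which makes the per-request rule easy to test. The sketch below is an illustration, not the server's code: `authenticate` is a hypothetical helper, and returning 401 for a missing or empty token is an assumed policy.

```python
from typing import Mapping, Optional, Tuple

BEARER_PREFIX = "Bearer "


def authenticate(headers: Mapping[str, str]) -> Tuple[int, Optional[str]]:
    """Return (status_code, api_key) derived from this request's headers alone."""
    authorization = headers.get("Authorization", "")
    if not authorization.startswith(BEARER_PREFIX):
        return 401, None  # assumed policy: no token, no access
    api_key = authorization[len(BEARER_PREFIX):].strip()
    if not api_key:
        return 401, None  # "Bearer " followed by nothing usable
    return 200, api_key


ok = authenticate({"Authorization": "Bearer hypothetical-key"})
missing = authenticate({"Mcp-Session-Id": "abc123"})
```

A session ID on its own never authenticates a request: `missing` is rejected even though it carries an `Mcp-Session-Id`.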

Performance & Scalability

Single Worker Performance

Based on testing with stateless mode:

| Metric         | Local Server | Remote (HF Spaces) |
|----------------|--------------|--------------------|
| Max Concurrent | 1000 clients | 500+ clients       |
| Throughput     | ~50-60 req/s | ~35 req/s          |
| Latency (p50)  | <500ms       | <2s                |
| Memory Usage   | 200-500MB    | 300-600MB          |

Horizontal Scaling Potential

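With stateless mode, the server supports true horizontal scaling. The single-worker row reflects the local measurements; the multi-worker rows are projections that assume near-linear scaling:

| Workers | Max Concurrent | Total Throughput | Notes              |
|---------|----------------|------------------|--------------------|
| 1       | 1000           | ~50 req/s        | Current deployment |
| 2       | 2000           | ~100 req/s       | Linear scaling     |
| 4       | 4000           | ~200 req/s       | Near-linear        |
| 8       | 8000           | ~400 req/s       | Some overhead      |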

Key Advantage: No session affinity required - any worker can handle any request!
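To make the "no session affinity" point concrete, here is a small simulation of round-robin distribution. It is a sketch under simple assumptions: the worker names come from the architecture diagram, and the request dictionaries are hypothetical stand-ins for real traffic.

```python
from collections import Counter
from itertools import cycle

workers = ["W1", "W2", "W3"]  # worker names taken from the diagram
next_worker = cycle(workers)  # round-robin iterator over the workers

# Six requests that all carry the same session ID.
requests = [{"session_id": "abc123", "method": "tools/list"} for _ in range(6)]

# Each request goes to whichever worker is next; the session ID plays no role.
assignments = [next(next_worker) for _ in requests]
per_worker = Counter(assignments)
```

With a stateful design, the balancer would have to pin `abc123` to one worker. Here the same session ID is spread evenly across all three.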


Load Test Results

Latest Test Results (2025-09-25)

Local Server (macOS, Single Worker)

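| Concurrent Clients | Success Rate | Throughput | Mean Response |
|--------------------|--------------|------------|---------------|
| 10                 | 100%         | 47 req/s   | 89ms          |
| 100                | 100%         | 47 req/s   | 1.2s          |
| 500                | 100%         | 56 req/s   | 4.4s          |
| 1000               | 100%         | 48 req/s   | 9.3s          |
| 1500               | 80%          | 51 req/s   | 15.4s         |
| 2000               | 70%          | 53 req/s   | 20.8s         |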

Breaking Point: between 1000 and 1500 concurrent connections (success rate is 100% at 1000 and drops to 80% at 1500)

Remote Server (mcp.withwandb.com)

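| Concurrent Clients | Success Rate | Throughput | Mean Response |
|--------------------|--------------|------------|---------------|
| 10                 | 100%         | 10 req/s   | 0.8s          |
| 50                 | 100%         | 29 req/s   | 1.2s          |
| 100                | 100%         | 33 req/s   | 1.9s          |
| 200                | 100%         | 34 req/s   | 3.3s          |
| 500                | 100%         | 35 req/s   | 7.5s          |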

Key Finding: The remote server sustained a 100% success rate at every tested level, up to 500 concurrent connections.
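One way to read the remote results: throughput levels off near 35 req/s while mean response time keeps growing with the client count. A rough model reproduces that pattern, assuming (the test setup is not described here, so this is a guess) that all clients send one request at the same instant and the server completes them at a fixed rate. Under that assumption the k-th request finishes at k / rate, so the mean over N requests is (N + 1) / (2 × rate).

```python
def mean_completion_time(concurrent_clients: int, throughput_rps: float) -> float:
    """Mean completion time for a burst served at a fixed rate (rough model)."""
    # Request k (1-based) finishes at k / rate; average over k = 1..N.
    return (concurrent_clients + 1) / (2 * throughput_rps)


# (clients, measured throughput in req/s, measured mean response in s)
remote_rows = [(100, 33, 1.9), (200, 34, 3.3), (500, 35, 7.5)]

for clients, rps, measured in remote_rows:
    estimate = mean_completion_time(clients, rps)
    print(f"{clients:>4} clients: estimated {estimate:.1f}s, measured {measured}s")
```

The estimates land within a few tenths of a second of the measured means, which is consistent with queueing behind a throughput ceiling rather than with per-request processing getting slower. This is an interpretation of the numbers, not a measured result.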

Performance Sweet Spots

  1. Low Latency (<1s response): Use ≤50 concurrent connections
  2. Balanced (good throughput & latency): Use 100-200 concurrent connections
  3. Maximum Throughput: Use 200-300 concurrent connections
  4. Maximum Capacity: Up to 500 concurrent (remote) or 1000 (local)
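A client can stay inside one of these ranges by capping its own concurrency. The sketch below uses `asyncio.Semaphore` with a limit of 50 (the low-latency range above). The `asyncio.sleep` call is a stand-in for a real MCP request, and `run_with_limit` is a hypothetical helper name.

```python
import asyncio


async def run_with_limit(num_requests: int, max_concurrent: int) -> int:
    """Run num_requests simulated calls and return the peak number in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def one_call() -> None:
        nonlocal in_flight, peak
        async with semaphore:  # waits here once max_concurrent calls are active
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.001)  # stand-in for a real MCP request
            in_flight -= 1

    await asyncio.gather(*(one_call() for _ in range(num_requests)))
    return peak


peak = asyncio.run(run_with_limit(num_requests=200, max_concurrent=50))
```

All 200 calls complete, but `peak` never exceeds the limit, so the server sees at most 50 connections from this client at a time.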

Deployment Recommendations

Current Deployment (HuggingFace Spaces)

Configuration:
  - Single worker (can be increased)
  - Stateless HTTP mode
  - 2 vCPU, 16GB RAM
  - Port 7860

Performance:
  - 500+ concurrent connections
  - ~35 req/s throughput
  - 100% reliability up to 500 concurrent

Scaling Options

Option 1: Vertical Scaling

  • Increase CPU/RAM on HuggingFace Spaces
  • Can improve single-worker throughput

Option 2: Horizontal Scaling (Recommended)

# app.py - Enable multiple workers
# With workers > 1, uvicorn needs the app as an import string ("module:attribute")
uvicorn.run("app:app", host="0.0.0.0", port=PORT, workers=4)

Option 3: Multi-Region Deployment

  • Deploy to multiple regions
  • Use global load balancer
  • Reduce latency for users worldwide

Production Checklist

✅ Stateless mode enabled (stateless_http=True)
✅ Bearer authentication on every request
✅ Health check endpoint (/health)
✅ Monitoring for response times and errors
✅ Rate limiting (recommended: 100 req/s per client)
✅ Connection limits (recommended: 500 concurrent)
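The rate-limiting item can be implemented in several ways. One common option is a token bucket per client, sketched below. This is an illustration only: the document does not say which mechanism the server uses, `TokenBucket` is a hypothetical class, and the clock is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Token bucket: allows `rate` requests per second with bursts up to `capacity`."""
    rate: float         # tokens added per second
    capacity: float     # maximum number of stored tokens
    tokens: float       # tokens currently available
    last_refill: float  # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# 100 req/s per client, matching the recommendation in the checklist.
bucket = TokenBucket(rate=100, capacity=100, tokens=100, last_refill=0.0)
allowed_in_burst = sum(bucket.allow(now=0.0) for _ in range(150))
allowed_after_half_second = sum(bucket.allow(now=0.5) for _ in range(150))
```

In a stateless deployment, one bucket per API key could live in each worker's memory, with the caveat that limits are then enforced per worker rather than globally.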

Configuration Example

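# Production configuration
import uvicorn

mcp = FastMCP("wandb-mcp-server", stateless_http=True)

# Uvicorn with multiple workers (if needed)
if __name__ == "__main__":
    uvicorn.run(
        "app:app",  # import string (assumes this file is app.py); required when workers > 1
        host="0.0.0.0",
        port=7860,
        workers=1,  # Increase for horizontal scaling
        limit_concurrency=1000,  # Max concurrent connections; beyond this uvicorn returns 503
        timeout_keep_alive=30,  # Seconds to keep idle keep-alive connections open
    )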

Security Considerations

  1. API Key Validation: Every request validates Bearer token
  2. No Session Storage: No server-side sessions exist to hijack
  3. Rate Limiting: Protect against abuse
  4. HTTPS Only: Always use TLS in production
  5. Token Rotation: Encourage regular API key rotation

Summary

The W&B MCP Server's stateless architecture provides:

  • Universal Compatibility: Works with OpenAI, Cursor, LeChat, and Claude clients
  • Excellent Performance: 500+ concurrent connections, ~35 req/s
  • Horizontal Scalability: Add workers to increase capacity
  • Simple Operations: No session management complexity
  • Production Ready: Deployed and tested at scale

The stateless design is not a compromise - it's the optimal architecture for MCP servers in production environments.