sdlc-agent / docs /api /RAG_API.md
Veeru-c's picture
initial commit
06bd253

A newer version of the Gradio SDK is available: 6.1.0

Upgrade

RAG API Documentation

Fast API endpoint for querying the product design RAG system with <3 second response times.

Quick Start

Deploy the API

# Deploy to Modal
modal deploy src/rag/rag_api.py

# Get the URL
modal app list

Use the API

from src.rag.api_client import RAGAPIClient

client = RAGAPIClient(base_url="https://your-modal-url.modal.run")
result = client.query("What are the three product tiers?")
print(result['answer'])

API Endpoints

Health Check

GET /health

Response:

{
  "status": "healthy",
  "service": "rag-api"
}

Query

POST /query
Content-Type: application/json

{
  "question": "What are the three product tiers?",
  "top_k": 5,
  "max_tokens": 1024
}

Response:

{
  "answer": "The three product tiers are...",
  "retrieval_time": 0.45,
  "generation_time": 1.23,
  "total_time": 1.68,
  "sources": [
    {
      "content": "...",
      "metadata": {...}
    }
  ],
  "success": true
}

Performance Optimization

Target: <3 Second Responses

The API is optimized for fast responses:

  1. Warm Containers: min_containers=1 keeps a container ready
  2. Optimized LLM: Reduced max_tokens (1024 vs 1536)
  3. Limited Context: Top 3 documents, 800 chars each
  4. Prefix Caching: Enabled for faster generation
  5. Concurrent Requests: Up to 10 concurrent requests

Response Time Breakdown

  • Retrieval: 0.3-0.8 seconds
  • Generation: 1.0-2.0 seconds
  • Total: 1.5-3.0 seconds (target: <3s)

Usage Examples

Python Client

from src.rag.api_client import RAGAPIClient

# Initialize
client = RAGAPIClient(base_url="https://your-api-url.modal.run")

# Health check
health = client.health_check()
print(health)

# Query
result = client.query("What are the premium ranges?")
print(result['answer'])

# Fast query (optimized for speed)
result = client.query_fast("What are the three tiers?")
print(result['answer'])

cURL

# Health check
curl https://your-api-url.modal.run/health

# Query
curl -X POST https://your-api-url.modal.run/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the three product tiers?",
    "top_k": 5,
    "max_tokens": 1024
  }'

JavaScript/TypeScript

const response = await fetch('https://your-api-url.modal.run/query', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    question: 'What are the three product tiers?',
    top_k: 5,
    max_tokens: 1024
  })
});

const data = await response.json();
console.log(data.answer);

Configuration

Environment Variables

  • MODAL_APP_NAME: App name (default: "insurance-rag-api")
  • MODAL_VOLUME_NAME: Volume name (default: "mcp-hack-ins-products")

API Parameters

  • question (required): The question to ask
  • top_k (optional, default: 5): Number of documents to retrieve
  • max_tokens (optional, default: 1024): Maximum response length

Performance Tips

  1. Use Fast Query: For speed-critical applications, use query_fast() method
  2. Reduce top_k: Lower top_k (e.g., 3) for faster retrieval
  3. Reduce max_tokens: Lower max_tokens (e.g., 512) for faster generation
  4. Cache Results: Cache common queries client-side
  5. Batch Requests: If possible, batch multiple queries

Error Handling

result = client.query("your question")

if result.get("success"):
    print(result['answer'])
else:
    print(f"Error: {result.get('error', 'Unknown error')}")

Monitoring

Response Times

Monitor the total_time field in responses:

  • < 2s: Excellent
  • 2-3s: Good (target)
  • 3s: May need optimization

Health Monitoring

health = client.health_check()
if health.get("status") != "healthy":
    # Handle unhealthy state
    pass

Deployment

Modal Deployment

# Deploy
modal deploy src/rag/rag_api.py

# Get URL
modal app show insurance-rag-api

Local Testing

# Run locally (for development)
modal serve src/rag/rag_api.py

Rate Limiting

The API supports up to 10 concurrent requests. For higher throughput:

  • Deploy multiple instances
  • Use load balancer
  • Implement client-side rate limiting

Security

  • Add authentication if needed
  • Use HTTPS in production
  • Implement rate limiting
  • Validate input questions

Troubleshooting

Slow Responses (>3s)

  • Check if container is warm (min_containers=1)
  • Reduce max_tokens
  • Reduce top_k
  • Check network latency

Errors

  • Verify documents are indexed
  • Check Modal app status
  • Review error messages in response