RAG API Documentation

A fast API endpoint for querying the product design RAG system, targeting response times under 3 seconds.

Quick Start

Deploy the API

# Deploy to Modal
modal deploy src/rag/rag_api.py

# List deployed apps (the endpoint URL is printed by `modal deploy`)
modal app list

Use the API

from src.rag.api_client import RAGAPIClient

client = RAGAPIClient(base_url="https://your-api-url.modal.run")
result = client.query("What are the three product tiers?")
print(result['answer'])

API Endpoints

Health Check

GET /health

Response:

{
  "status": "healthy",
  "service": "rag-api"
}

Query

POST /query
Content-Type: application/json

{
  "question": "What are the three product tiers?",
  "top_k": 5,
  "max_tokens": 1024
}

Response:

{
  "answer": "The three product tiers are...",
  "retrieval_time": 0.45,
  "generation_time": 1.23,
  "total_time": 1.68,
  "sources": [
    {
      "content": "...",
      "metadata": {...}
    }
  ],
  "success": true
}

Performance Optimization

Target: <3 Second Responses

The API is optimized for fast responses:

Warm Containers: min_containers=1 keeps a container ready
Optimized LLM: Reduced max_tokens (1024 vs 1536)
Limited Context: Top 3 documents, 800 chars each
Prefix Caching: Enabled for faster generation
Concurrent Requests: Up to 10 concurrent requests
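
The "Limited Context" item can be pictured with a small sketch. It assumes retrieved documents are dicts with a "content" key (as in the sources field of the query response) and that they arrive sorted by relevance. The function name build_context and the blank-line separator are hypothetical and are not taken from rag_api.py.

```python
from typing import Any, Dict, List


def build_context(docs: List[Dict[str, Any]], max_docs: int = 3, max_chars: int = 800) -> str:
    """Join the first `max_docs` documents, each cut to `max_chars` characters."""
    # Assumption: docs are ordered best match first, so slicing keeps the top hits.
    snippets = [str(doc.get("content", ""))[:max_chars] for doc in docs[:max_docs]]
    # Hypothetical separator: a blank line between snippets.
    return "\n\n".join(snippets)


# Five retrieved documents of 1000 characters each; only three truncated ones are kept.
docs = [{"content": "x" * 1000} for _ in range(5)]
context = build_context(docs)
print(len(context))
```

Capping both the number of documents and their length bounds the prompt size, which is what keeps generation time predictable.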

Response Time Breakdown

Retrieval: 0.3-0.8 seconds
Generation: 1.0-2.0 seconds
Total: 1.5-3.0 seconds (target: <3s)

Usage Examples

Python Client

from src.rag.api_client import RAGAPIClient

# Initialize
client = RAGAPIClient(base_url="https://your-api-url.modal.run")

# Health check
health = client.health_check()
print(health)

# Query
result = client.query("What are the premium ranges?")
print(result['answer'])

# Fast query (optimized for speed)
result = client.query_fast("What are the three tiers?")
print(result['answer'])

cURL

# Health check
curl https://your-api-url.modal.run/health

# Query
curl -X POST https://your-api-url.modal.run/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the three product tiers?",
    "top_k": 5,
    "max_tokens": 1024
  }'

JavaScript/TypeScript

const response = await fetch('https://your-api-url.modal.run/query', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    question: 'What are the three product tiers?',
    top_k: 5,
    max_tokens: 1024
  })
});

const data = await response.json();
console.log(data.answer);

Configuration

Environment Variables

MODAL_APP_NAME: App name (default: "insurance-rag-api")
MODAL_VOLUME_NAME: Volume name (default: "mcp-hack-ins-products")

API Parameters

question (required): The question to ask
top_k (optional, default: 5): Number of documents to retrieve
max_tokens (optional, default: 1024): Maximum response length
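
As an illustration of how these parameters and their defaults map onto the request body, here is a minimal sketch. The helper name build_query_payload and the lower-bound checks are assumptions; the server may validate differently.

```python
import json
from typing import Any, Dict


def build_query_payload(question: str, top_k: int = 5, max_tokens: int = 1024) -> Dict[str, Any]:
    """Build the JSON body for POST /query using the documented defaults."""
    if not question:
        raise ValueError("question is required")
    # Assumed bounds: both values must be positive integers.
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    return {"question": question, "top_k": top_k, "max_tokens": max_tokens}


body = json.dumps(build_query_payload("What are the three product tiers?", top_k=3))
print(body)
```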

Performance Tips

Use Fast Query: For speed-critical applications, use the query_fast() method
Reduce top_k: Lower top_k (e.g., 3) for faster retrieval
Reduce max_tokens: Lower max_tokens (e.g., 512) for faster generation
Cache Results: Cache common queries client-side
Batch Requests: If possible, batch multiple queries
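
The "Cache Results" tip could look roughly like the sketch below. The CachedQuery wrapper is hypothetical and not part of api_client. It assumes the wrapped callable takes a question string and returns the documented result dict, and it treats questions that differ only in case or surrounding whitespace as the same query.

```python
from typing import Any, Callable, Dict


class CachedQuery:
    """Wrap a query callable and reuse successful results for repeated questions."""

    def __init__(self, query_fn: Callable[[str], Dict[str, Any]]) -> None:
        self._query_fn = query_fn
        self._cache: Dict[str, Dict[str, Any]] = {}

    def __call__(self, question: str) -> Dict[str, Any]:
        key = question.strip().lower()  # assumed normalization rule
        if key in self._cache:
            return self._cache[key]
        result = self._query_fn(question)
        if result.get("success"):
            # Only successful responses are stored, so failures get retried.
            self._cache[key] = result
        return result


# Stand-in for client.query so the example runs without a deployed API.
calls = []

def fake_query(question: str) -> Dict[str, Any]:
    calls.append(question)
    return {"success": True, "answer": "stub answer"}

cached = CachedQuery(fake_query)
cached("What are the three tiers?")
cached("what are the three tiers?  ")
print(len(calls))
```

With a real client this would be used as cached = CachedQuery(client.query). An unbounded dict is fine for a short-lived script; a long-running service would want eviction or expiry.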

Error Handling

result = client.query("your question")

if result.get("success"):
    print(result['answer'])
else:
    print(f"Error: {result.get('error', 'Unknown error')}")
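
The check above only covers failures reported in the response body. If the client can also raise (for example on a timeout or connection error, which is an assumption about RAGAPIClient), a small wrapper can fold both cases into the same result shape. The name safe_query is hypothetical.

```python
from typing import Any, Callable, Dict


def safe_query(query_fn: Callable[[str], Dict[str, Any]], question: str) -> Dict[str, Any]:
    """Call `query_fn` and always return a dict carrying a `success` flag."""
    try:
        result = query_fn(question)
    except Exception as exc:  # e.g. connection errors or timeouts
        return {"success": False, "error": f"{type(exc).__name__}: {exc}"}
    if not isinstance(result, dict):
        return {"success": False, "error": "Unexpected response type"}
    return result


# Stand-in for client.query that simulates a network timeout.
def failing_query(question: str) -> Dict[str, Any]:
    raise TimeoutError("request timed out")

result = safe_query(failing_query, "What are the three product tiers?")
if result.get("success"):
    print(result["answer"])
else:
    print(f"Error: {result.get('error', 'Unknown error')}")
```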

Monitoring

Response Times

Monitor the total_time field in responses:

< 2s: Excellent
2-3s: Good (target)
> 3s: May need optimization
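
These bands are easy to apply in code. The sketch below assumes the third band means "slower than 3 seconds" and that exactly 3.0 seconds still counts as on target; the function name and labels are hypothetical.

```python
def classify_latency(total_time: float) -> str:
    """Map a response's total_time (seconds) onto the monitoring bands."""
    if total_time < 2.0:
        return "excellent"
    if total_time <= 3.0:  # assumed boundary: 3.0s is still within target
        return "good"
    return "may need optimization"


# total_time taken from the example query response.
print(classify_latency(1.68))
```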

Health Monitoring

health = client.health_check()
if health.get("status") != "healthy":
    # Handle unhealthy state
    pass
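
One way to handle an unhealthy state is to poll until the service recovers, for example while a container is still starting. This is a minimal sketch: wait_until_healthy is a hypothetical helper, and the attempt count and delay are arbitrary defaults.

```python
import time
from typing import Any, Callable, Dict


def wait_until_healthy(
    check: Callable[[], Dict[str, Any]], attempts: int = 5, delay: float = 2.0
) -> bool:
    """Poll `check` until it reports status "healthy" or attempts run out."""
    for attempt in range(attempts):
        try:
            if check().get("status") == "healthy":
                return True
        except Exception:
            pass  # treat a failed request like an unhealthy response
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Stand-in for client.health_check: unhealthy once, then healthy.
responses = iter([{"status": "starting"}, {"status": "healthy", "service": "rag-api"}])
print(wait_until_healthy(lambda: next(responses), attempts=3, delay=0))
```

With a real client the call would be wait_until_healthy(client.health_check).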

Deployment

Modal Deployment

# Deploy
modal deploy src/rag/rag_api.py

# Confirm the app is deployed (the endpoint URL is printed by `modal deploy`)
modal app list

Local Testing

# Run locally (for development)
modal serve src/rag/rag_api.py

Rate Limiting

The API supports up to 10 concurrent requests. For higher throughput:

Deploy multiple instances
Use a load balancer
Implement client-side rate limiting
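
Client-side limiting can be as simple as capping the number of in-flight requests at the documented limit of 10. The sketch below uses a thread pool for that. query_many is a hypothetical helper, and it assumes the query callable is safe to call from multiple threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

MAX_CONCURRENT = 10  # matches the documented concurrency limit


def query_many(
    query_fn: Callable[[str], Dict[str, Any]],
    questions: List[str],
    max_concurrent: int = MAX_CONCURRENT,
) -> List[Dict[str, Any]]:
    """Run queries in parallel with at most `max_concurrent` in flight."""
    # The pool size is the cap: extra questions wait until a worker is free.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # map() returns results in the same order as the input questions.
        return list(pool.map(query_fn, questions))


# Stand-in for client.query so the example runs without a deployed API.
def fake_query(question: str) -> Dict[str, Any]:
    return {"success": True, "answer": f"answer to: {question}"}

results = query_many(fake_query, [f"question {i}" for i in range(25)])
print(len(results))
```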

Security

Add authentication if needed
Use HTTPS in production
Implement rate limiting
Validate input questions
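
For the input validation item, a minimal sketch of what a pre-flight check on the question text might look like. The 500-character limit, the cleaning rules, and the name validate_question are all assumptions chosen for illustration.

```python
MAX_QUESTION_CHARS = 500  # hypothetical limit, not defined by the API


def validate_question(raw: str, max_chars: int = MAX_QUESTION_CHARS) -> str:
    """Return a cleaned question or raise if it is unusable."""
    if not isinstance(raw, str):
        raise TypeError("question must be a string")
    # Drop non-printable control characters but keep whitespace for now.
    cleaned = "".join(ch for ch in raw if ch.isprintable() or ch.isspace())
    # Collapse runs of whitespace (tabs, newlines, spaces) to single spaces.
    cleaned = " ".join(cleaned.split())
    if not cleaned:
        raise ValueError("question must not be empty")
    if len(cleaned) > max_chars:
        raise ValueError(f"question exceeds {max_chars} characters")
    return cleaned


print(validate_question("  What are\tthe three product tiers?\x00 "))
```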

Troubleshooting

Slow Responses (>3s)

Check that the container is warm (min_containers=1)
Reduce max_tokens
Reduce top_k
Check network latency

Errors

Verify documents are indexed
Check Modal app status
Review error messages in response