RAG API Documentation
A fast API endpoint for querying the product design RAG system, targeting response times under 3 seconds.
Quick Start
Deploy the API
# Deploy to Modal
modal deploy src/rag/rag_api.py
# Get the URL
modal app list
Use the API
from src.rag.api_client import RAGAPIClient
client = RAGAPIClient(base_url="https://your-modal-url.modal.run")
result = client.query("What are the three product tiers?")
print(result['answer'])
API Endpoints
Health Check
GET /health
Response:
{
  "status": "healthy",
  "service": "rag-api"
}
Query
POST /query
Content-Type: application/json
{
  "question": "What are the three product tiers?",
  "top_k": 5,
  "max_tokens": 1024
}
Response:
{
  "answer": "The three product tiers are...",
  "retrieval_time": 0.45,
  "generation_time": 1.23,
  "total_time": 1.68,
  "sources": [
    {
      "content": "...",
      "metadata": {...}
    }
  ],
  "success": true
}
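For callers that cannot import the bundled client, the request above can be sketched with only the Python standard library. The function names `build_query_request` and `post_query` are hypothetical and not part of the project; the defaults mirror the documented `top_k` and `max_tokens` values.

```python
import json
import urllib.request


def build_query_request(base_url, question, top_k=5, max_tokens=1024):
    """Build the POST /query request described above (hypothetical helper)."""
    payload = json.dumps(
        {"question": question, "top_k": top_k, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_query(base_url, question, top_k=5, max_tokens=1024, timeout=10.0):
    """Send the request and return the decoded JSON body as a dict."""
    request = build_query_request(base_url, question, top_k, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `post_query("https://your-api-url.modal.run", "What are the three product tiers?")` should return a dict shaped like the response shown above, assuming the deployment is reachable. The 10-second timeout is an arbitrary choice for this sketch.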
Performance Optimization
Target: <3 Second Responses
The API is optimized for fast responses:
- Warm Containers: min_containers=1 keeps a container ready
- Optimized LLM: Reduced max_tokens (1024 vs 1536)
- Limited Context: Top 3 documents, 800 chars each
- Prefix Caching: Enabled for faster generation
- Concurrent Requests: Up to 10 concurrent requests
Response Time Breakdown
- Retrieval: 0.3-0.8 seconds
- Generation: 1.0-2.0 seconds
- Total: 1.5-3.0 seconds (target: <3s)
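To see where time goes in a single response, the three timing fields can be compared directly. This is a minimal sketch; `timing_breakdown` is a hypothetical helper, and treating the remainder as "overhead" is an assumption about what `total_time` covers beyond retrieval and generation.

```python
def timing_breakdown(response):
    """Split total_time into retrieval, generation, and the remaining overhead."""
    retrieval = float(response["retrieval_time"])
    generation = float(response["generation_time"])
    total = float(response["total_time"])
    # Whatever is not retrieval or generation is reported as overhead;
    # rounding hides floating-point noise, and the value is clamped at zero.
    overhead = max(0.0, round(total - retrieval - generation, 3))
    return {
        "retrieval": retrieval,
        "generation": generation,
        "overhead": overhead,
        "total": total,
        "within_target": total < 3.0,  # the documented <3s target
    }
```

Passing the example query response from this document reports no overhead and `within_target` as true.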
Usage Examples
Python Client
from src.rag.api_client import RAGAPIClient
# Initialize
client = RAGAPIClient(base_url="https://your-api-url.modal.run")
# Health check
health = client.health_check()
print(health)
# Query
result = client.query("What are the premium ranges?")
print(result['answer'])
# Fast query (optimized for speed)
result = client.query_fast("What are the three tiers?")
print(result['answer'])
cURL
# Health check
curl https://your-api-url.modal.run/health
# Query
curl -X POST https://your-api-url.modal.run/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the three product tiers?",
    "top_k": 5,
    "max_tokens": 1024
  }'
JavaScript/TypeScript
const response = await fetch('https://your-api-url.modal.run/query', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    question: 'What are the three product tiers?',
    top_k: 5,
    max_tokens: 1024
  })
});
const data = await response.json();
console.log(data.answer);
Configuration
Environment Variables
- MODAL_APP_NAME: App name (default: "insurance-rag-api")
- MODAL_VOLUME_NAME: Volume name (default: "mcp-hack-ins-products")
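A script that needs the same settings could read these variables with the documented defaults as fallbacks. This is a sketch only: `load_modal_settings` is a hypothetical helper, and how `rag_api.py` itself reads the variables is not shown in this document.

```python
import os


def load_modal_settings(environ=None):
    """Return the app and volume names, falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    return {
        "app_name": env.get("MODAL_APP_NAME", "insurance-rag-api"),
        "volume_name": env.get("MODAL_VOLUME_NAME", "mcp-hack-ins-products"),
    }
```

Passing a plain dict instead of `os.environ` makes the lookup easy to exercise in tests.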
API Parameters
- question (required): The question to ask
- top_k (optional, default: 5): Number of documents to retrieve
- max_tokens (optional, default: 1024): Maximum response length
Performance Tips
- Use Fast Query: For speed-critical applications, use the query_fast() method
- Reduce top_k: Lower top_k (e.g., 3) for faster retrieval
- Reduce max_tokens: Lower max_tokens (e.g., 512) for faster generation
- Cache Results: Cache common queries client-side
- Batch Requests: If possible, batch multiple queries
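The client-side caching tip can be sketched as a small wrapper around any query callable. `make_cached_query` is a hypothetical name, and the sketch assumes the wrapped callable accepts `top_k` and `max_tokens` keyword arguments, which this document does not confirm for `client.query`.

```python
def make_cached_query(query_fn):
    """Wrap query_fn so repeated identical questions are answered from memory."""
    cache = {}

    def cached(question, top_k=5, max_tokens=1024):
        # Normalise the question so trivial differences share one cache entry.
        key = (question.strip().lower(), top_k, max_tokens)
        if key not in cache:
            result = query_fn(question, top_k=top_k, max_tokens=max_tokens)
            if not result.get("success"):
                return result  # failures are returned but never cached
            cache[key] = result
        return cache[key]

    return cached
```

The cache here is unbounded and never expires, which is only reasonable for a small set of common questions; a long-running service would want a size limit or a time-to-live.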
Error Handling
result = client.query("your question")
if result.get("success"):
    print(result['answer'])
else:
    print(f"Error: {result.get('error', 'Unknown error')}")
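In larger programs it can be easier to turn a failed result into an exception than to check `success` at every call site. The sketch below assumes the `success`, `answer`, and `error` keys used above; `RAGQueryError` and `answer_or_raise` are hypothetical names, not part of the client.

```python
class RAGQueryError(RuntimeError):
    """Raised when a query result reports success == False (hypothetical)."""


def answer_or_raise(result):
    """Return the answer text, or raise with the error message from the result."""
    if result.get("success"):
        return result["answer"]
    raise RAGQueryError(result.get("error", "Unknown error"))
```

A caller would then write `answer = answer_or_raise(client.query("your question"))` and handle `RAGQueryError` wherever errors are reported.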
Monitoring
Response Times
Monitor the total_time field in responses:
- < 2s: Excellent
- 2-3s: Good (target)
- > 3s: May need optimization
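These bands can be applied to logged responses with a small function. This is a sketch: `classify_total_time` is a hypothetical name, and counting exactly 3.0 seconds as "Good" is an assumption, since the bands above do not say which side the boundary falls on.

```python
def classify_total_time(total_time):
    """Map a total_time value in seconds to the monitoring bands listed above."""
    if total_time < 2.0:
        return "Excellent"
    if total_time <= 3.0:
        return "Good (target)"
    return "May need optimization"
```

For the example response in this document, `classify_total_time(1.68)` falls in the first band.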
Health Monitoring
health = client.health_check()
if health.get("status") != "healthy":
    # Handle unhealthy state
    pass
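One way to handle an unhealthy state, for example just after a deploy, is to poll the health check a few times before giving up. The sketch below takes any zero-argument callable that returns the health dict, such as `client.health_check`; the name `wait_until_healthy`, the attempt count, and the delay are all assumptions.

```python
import time


def wait_until_healthy(health_check, attempts=5, delay=2.0, sleep=time.sleep):
    """Poll health_check until it reports "healthy" or the attempts run out."""
    for attempt in range(attempts):
        try:
            if health_check().get("status") == "healthy":
                return True
        except Exception:
            # A failed call (e.g. a connection error) counts as unhealthy.
            pass
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next attempt, but not after the last
    return False
```

The `sleep` parameter exists only so the delay can be replaced in tests.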
Deployment
Modal Deployment
# Deploy
modal deploy src/rag/rag_api.py
# Get URL
modal app list
Local Testing
# Run locally (for development)
modal serve src/rag/rag_api.py
Rate Limiting
The API supports up to 10 concurrent requests. For higher throughput:
- Deploy multiple instances
- Use a load balancer
- Implement client-side rate limiting
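On the client side, the simplest guard is to cap how many requests are in flight at once so a batch of questions never exceeds the 10-request limit. Note that this sketch limits concurrency, not requests per second. `run_queries` is a hypothetical helper, and `query_fn` stands in for any callable that takes a question and returns a result, such as `client.query`.

```python
from concurrent.futures import ThreadPoolExecutor


def run_queries(query_fn, questions, max_in_flight=10):
    """Run query_fn over questions with at most max_in_flight calls active at once.

    Results are returned in the same order as the input questions.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(query_fn, questions))
```

If several processes share one deployment, each would need a lower `max_in_flight` so that the combined load stays within the limit.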
Security
- Add authentication if needed
- Use HTTPS in production
- Implement rate limiting
- Validate input questions
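Input validation can start with a few basic checks before a question is sent to the API. This is a minimal sketch: `validate_question` is a hypothetical helper, and the 500-character limit is an arbitrary assumption rather than a documented constraint.

```python
def validate_question(question, max_length=500):
    """Return a cleaned question string, or raise if it is unusable."""
    if not isinstance(question, str):
        raise TypeError("question must be a string")
    # Collapse runs of whitespace, including newlines, into single spaces.
    cleaned = " ".join(question.split())
    if not cleaned:
        raise ValueError("question must not be empty")
    if len(cleaned) > max_length:
        raise ValueError(f"question must be at most {max_length} characters")
    return cleaned
```

Checks like these belong on the server as well, since client-side validation can be bypassed.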
Troubleshooting
Slow Responses (>3s)
- Check if container is warm (min_containers=1)
- Reduce max_tokens
- Reduce top_k
- Check network latency
Errors
- Verify documents are indexed
- Check Modal app status
- Review error messages in response