
Unified AI Services

A comprehensive AI platform that integrates Named Entity Recognition (NER), Optical Character Recognition (OCR), and Retrieval-Augmented Generation (RAG) services into a unified application.

🌟 Features

Core Services

  • NER Service (Port 8500): Advanced named entity recognition with relationship extraction
  • OCR Service (Port 8400): Document processing with Azure Document Intelligence
  • RAG Service (Port 8401): Vector search and document retrieval
  • Unified App (Port 8000): Coordinated workflows and service management

Key Capabilities

  • ✅ Multi-language support (Thai + English)
  • ✅ Complex relationship extraction
  • ✅ Entity deduplication
  • ✅ Graph database exports (Neo4j, GraphML, GEXF)
  • ✅ Vector search with semantic similarity
  • ✅ Document processing (PDF, images, text)
  • ✅ Real-time service health monitoring
  • ✅ Unified workflows combining all services
  • ✅ Comprehensive API documentation

🚀 Quick Start

Prerequisites

  • Python 3.8 or higher
  • PostgreSQL with vector extension support
  • Azure OpenAI account
  • Azure Document Intelligence account
  • DeepSeek API account (for advanced NER)

Automated Setup

  1. Clone and navigate to the project directory

    cd unified-ai-services
    
  2. Run the automated setup

    python setup.py
    

    This will:

    • Check your Python environment
    • Create necessary directories
    • Help you configure .env file
    • Install dependencies
    • Validate configuration
    • Create startup scripts
  3. Start the unified application

    python app.py
    

    Or use the generated scripts:

    • Windows: start_services.bat
    • Unix/Linux/Mac: ./start_services.sh
  4. Run comprehensive tests

    python test_unified.py
    

    Or use the generated scripts:

    • Windows: run_tests.bat
    • Unix/Linux/Mac: ./run_tests.sh

Manual Setup

If you prefer manual setup:

  1. Install dependencies

    pip install -r requirements.txt
    
  2. Create .env file (copy from .env.example)

    cp .env.example .env
    # Edit .env with your configuration
    
  3. Set up directories

    mkdir -p services exports logs temp tests data
    
  4. Place service files in the services directory

    services/
    ├── ner_service.py
    ├── ocr_service.py
    └── rag_service.py
    

πŸ“ Project Structure

unified-ai-services/
├── app.py                    # Main unified application
├── configs.py               # Centralized configuration
├── setup.py                 # Automated setup script
├── requirements.txt         # Python dependencies
├── test_unified.py          # Comprehensive test suite
├── .env                     # Environment configuration
├── services/                # Individual service files
│   ├── ner_service.py      # NER service implementation
│   ├── ocr_service.py      # OCR service implementation
│   └── rag_service.py      # RAG service implementation
├── exports/                 # Generated export files
├── logs/                    # Application logs
├── temp/                    # Temporary files
├── tests/                   # Additional test files
└── data/                    # Data files

βš™οΈ Configuration

Environment Variables

The system uses a .env file for configuration. Key variables include:

Server Configuration

HOST=0.0.0.0
DEBUG=True
MAIN_PORT=8000
NER_PORT=8500
OCR_PORT=8400
RAG_PORT=8401
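
As an illustration of how these port variables might be consumed, here is a minimal sketch that turns them into base URLs. The helper name `load_service_urls` and the `DEFAULT_PORTS` table are hypothetical and are not taken from configs.py; the defaults simply mirror the values listed above.

```python
import os

# Defaults mirror the values listed above (hypothetical table, not from configs.py).
DEFAULT_PORTS = {"MAIN_PORT": 8000, "NER_PORT": 8500, "OCR_PORT": 8400, "RAG_PORT": 8401}


def load_service_urls(env=None, host="localhost"):
    """Return a mapping such as {"ner": "http://localhost:8500"}."""
    env = os.environ if env is None else env
    urls = {}
    for key, default in DEFAULT_PORTS.items():
        port = int(env.get(key, default))    # .env values arrive as strings
        name = key[: -len("_PORT")].lower()  # "NER_PORT" -> "ner"
        urls[name] = f"http://{host}:{port}"
    return urls
```

Calling `load_service_urls()` with no arguments reads from the process environment, so it only sees values from .env if something has already loaded that file.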

Database Configuration

POSTGRES_HOST=your-postgres-server.com
POSTGRES_PORT=5432
POSTGRES_USER=your-username
POSTGRES_PASSWORD=your-password
POSTGRES_DATABASE=postgres
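
If a single connection string is needed (for example for the connectivity test in Troubleshooting), the variables above could be combined roughly as follows. This is a sketch only: `build_postgres_dsn` is a hypothetical helper, and the project may assemble its connection differently.

```python
import os
from urllib.parse import quote


def build_postgres_dsn(env=None):
    """Assemble a postgresql:// URL from the POSTGRES_* variables."""
    env = os.environ if env is None else env
    # Percent-encode credentials so characters like "@" or "/" do not break the URL.
    user = quote(env.get("POSTGRES_USER", ""), safe="")
    password = quote(env.get("POSTGRES_PASSWORD", ""), safe="")
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    database = env.get("POSTGRES_DATABASE", "postgres")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```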

Azure OpenAI Configuration

AZURE_OPENAI_ENDPOINT=https://your-openai.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
EMBEDDING_MODEL=text-embedding-3-large

DeepSeek Configuration

DEEPSEEK_ENDPOINT=https://your-deepseek-endpoint/
DEEPSEEK_API_KEY=your-deepseek-key
DEEPSEEK_MODEL=DeepSeek-R1-0528

Azure Document Intelligence Configuration

AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-di.cognitiveservices.azure.com/
AZURE_DOCUMENT_INTELLIGENCE_KEY=your-di-key

Azure Storage Configuration

AZURE_STORAGE_ACCOUNT_URL=https://yourstorage.blob.core.windows.net/
AZURE_BLOB_SAS_TOKEN=your-sas-token
BLOB_CONTAINER=historylog

🔧 API Documentation

Once running, access the interactive API documentation:

🎯 API Usage Examples

1. Unified Analysis (Text + RAG Indexing)

import httpx

async def unified_analysis():
    data = {
        "text": "Your text content here...",
        "extract_relationships": True,
        "include_embeddings": False,
        "generate_graph_files": True,
        "export_formats": ["neo4j", "json"],
        "enable_rag_indexing": True,
        "rag_title": "My Document",
        "rag_keywords": ["keyword1", "keyword2"]
    }
    
    async with httpx.AsyncClient() as client:
        response = await client.post("http://localhost:8000/analyze/unified", json=data)
        return response.json()

2. Combined Search with NER Analysis

async def combined_search():
    data = {
        "query": "search query here",
        "limit": 10,
        "similarity_threshold": 0.2,
        "include_ner_analysis": True
    }
    
    async with httpx.AsyncClient() as client:
        response = await client.post("http://localhost:8000/search/combined", json=data)
        return response.json()
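
To give a feel for what `limit` and `similarity_threshold` control, the sketch below filters candidate vectors by cosine similarity. It assumes the RAG service scores results with cosine similarity, which this README does not state; `cosine_similarity` and `filter_hits` are hypothetical names, and the real search runs in the database rather than in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_hits(query_vec, docs, similarity_threshold=0.2, limit=10):
    """Keep documents scoring at or above the threshold, best first, at most `limit`."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in docs.items()]
    kept = [hit for hit in scored if hit[1] >= similarity_threshold]
    kept.sort(key=lambda hit: hit[1], reverse=True)
    return kept[:limit]
```

A lower threshold returns more, weaker matches; a higher one trades recall for precision.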

3. File Upload Analysis

async def analyze_file():
    data = {
        "extract_relationships": "true",
        "generate_graph_files": "true",
        "export_formats": "neo4j,json"
    }
    
    # Open the file in a context manager so the handle is closed after the upload
    with open("document.pdf", "rb") as pdf:
        files = {"file": ("document.pdf", pdf, "application/pdf")}
        async with httpx.AsyncClient() as client:
            response = await client.post("http://localhost:8000/ner/analyze/file", files=files, data=data)
            return response.json()

🧪 Testing

Comprehensive Test Suite

The project includes comprehensive tests covering:

  • ✅ Service health checks
  • ✅ Individual service functionality
  • ✅ Unified workflow testing
  • ✅ Service proxy functionality
  • ✅ Error handling and resilience
  • ✅ Performance testing
  • ✅ File upload/download testing

Run tests with:

python test_unified.py

Individual Service Tests

Test individual services:

# Test NER service
python test_ner.py

# Test RAG service  
python test_rag.py

Quick Health Check

curl http://localhost:8000/health

πŸ” Monitoring and Health Checks

Health Endpoints

  • Unified System: GET /health
  • Individual Services: GET /ner/health, GET /ocr/health, GET /rag/health
  • Detailed Status: GET /status
  • Service Discovery: GET /services
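
The health endpoints above can be polled from a script. The following is a minimal standard-library sketch; it assumes the unified app is reachable at http://localhost:8000 and that a healthy endpoint answers with HTTP 200. `fetch_status` and `check_health` are hypothetical helpers, not part of the project.

```python
import urllib.error
import urllib.request

HEALTH_PATHS = ["/health", "/ner/health", "/ocr/health", "/rag/health"]


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a status code worth reporting.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def check_health(base_url="http://localhost:8000", fetch=fetch_status):
    """Map each health path to True when it answers with HTTP 200."""
    return {path: fetch(base_url + path) == 200 for path in HEALTH_PATHS}
```

The `fetch` parameter exists so the check can be exercised without a running server, for example by passing a stub that returns fixed status codes.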

Monitoring Features

  • Real-time service health monitoring
  • Response time tracking
  • Service uptime monitoring
  • Error rate tracking
  • Resource usage monitoring

📊 Service Architecture

graph TB
    Client[Client Applications]
    
    subgraph "Unified AI Services (Port 8000)"
        UA[Unified App]
        Proxy[Service Proxies]
        Health[Health Monitor]
    end
    
    subgraph "Core Services"
        NER[NER Service<br/>Port 8500]
        OCR[OCR Service<br/>Port 8400]
        RAG[RAG Service<br/>Port 8401]
    end
    
    subgraph "External Services"
        Azure[Azure Services]
        DeepSeek[DeepSeek API]
        DB[(PostgreSQL)]
    end
    
    Client --> UA
    UA --> Proxy
    Proxy --> NER
    Proxy --> OCR
    Proxy --> RAG
    
    NER --> Azure
    NER --> DeepSeek
    NER --> DB
    
    OCR --> Azure
    
    RAG --> Azure
    RAG --> DB
    RAG --> OCR
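
The proxy layer in the diagram can be pictured as a prefix lookup: requests under /ner, /ocr, or /rag go to the matching service port. The sketch below is an assumption about that behaviour, in particular that the prefix is stripped before forwarding; `resolve_upstream` is a hypothetical name and app.py may route differently.

```python
# Service prefixes and ports as listed under Core Services.
UPSTREAMS = {"ner": 8500, "ocr": 8400, "rag": 8401}


def resolve_upstream(path, host="localhost"):
    """Map '/ner/health' to 'http://localhost:8500/health'.

    Returns None when the first path segment is not a known service prefix,
    meaning the unified app would handle the request itself.
    """
    prefix, _, rest = path.lstrip("/").partition("/")
    port = UPSTREAMS.get(prefix)
    if port is None:
        return None
    return f"http://{host}:{port}/{rest}"
```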

πŸ› οΈ Development

Adding New Features

  1. Service Modifications: Update individual service files in services/
  2. Unified Workflows: Modify app.py for new combined workflows
  3. Configuration: Update configs.py for new settings
  4. Tests: Add tests to test_unified.py

Debugging

  1. Check Service Logs: Services log to console
  2. Health Checks: Use /health endpoints
  3. Configuration: Run python configs.py to validate
  4. Database: Check PostgreSQL connectivity
  5. Azure Services: Verify API keys and endpoints

Service Management

Start individual services for development:

# Start NER service only
cd services && python ner_service.py

# Start OCR service only  
cd services && python ocr_service.py

# Start RAG service only
cd services && python rag_service.py

🚨 Troubleshooting

Common Issues

1. Services Won't Start

  • Check port availability: netstat -an | grep :8000
  • Verify Python dependencies: pip list
  • Check .env configuration: python configs.py
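
On systems where netstat or grep is unavailable (for example a default Windows shell), the port check can be done from Python instead. A minimal sketch, with `is_port_in_use` as a hypothetical helper:

```python
import socket


def is_port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

Running it for 8000, 8500, 8400, and 8401 before starting the application shows whether another process already holds one of the ports.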

2. Database Connection Issues

  • Verify PostgreSQL is running
  • Check connection string in .env
  • Test connectivity: python -c "import asyncio, asyncpg; asyncio.run(asyncpg.connect('your-connection-string'))"

3. Azure Service Issues

  • Verify API keys and endpoints
  • Check Azure service status
  • Review rate limits and quotas

4. Performance Issues

  • Monitor resource usage: top or Task Manager
  • Check database performance
  • Review log files for errors

Error Codes

  • 500: Internal service error
  • 503: Service unavailable
  • 400: Bad request (check input data)
  • 422: Validation error
  • 404: Endpoint not found
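
Client code can turn these status codes into readable messages. The sketch below uses only the codes and descriptions listed above; `describe_status` is a hypothetical helper, and anything outside the list is reported as unexpected rather than guessed at.

```python
ERROR_HINTS = {
    400: "Bad request (check input data)",
    404: "Endpoint not found",
    422: "Validation error",
    500: "Internal service error",
    503: "Service unavailable",
}


def describe_status(status_code):
    """Return a short human-readable description for an HTTP status code."""
    if 200 <= status_code < 300:
        return "OK"
    return ERROR_HINTS.get(status_code, f"Unexpected status {status_code}")
```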

📈 Performance Optimization

Recommended Settings

Production Configuration

DEBUG=False
MAX_FILE_SIZE=50
REQUEST_TIMEOUT=300
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
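
To illustrate how CHUNK_SIZE and CHUNK_OVERLAP interact, here is a sketch of fixed-size chunking with overlap. It assumes both values are measured in characters, which this README does not specify (the RAG service may count tokens instead); `chunk_text` is a hypothetical function.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the settings above, each chunk starts 800 characters after the previous one, so a larger overlap means more chunks to embed and store for the same document.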

Database Optimization

  • Use connection pooling
  • Configure appropriate indexes
  • Monitor query performance
  • Regular maintenance

Service Optimization

  • Enable caching where appropriate
  • Use async operations
  • Optimize batch processing
  • Monitor memory usage

πŸ” Security Considerations

API Security

  • Implement authentication/authorization as needed
  • Use HTTPS in production
  • Validate all input data
  • Rate limiting

Data Security

  • Secure database connections (SSL)
  • Encrypt sensitive data
  • Regular security updates
  • Monitor access logs

Azure Security

  • Rotate API keys regularly
  • Use managed identities where possible
  • Monitor usage and costs
  • Follow Azure security best practices

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite
  6. Submit a pull request

📞 Support

For support and questions:

  1. Check this README for common issues
  2. Review the test suite for usage examples
  3. Check service logs for error details
  4. Verify configuration with python configs.py

🎯 Roadmap

Current Version (1.0.0)

  • ✅ Unified service integration
  • ✅ Comprehensive testing
  • ✅ Multi-language support
  • ✅ Graph database exports

Future Enhancements

  • 🔄 Advanced caching mechanisms
  • 🔄 Enhanced monitoring and analytics
  • 🔄 Additional export formats
  • 🔄 Improved error recovery
  • 🔄 Performance optimizations
  • 🔄 Additional language support