# Unified AI Services
A comprehensive AI platform that integrates Named Entity Recognition (NER), Optical Character Recognition (OCR), and Retrieval-Augmented Generation (RAG) services into a unified application.
## 🌟 Features
### Core Services
- **NER Service** (Port 8500): Advanced named entity recognition with relationship extraction
- **OCR Service** (Port 8400): Document processing with Azure Document Intelligence
- **RAG Service** (Port 8401): Vector search and document retrieval
- **Unified App** (Port 8000): Coordinated workflows and service management
### Key Capabilities
- ✅ Multi-language support (Thai + English)
- ✅ Complex relationship extraction
- ✅ Entity deduplication
- ✅ Graph database exports (Neo4j, GraphML, GEXF)
- ✅ Vector search with semantic similarity
- ✅ Document processing (PDF, images, text)
- ✅ Real-time service health monitoring
- ✅ Unified workflows combining all services
- ✅ Comprehensive API documentation
## 🚀 Quick Start
### Prerequisites
- Python 3.8 or higher
- PostgreSQL with vector extension support
- Azure OpenAI account
- Azure Document Intelligence account
- DeepSeek API account (for advanced NER)
### Automated Setup
1. **Clone the repository, then navigate to the project directory**
```bash
cd unified-ai-services
```
2. **Run the automated setup**
```bash
python setup.py
```
This will:
- Check your Python environment
- Create necessary directories
- Help you configure the `.env` file
- Install dependencies
- Validate configuration
- Create startup scripts
3. **Start the unified application**
```bash
python app.py
```
Or use the generated scripts:
- Windows: `start_services.bat`
- Unix/Linux/Mac: `./start_services.sh`
4. **Run comprehensive tests**
```bash
python test_unified.py
```
Or use the generated scripts:
- Windows: `run_tests.bat`
- Unix/Linux/Mac: `./run_tests.sh`
### Manual Setup
If you prefer manual setup:
1. **Install dependencies**
```bash
pip install -r requirements.txt
```
2. **Create the `.env` file** (copy from `.env.example`)
```bash
cp .env.example .env
# Edit .env with your configuration
```
3. **Set up directories**
```bash
mkdir -p services exports logs temp tests data
```
4. **Place service files in the services directory**
```
services/
├── ner_service.py
├── ocr_service.py
└── rag_service.py
```
## 📁 Project Structure
```
unified-ai-services/
├── app.py                 # Main unified application
├── configs.py             # Centralized configuration
├── setup.py               # Automated setup script
├── requirements.txt       # Python dependencies
├── test_unified.py        # Comprehensive test suite
├── .env                   # Environment configuration
├── services/              # Individual service files
│   ├── ner_service.py     # NER service implementation
│   ├── ocr_service.py     # OCR service implementation
│   └── rag_service.py     # RAG service implementation
├── exports/               # Generated export files
├── logs/                  # Application logs
├── temp/                  # Temporary files
├── tests/                 # Additional test files
└── data/                  # Data files
```
## ⚙️ Configuration
### Environment Variables
The system uses a `.env` file for configuration. Key variables include:
#### Server Configuration
```bash
HOST=0.0.0.0
DEBUG=True
MAIN_PORT=8000
NER_PORT=8500
OCR_PORT=8400
RAG_PORT=8401
```
#### Database Configuration
```bash
POSTGRES_HOST=your-postgres-server.com
POSTGRES_PORT=5432
POSTGRES_USER=your-username
POSTGRES_PASSWORD=your-password
POSTGRES_DATABASE=postgres
```
#### Azure OpenAI Configuration
```bash
AZURE_OPENAI_ENDPOINT=https://your-openai.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
EMBEDDING_MODEL=text-embedding-3-large
```
#### DeepSeek Configuration
```bash
DEEPSEEK_ENDPOINT=https://your-deepseek-endpoint/
DEEPSEEK_API_KEY=your-deepseek-key
DEEPSEEK_MODEL=DeepSeek-R1-0528
```
#### Azure Document Intelligence Configuration
```bash
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-di.cognitiveservices.azure.com/
AZURE_DOCUMENT_INTELLIGENCE_KEY=your-di-key
```
#### Azure Storage Configuration
```bash
AZURE_STORAGE_ACCOUNT_URL=https://yourstorage.blob.core.windows.net/
AZURE_BLOB_SAS_TOKEN=your-sas-token
BLOB_CONTAINER=historylog
```
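The server variables above can be read at startup with the standard library alone. The following is a minimal sketch and is not taken from `configs.py`: `ServerSettings` and `load_server_settings` are hypothetical names, the defaults simply mirror the example values shown above, and it assumes the `.env` values have already been exported into the process environment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSettings:
    """Hypothetical container for the server variables listed above."""
    host: str
    debug: bool
    main_port: int
    ner_port: int
    ocr_port: int
    rag_port: int


def load_server_settings(env=None) -> ServerSettings:
    """Read server settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return ServerSettings(
        host=env.get("HOST", "0.0.0.0"),
        # Treat "true", "1", and "yes" (any case) as enabled.
        debug=env.get("DEBUG", "False").strip().lower() in {"true", "1", "yes"},
        main_port=int(env.get("MAIN_PORT", "8000")),
        ner_port=int(env.get("NER_PORT", "8500")),
        ocr_port=int(env.get("OCR_PORT", "8400")),
        rag_port=int(env.get("RAG_PORT", "8401")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.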
## 🔧 API Documentation
Once running, access the interactive API documentation:
- **Unified API**: http://localhost:8000/docs
- **NER Service**: http://localhost:8500/docs
- **OCR Service**: http://localhost:8400/docs
- **RAG Service**: http://localhost:8401/docs
## 🎯 API Usage Examples
### 1. Unified Analysis (Text + RAG Indexing)
```python
import httpx


async def unified_analysis():
    # Request payload: NER options plus RAG indexing metadata
    data = {
        "text": "Your text content here...",
        "extract_relationships": True,
        "include_embeddings": False,
        "generate_graph_files": True,
        "export_formats": ["neo4j", "json"],
        "enable_rag_indexing": True,
        "rag_title": "My Document",
        "rag_keywords": ["keyword1", "keyword2"]
    }
    async with httpx.AsyncClient() as client:
        response = await client.post("http://localhost:8000/analyze/unified", json=data)
        return response.json()
```
### 2. Combined Search with NER Analysis
```python
import httpx


async def combined_search():
    # Search parameters; include_ner_analysis also runs NER on the results
    data = {
        "query": "search query here",
        "limit": 10,
        "similarity_threshold": 0.2,
        "include_ner_analysis": True
    }
    async with httpx.AsyncClient() as client:
        response = await client.post("http://localhost:8000/search/combined", json=data)
        return response.json()
```
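The `similarity_threshold` field presumably drops weak matches from the result set. How the RAG service actually scores results is not documented here, so the sketch below assumes cosine similarity over embedding vectors. `cosine_similarity` and `filter_by_threshold` are hypothetical helpers for illustration and are not part of the service API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Define similarity with a zero vector as 0.
    return dot / (norm_a * norm_b)


def filter_by_threshold(query_vec, candidates, threshold=0.2, limit=10):
    """Keep candidates scoring at least `threshold`, best first, at most `limit`."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in candidates]
    kept = [(score, doc_id) for score, doc_id in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in kept[:limit]]
```

Under this assumption, a lower threshold returns more (and weaker) matches, while `limit` caps the result count after filtering.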
### 3. File Upload Analysis
```python
import httpx


async def analyze_file():
    # Form fields are sent as strings alongside the uploaded file
    data = {
        "extract_relationships": "true",
        "generate_graph_files": "true",
        "export_formats": "neo4j,json"
    }
    # Open the file in a context manager so the handle is closed after the upload
    with open("document.pdf", "rb") as f:
        files = {"file": ("document.pdf", f, "application/pdf")}
        async with httpx.AsyncClient() as client:
            response = await client.post("http://localhost:8000/ner/analyze/file", files=files, data=data)
            return response.json()
```
## 🧪 Testing
### Comprehensive Test Suite
The project includes comprehensive tests covering:
- ✅ Service health checks
- ✅ Individual service functionality
- ✅ Unified workflow testing
- ✅ Service proxy functionality
- ✅ Error handling and resilience
- ✅ Performance testing
- ✅ File upload/download testing
Run tests with:
```bash
python test_unified.py
```
### Individual Service Tests
Test individual services:
```bash
# Test NER service
python test_ner.py
# Test RAG service
python test_rag.py
```
### Quick Health Check
```bash
curl http://localhost:8000/health
```
## 🔍 Monitoring and Health Checks
### Health Endpoints
- **Unified System**: `GET /health`
- **Individual Services**: `GET /ner/health`, `GET /ocr/health`, `GET /rag/health`
- **Detailed Status**: `GET /status`
- **Service Discovery**: `GET /services`
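
The health endpoints above can be polled from a small script. The sketch below assumes each endpoint returns HTTP 200 when its service is up; the response body format is not documented here, so it is ignored. `check_health`, `http_status`, and the injectable `fetch_status` callable are hypothetical names.

```python
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"
HEALTH_PATHS = {
    "unified": "/health",
    "ner": "/ner/health",
    "ocr": "/ocr/health",
    "rag": "/rag/health",
}


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # Server answered, but with an error status.
    except (urllib.error.URLError, OSError):
        return None  # Connection refused, timeout, DNS failure, etc.


def check_health(base_url=BASE_URL, fetch_status=http_status):
    """Map each service name to True (HTTP 200) or False (anything else)."""
    return {
        name: fetch_status(base_url + path) == 200
        for name, path in HEALTH_PATHS.items()
    }
```

With the unified app running, `print(check_health())` gives a quick per-service overview; the `fetch_status` parameter exists so the function can be tested without a live server.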
### Monitoring Features
- Real-time service health monitoring
- Response time tracking
- Service uptime monitoring
- Error rate tracking
- Resource usage monitoring
## 📊 Service Architecture
```mermaid
graph TB
    Client[Client Applications]
    subgraph "Unified AI Services (Port 8000)"
        UA[Unified App]
        Proxy[Service Proxies]
        Health[Health Monitor]
    end
    subgraph "Core Services"
        NER[NER Service<br/>Port 8500]
        OCR[OCR Service<br/>Port 8400]
        RAG[RAG Service<br/>Port 8401]
    end
    subgraph "External Services"
        Azure[Azure Services]
        DeepSeek[DeepSeek API]
        DB[(PostgreSQL)]
    end
    Client --> UA
    UA --> Proxy
    Proxy --> NER
    Proxy --> OCR
    Proxy --> RAG
    NER --> Azure
    NER --> DeepSeek
    NER --> DB
    OCR --> Azure
    RAG --> Azure
    RAG --> DB
    RAG --> OCR
```
## 🛠️ Development
### Adding New Features
1. **Service Modifications**: Update individual service files in `services/`
2. **Unified Workflows**: Modify `app.py` for new combined workflows
3. **Configuration**: Update `configs.py` for new settings
4. **Tests**: Add tests to `test_unified.py`
### Debugging
1. **Check Service Logs**: Services log to console
2. **Health Checks**: Use `/health` endpoints
3. **Configuration**: Run `python configs.py` to validate
4. **Database**: Check PostgreSQL connectivity
5. **Azure Services**: Verify API keys and endpoints
### Service Management
Start individual services for development:
```bash
# Start NER service only
cd services && python ner_service.py
# Start OCR service only
cd services && python ocr_service.py
# Start RAG service only
cd services && python rag_service.py
```
## 🚨 Troubleshooting
### Common Issues
#### 1. Services Won't Start
- Check port availability: `netstat -an | grep :8000`
- Verify Python dependencies: `pip list`
- Check .env configuration: `python configs.py`
#### 2. Database Connection Issues
- Verify PostgreSQL is running
- Check the `POSTGRES_*` connection settings in `.env`
- Test connectivity: `python -c "import asyncio, asyncpg; asyncio.run(asyncpg.connect('your-connection-string'))"`
#### 3. Azure Service Issues
- Verify API keys and endpoints
- Check Azure service status
- Review rate limits and quotas
#### 4. Performance Issues
- Monitor resource usage: `top` or Task Manager
- Check database performance
- Review log files for errors
### Error Codes
- **500**: Internal service error
- **503**: Service unavailable
- **400**: Bad request (check input data)
- **422**: Validation error
- **404**: Endpoint not found
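
A client can use these codes to decide whether a retry is worthwhile. The sketch below shows one possible policy, not the project's own: it assumes 500 and 503 may be transient, while 400, 404, and 422 indicate a request that must be fixed before resending. `explain_status` and `should_retry` are hypothetical helpers.

```python
STATUS_HINTS = {
    400: "Bad request (check input data)",
    404: "Endpoint not found",
    422: "Validation error",
    500: "Internal service error",
    503: "Service unavailable",
}

# Assumption: only server-side failures are worth retrying.
RETRYABLE = {500, 503}


def explain_status(code):
    """Return the hint from the list above, or a generic message."""
    return STATUS_HINTS.get(code, f"Unexpected status {code}")


def should_retry(code, attempt, max_attempts=3):
    """Retry only retryable codes, and only while attempts remain."""
    return code in RETRYABLE and attempt < max_attempts
```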
## 📈 Performance Optimization
### Recommended Settings
#### Production Configuration
```bash
DEBUG=False
MAX_FILE_SIZE=50
REQUEST_TIMEOUT=300
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
```
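`CHUNK_SIZE` and `CHUNK_OVERLAP` suggest that documents are split into overlapping windows before indexing. Neither the unit (characters or tokens) nor the actual splitting logic is stated here; the sketch below assumes characters and a simple sliding window, and `chunk_text` is a hypothetical function.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # Distance between window starts.
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text.
    return chunks
```

Under these assumptions, a 2,500-character document with the settings above yields three chunks of 1,000, 1,000, and 900 characters, each sharing 200 characters with its predecessor.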
#### Database Optimization
- Use connection pooling
- Configure appropriate indexes
- Monitor query performance
- Schedule regular maintenance
#### Service Optimization
- Enable caching where appropriate
- Use async operations
- Optimize batch processing
- Monitor memory usage
## 🔐 Security Considerations
### API Security
- Implement authentication/authorization as needed
- Use HTTPS in production
- Validate all input data
- Apply rate limiting
### Data Security
- Secure database connections (SSL)
- Encrypt sensitive data
- Apply security updates regularly
- Monitor access logs
### Azure Security
- Rotate API keys regularly
- Use managed identities where possible
- Monitor usage and costs
- Follow Azure security best practices
## 📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Submit a pull request
## 📞 Support
For support and questions:
1. Check this README for common issues
2. Review the test suite for usage examples
3. Check service logs for error details
4. Verify configuration with `python configs.py`
## 🎯 Roadmap
### Current Version (1.0.0)
- ✅ Unified service integration
- ✅ Comprehensive testing
- ✅ Multi-language support
- ✅ Graph database exports
### Future Enhancements
- 🔄 Advanced caching mechanisms
- 🔄 Enhanced monitoring and analytics
- 🔄 Additional export formats
- 🔄 Improved error recovery
- 🔄 Performance optimizations
- 🔄 Additional language support