# Python Backend with Thematic AI Word Generation

This is the Python implementation of the crossword generator backend, featuring AI-powered thematic word generation using WordFreq vocabulary and semantic embeddings.

## πŸš€ Features

- **Thematic Word Generation**: Uses sentence-transformers for semantic word discovery from WordFreq vocabulary
- **319K+ Word Database**: Comprehensive vocabulary from WordFreq with frequency data
- **10-Tier Difficulty System**: Smart word selection based on frequency tiers
- **Environment Variable Configuration**: Flexible cache and model configuration
- **FastAPI**: Modern, fast Python web framework
- **Same API**: Compatible with existing React frontend
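
The 10-tier difficulty system can be illustrated with Zipf frequency scores (as reported by the `wordfreq` library). This is a minimal sketch; the cut-offs below are illustrative assumptions, not the actual tier boundaries used by the service:

```python
def frequency_tier(zipf_score: float) -> int:
    """Map a Zipf frequency score (roughly 0-8) to a difficulty tier:
    tier 1 = most common words, tier 10 = rarest."""
    cutoffs = [7.0, 6.0, 5.5, 5.0, 4.5, 4.0, 3.5, 3.0, 2.5]  # illustrative boundaries
    for tier, cutoff in enumerate(cutoffs, start=1):
        if zipf_score >= cutoff:
            return tier
    return 10
```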

## πŸ”„ Differences from JavaScript Backend

| Feature | JavaScript Backend | Python Backend |
|---------|-------------------|----------------|
| **Word Generation** | Static word lists | Thematic AI word generation from 319K vocabulary |
| **Vocabulary Size** | ~100 words per topic | Filtered from 319K WordFreq database |
| **AI Approach** | Basic filtering | Semantic similarity with frequency tiers |
| **Performance** | Fast but limited | Slower startup, richer word selection |
| **Dependencies** | Node.js + static files | Python + ML libraries |

## πŸ› οΈ Setup & Installation

### Prerequisites
- Python 3.11+ (3.11 recommended for Docker compatibility)
- pip (Python package manager)

### Basic Setup (Core Functionality)
```bash
# Clone and navigate to backend directory
cd crossword-app/backend-py

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install core dependencies
pip install -r requirements.txt

# Start the server
python app.py
```

### Full Development Setup (with AI features)
```bash
# Install development dependencies including AI/ML libraries
pip install -r requirements-dev.txt

# This includes:
# - All core dependencies
# - AI/ML libraries (torch, sentence-transformers, etc.)
# - Development tools (pytest, coverage, etc.)
```

### Requirements Files
- **`requirements.txt`**: Core dependencies for basic functionality
- **`requirements-dev.txt`**: Full development environment with AI features

> **Note**: The AI/ML dependencies are large (~2GB). For basic testing without AI features, use `requirements.txt` only.

> **Python Version**: Both local development and Docker use Python 3.11+ for optimal performance and latest package compatibility.

## πŸ“ Structure

```
backend-py/
β”œβ”€β”€ app.py                          # FastAPI application entry point
β”œβ”€β”€ requirements.txt                # Core Python dependencies
β”œβ”€β”€ requirements-dev.txt            # Full development dependencies
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ services/
β”‚   β”‚   β”œβ”€β”€ thematic_word_service.py    # Thematic AI word generation
β”‚   β”‚   β”œβ”€β”€ crossword_generator.py      # Puzzle generation logic
β”‚   β”‚   └── crossword_generator_wrapper.py  # Service wrapper
β”‚   └── routes/
β”‚       └── api.py                      # API endpoints (matches JS backend)
β”œβ”€β”€ test-unit/                      # Unit tests (pytest framework) - 5 files
β”‚   β”œβ”€β”€ test_crossword_generator.py
β”‚   β”œβ”€β”€ test_api_routes.py
β”‚   └── test_vector_search.py       # (+ 2 more test files)
β”œβ”€β”€ test-integration/               # Integration tests (standalone scripts) - 16 files
β”‚   β”œβ”€β”€ test_simple_generation.py
β”‚   β”œβ”€β”€ test_boundary_fix.py
β”‚   └── test_local.py               # (+ 13 more test files)
β”œβ”€β”€ data/ -> ../backend/data/       # Symlink to shared word data
└── public/                         # Frontend static files (copied during build)
```

## πŸ›  Dependencies

### Core ML Stack
- `sentence-transformers`: Local model loading and embeddings
- `wordfreq`: 319K word vocabulary with frequency data
- `torch`: PyTorch for model inference
- `scikit-learn`: Cosine similarity and clustering
- `numpy`: Vector operations

### Web Framework
- `fastapi`: Modern Python web framework
- `uvicorn`: ASGI server
- `pydantic`: Data validation

### Testing
- `pytest`: Testing framework
- `pytest-asyncio`: Async test support

## πŸ§ͺ Testing

### πŸ“ Test Organization (Reorganized for Clarity)

**We've reorganized the test structure for better developer experience:**

| Test Type | Location | Purpose | Framework | Count |
|-----------|----------|---------|-----------|-------|
| **Unit Tests** | `test-unit/` | Test individual components in isolation | pytest | 5 files |
| **Integration Tests** | `test-integration/` | Test complete workflows end-to-end | Standalone scripts | 16 files |

**Benefits of this structure:**
- βœ… **Clear separation** between unit and integration testing
- βœ… **Intuitive naming** - developers immediately understand test types
- βœ… **Better tooling** - can run different test types independently
- βœ… **Easier maintenance** - organized by testing strategy

> **Note**: Previously tests were mixed in `tests/` folder and root-level `test_*.py` files. The new structure provides much better organization.

### Unit Tests Details (`test-unit/`)

**What they test:** Individual components with mocking and isolation
- `test_crossword_generator.py` - Core crossword generation logic
- `test_api_routes.py` - FastAPI endpoint handlers  
- `test_crossword_generator_wrapper.py` - Service wrapper layer
- `test_index_bug_fix.py` - Specific bug fix validations
- `test_vector_search.py` - AI vector search functionality (requires torch)

### Run Unit Tests (Formal Test Suite)
```bash
# Run all unit tests
python run_tests.py

# Run specific test modules  
python run_tests.py crossword_generator
pytest test-unit/test_crossword_generator.py -v

# Run core tests (excluding AI dependencies)
pytest test-unit/ -v --ignore=test-unit/test_vector_search.py

# Run individual unit test classes
pytest test-unit/test_crossword_generator.py::TestCrosswordGenerator::test_init -v
```

### Integration Tests Details (`test-integration/`)

**What they test:** Complete workflows without mocking - real functionality
- `test_simple_generation.py` - End-to-end crossword generation
- `test_boundary_fix.py` - Word boundary validation (our major fix!)
- `test_local.py` - Local environment and dependencies
- `test_word_boundaries.py` - Comprehensive boundary testing
- `test_bounds_comprehensive.py` - Advanced bounds checking
- `test_final_validation.py` - API integration testing
- And 10 more specialized feature tests...

### Run Integration Tests (End-to-End Scripts)
```bash
# Test core functionality
python test-integration/test_simple_generation.py
python test-integration/test_boundary_fix.py
python test-integration/test_local.py

# Test specific features
python test-integration/test_word_boundaries.py
python test-integration/test_bounds_comprehensive.py

# Test API integration
python test-integration/test_final_validation.py
```

### Test Coverage
```bash
# Run core tests with coverage (requires requirements-dev.txt)
pytest test-unit/test_crossword_generator.py --cov=src --cov-report=html
pytest test-unit/test_crossword_generator.py --cov=src --cov-report=term

# Full coverage report (may fail without AI dependencies)
pytest test-unit/ --cov=src --cov-report=html --ignore=test-unit/test_vector_search.py
```

### Test Status
- βœ… **Core crossword generation**: 15/19 unit tests passing
- βœ… **Boundary validation**: All integration tests passing
- ⚠️ **AI/Vector search**: Requires torch dependencies
- ⚠️ **Some async mocking**: Minor test infrastructure issues

### πŸ”„ Migration Guide (For Existing Developers)

**If you had previous commands, update them:**

| Old Command | New Command |
|-------------|-------------|
| `pytest tests/` | `pytest test-unit/` |
| `python test_simple_generation.py` | `python test-integration/test_simple_generation.py` |
| `pytest tests/ --cov=src` | `pytest test-unit/ --cov=src` |

**All functionality is preserved** - just organized better!

## πŸ”§ Configuration

### Environment Variables

The backend supports flexible configuration via environment variables:

```bash
# Cache Configuration
CACHE_DIR=/app/cache                        # Cache directory for all service files
THEMATIC_VOCAB_SIZE_LIMIT=50000            # Maximum vocabulary size (default: 100000)
THEMATIC_MODEL_NAME=all-mpnet-base-v2      # Sentence transformer model

# Core Application Settings  
PORT=7860                                  # Server port
NODE_ENV=production                        # Environment mode

# Optional
LOG_LEVEL=INFO                            # Logging level
```
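
As a minimal sketch, reading these variables on the service side might look like the following (hypothetical helper; the variable names and defaults mirror the table above):

```python
import os

def load_config() -> dict:
    """Read the documented environment variables, falling back to defaults."""
    return {
        "cache_dir": os.environ.get("CACHE_DIR", "/app/cache"),
        "vocab_size_limit": int(os.environ.get("THEMATIC_VOCAB_SIZE_LIMIT", "100000")),
        "model_name": os.environ.get("THEMATIC_MODEL_NAME", "all-mpnet-base-v2"),
        "port": int(os.environ.get("PORT", "7860")),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
    }
```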

### Cache Structure

The service creates the following cache files:

```
{CACHE_DIR}/
β”œβ”€β”€ vocabulary_{size}.pkl              # Processed vocabulary words
β”œβ”€β”€ frequencies_{size}.pkl             # Word frequency data
β”œβ”€β”€ embeddings_{model}_{size}.npy      # Word embeddings
└── sentence-transformers/             # Hugging Face model cache
```
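
A small helper reproducing the naming scheme above might look like this (hypothetical sketch; the actual service code may construct these paths differently):

```python
from pathlib import Path

def cache_paths(cache_dir: str, model_name: str, vocab_size: int) -> dict:
    """Build cache file paths following the documented naming scheme."""
    root = Path(cache_dir)
    return {
        "vocabulary": root / f"vocabulary_{vocab_size}.pkl",
        "frequencies": root / f"frequencies_{vocab_size}.pkl",
        "embeddings": root / f"embeddings_{model_name}_{vocab_size}.npy",
        "model_cache": root / "sentence-transformers",
    }
```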

## 🎯 Thematic Word Generation Process

1. **Initialization**:
   - Load WordFreq vocabulary database (319K words)
   - Filter words for crossword suitability (length, content)  
   - Load sentence-transformers model locally
   - Pre-compute embeddings for filtered vocabulary
   - Create 10-tier frequency classification system

2. **Word Generation**:
   - Get topic embedding: `"Animals" β†’ [768-dim vector]`
   - Compute cosine similarity with all vocabulary embeddings
   - Filter by similarity threshold and difficulty tier
   - Filter by crossword-specific criteria (length, etc.)
   - Return top matches with generated clues

3. **Multi-Theme Support**:
   - Detect multiple themes using clustering
   - Generate words that relate to combined themes
   - Balance word selection across different topics
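
Step 2 above can be sketched with plain cosine similarity. This toy, dependency-free version uses 2-dimensional vectors for clarity; the real service computes the same ranking over pre-computed 768-dim sentence-transformer embeddings:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def top_thematic_words(topic_vec, vocab, threshold=0.3, k=5):
    """vocab: list of (word, embedding) pairs.
    Return the top-k words whose similarity to the topic meets the threshold."""
    scored = [(word, cosine(topic_vec, emb)) for word, emb in vocab]
    matches = [(w, s) for w, s in scored if s >= threshold]
    return sorted(matches, key=lambda pair: -pair[1])[:k]

# Toy example: 2-d "embeddings" where the first axis stands in for "animal-ness".
# "cat" and "dog" rank highest; "rock" falls below the similarity threshold.
vocab = [("cat", [1.0, 0.1]), ("dog", [0.9, 0.2]), ("rock", [0.0, 1.0])]
print(top_thematic_words([1.0, 0.0], vocab))
```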

## πŸš€ Quick Smoke Test

```bash
# Local smoke test (without full vector search)
cd backend-py
python test-integration/test_local.py

# Start development server
python app.py
```

## 🐳 Container Deployment

### Docker Run with Cache Configuration

```bash
# Basic deployment
docker run -e CACHE_DIR=/app/cache \
           -e THEMATIC_VOCAB_SIZE_LIMIT=50000 \
           -v /host/cache:/app/cache \
           -p 7860:7860 \
           your-crossword-app

# With all configuration options
docker run -e CACHE_DIR=/app/cache \
           -e THEMATIC_VOCAB_SIZE_LIMIT=25000 \
           -e THEMATIC_MODEL_NAME=all-mpnet-base-v2 \
           -e NODE_ENV=production \
           -v /host/cache:/app/cache \
           -p 7860:7860 \
           your-crossword-app
```

### Docker Compose

```yaml
version: '3.8'
services:
  crossword-backend:
    image: your-crossword-app
    environment:
      - CACHE_DIR=/app/cache
      - THEMATIC_VOCAB_SIZE_LIMIT=50000
      - THEMATIC_MODEL_NAME=all-mpnet-base-v2
      - NODE_ENV=production
    volumes:
      - ./cache:/app/cache
    ports:
      - "7860:7860"
    restart: unless-stopped
```

### Pre-built Cache Strategy (Recommended)

For production deployments, pre-build the cache to avoid long startup times:

```bash
# 1. Build cache locally or in a build container
export CACHE_DIR=/local/cache
export THEMATIC_VOCAB_SIZE_LIMIT=50000
python -c "from src.services.thematic_word_service import ThematicWordService; s=ThematicWordService(); s.initialize()"

# 2. Deploy with pre-built cache (read-only mount)
docker run -e CACHE_DIR=/app/cache \
           -v /local/cache:/app/cache:ro \
           -p 7860:7860 \
           your-crossword-app
```

### Debugging Cache Issues

If cache files are not being created in your container:

1. **Check Health Endpoints:**
```bash
# Basic health check
curl http://localhost:7860/api/health

# Detailed cache status
curl http://localhost:7860/api/health/cache

# Force cache re-initialization
curl -X POST http://localhost:7860/api/health/cache/reinitialize
```

2. **Check Container Logs:**
```bash
docker logs your-container-name
```
Look for cache directory permissions and initialization messages.

3. **Test Cache Directory:**
```bash
# Run test script to verify cache setup
docker exec your-container python test_cache_startup.py
```

4. **Common Issues:**
   - **Permission denied**: Container user can't write to mounted volume
   - **Missing dependencies**: ML libraries not installed in container
   - **Volume not mounted**: Cache directory not properly mounted
   - **Environment variables**: `CACHE_DIR` not set correctly

5. **Fix Permission Issues:**
```bash
# Option 1: Change ownership of host directory
sudo chown -R 1000:1000 /host/cache

# Option 2: Run container with specific user
docker run --user 1000:1000 ...

# Option 3: Set permissions in Dockerfile
RUN mkdir -p /app/cache && chmod 777 /app/cache
```

### Kubernetes Deployment

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: crossword-config
data:
  CACHE_DIR: "/app/cache"
  THEMATIC_VOCAB_SIZE_LIMIT: "50000"
  THEMATIC_MODEL_NAME: "all-mpnet-base-v2"
  NODE_ENV: "production"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: crossword-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crossword-backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: crossword-backend
  template:
    metadata:
      labels:
        app: crossword-backend
    spec:
      containers:
      - name: backend
        image: your-crossword-app
        envFrom:
        - configMapRef:
            name: crossword-config
        volumeMounts:
        - name: cache-volume
          mountPath: /app/cache
        ports:
        - containerPort: 7860
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: crossword-cache
```

## πŸ§ͺ Test Suite Reference

### Quick Test
```bash
# Basic functionality test (no model download)
python test-integration/test_local.py
```

### Comprehensive Unit Tests
```bash
# Run all unit tests
python run_tests.py

# Or use pytest directly
pytest test-unit/ -v

# Run specific test file
python run_tests.py crossword_generator
pytest test-unit/test_crossword_generator.py -v

# Run with coverage
pytest test-unit/ --cov=src --cov-report=html
```

### Test Structure
- `test-unit/test_crossword_generator.py` - Core grid generation logic
- `test-unit/test_vector_search.py` - Vector similarity search
- `test-unit/test_crossword_generator_wrapper.py` - Service wrapper
- `test-unit/test_api_routes.py` - FastAPI endpoints
### Key Test Features
- βœ… **Index alignment fix**: Tests the list index out of range bug fix
- βœ… **Mocked vector search**: Tests without downloading models
- βœ… **API validation**: Tests all endpoints and error cases
- βœ… **Async support**: Full pytest-asyncio integration
- βœ… **Error handling**: Tests malformed inputs and edge cases

## πŸ“Š Performance Comparison

**Startup Time**:
- JavaScript: ~2 seconds
- Python: ~30-60 seconds (model download + embedding generation)
- Python (with cache): ~5-10 seconds

**Word Quality**:
- JavaScript: Limited by static word lists (~100 words/topic)
- Python: Rich thematic generation from 319K word database

**Memory Usage**:
- JavaScript: ~100MB
- Python: ~500MB-1GB (model + embeddings)
- Cache Size: ~50-200MB per 50K vocabulary

**API Response Time**:
- JavaScript: ~100ms (static word lookup)
- Python: ~200-500ms (semantic similarity computation)

**Cache Performance**:
- Vocabulary loading: ~1-2 seconds from cache vs 30+ seconds generation
- Embeddings loading: ~2-5 seconds from cache vs 60+ seconds generation

## πŸ”„ Migration Strategy

1. **Phase 1** βœ…: Basic Python backend structure
2. **Phase 2**: Test vector search functionality  
3. **Phase 3**: Docker deployment and production testing
4. **Phase 4**: Compare with JavaScript backend
5. **Phase 5**: Production switch with rollback capability

## 🎯 Next Steps

- [x] Replace vector search with thematic word generation
- [x] Implement environment variable cache configuration  
- [x] Add 10-tier difficulty system based on word frequency
- [ ] Optimize embedding computation performance
- [ ] Add more sophisticated crossword grid generation
- [ ] Implement LLM-based clue generation
- [ ] Add cache warming strategies for production deployment