
Python Backend with Thematic AI Word Generation

This is the Python implementation of the crossword generator backend, featuring AI-powered thematic word generation using WordFreq vocabulary and semantic embeddings.

🚀 Features

  • Thematic Word Generation: Uses sentence-transformers for semantic word discovery from WordFreq vocabulary
  • 319K+ Word Database: Comprehensive vocabulary from WordFreq with frequency data
  • 10-Tier Difficulty System: Smart word selection based on frequency tiers
  • Environment Variable Configuration: Flexible cache and model configuration
  • FastAPI: Modern, fast Python web framework
  • Same API: Compatible with existing React frontend

🔄 Differences from JavaScript Backend

Feature | JavaScript Backend | Python Backend
------- | ------------------ | --------------
Word Generation | Static word lists | Thematic AI word generation from 319K vocabulary
Vocabulary Size | ~100 words per topic | Filtered from 319K WordFreq database
AI Approach | Basic filtering | Semantic similarity with frequency tiers
Performance | Fast but limited | Slower startup, richer word selection
Dependencies | Node.js + static files | Python + ML libraries

🛠️ Setup & Installation

Prerequisites

  • Python 3.11+ (3.11 recommended for Docker compatibility)
  • pip (Python package manager)

Basic Setup (Core Functionality)

# Clone and navigate to backend directory
cd crossword-app/backend-py

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install core dependencies
pip install -r requirements.txt

# Start the server
python app.py

Full Development Setup (with AI features)

# Install development dependencies including AI/ML libraries
pip install -r requirements-dev.txt

# This includes:
# - All core dependencies
# - AI/ML libraries (torch, sentence-transformers, etc.)
# - Development tools (pytest, coverage, etc.)

Requirements Files

  • requirements.txt: Core dependencies for basic functionality
  • requirements-dev.txt: Full development environment with AI features

Note: The AI/ML dependencies are large (~2GB). For basic testing without AI features, use requirements.txt only.

Python Version: Both local development and Docker use Python 3.11+ for optimal performance and latest package compatibility.

πŸ“ Structure

backend-py/
├── app.py                          # FastAPI application entry point
├── requirements.txt                # Core Python dependencies
├── requirements-dev.txt            # Full development dependencies
├── src/
│   ├── services/
│   │   ├── thematic_word_service.py    # Thematic AI word generation
│   │   ├── crossword_generator.py      # Puzzle generation logic
│   │   └── crossword_generator_wrapper.py  # Service wrapper
│   └── routes/
│       └── api.py                      # API endpoints (matches JS backend)
├── test-unit/                      # Unit tests (pytest framework) - 5 files
│   ├── test_crossword_generator.py
│   ├── test_api_routes.py
│   └── test_vector_search.py
├── test-integration/               # Integration tests (standalone scripts) - 16 files
│   ├── test_simple_generation.py
│   ├── test_boundary_fix.py
│   └── test_local.py               # (+ 13 more test files)
├── data/ -> ../backend/data/       # Symlink to shared word data
└── public/                         # Frontend static files (copied during build)

🛠 Dependencies

Core ML Stack

  • sentence-transformers: Local model loading and embeddings
  • wordfreq: 319K word vocabulary with frequency data
  • torch: PyTorch for model inference
  • scikit-learn: Cosine similarity and clustering
  • numpy: Vector operations

Web Framework

  • fastapi: Modern Python web framework
  • uvicorn: ASGI server
  • pydantic: Data validation

Testing

  • pytest: Testing framework
  • pytest-asyncio: Async test support

🧪 Testing

πŸ“ Test Organization (Reorganized for Clarity)

We've reorganized the test structure for better developer experience:

Test Type | Location | Purpose | Framework | Count
--------- | -------- | ------- | --------- | -----
Unit Tests | test-unit/ | Test individual components in isolation | pytest | 5 files
Integration Tests | test-integration/ | Test complete workflows end-to-end | Standalone scripts | 16 files

Benefits of this structure:

  • βœ… Clear separation between unit and integration testing
  • βœ… Intuitive naming - developers immediately understand test types
  • βœ… Better tooling - can run different test types independently
  • βœ… Easier maintenance - organized by testing strategy

Note: Tests were previously mixed between a tests/ folder and root-level test_*.py files; the new structure keeps them organized by testing strategy.

Unit Tests Details (test-unit/)

What they test: Individual components with mocking and isolation

  • test_crossword_generator.py - Core crossword generation logic
  • test_api_routes.py - FastAPI endpoint handlers
  • test_crossword_generator_wrapper.py - Service wrapper layer
  • test_index_bug_fix.py - Specific bug fix validations
  • test_vector_search.py - AI vector search functionality (requires torch)

Run Unit Tests (Formal Test Suite)

# Run all unit tests
python run_tests.py

# Run specific test modules  
python run_tests.py crossword_generator
pytest test-unit/test_crossword_generator.py -v

# Run core tests (excluding AI dependencies)
pytest test-unit/ -v --ignore=test-unit/test_vector_search.py

# Run individual unit test classes
pytest test-unit/test_crossword_generator.py::TestCrosswordGenerator::test_init -v

Integration Tests Details (test-integration/)

What they test: Complete workflows without mocking - real functionality

  • test_simple_generation.py - End-to-end crossword generation
  • test_boundary_fix.py - Word boundary validation (our major fix!)
  • test_local.py - Local environment and dependencies
  • test_word_boundaries.py - Comprehensive boundary testing
  • test_bounds_comprehensive.py - Advanced bounds checking
  • test_final_validation.py - API integration testing
  • And 10 more specialized feature tests...

Run Integration Tests (End-to-End Scripts)

# Test core functionality
python test-integration/test_simple_generation.py
python test-integration/test_boundary_fix.py
python test-integration/test_local.py

# Test specific features
python test-integration/test_word_boundaries.py
python test-integration/test_bounds_comprehensive.py

# Test API integration
python test-integration/test_final_validation.py

Test Coverage

# Run core tests with coverage (requires requirements-dev.txt)
pytest test-unit/test_crossword_generator.py --cov=src --cov-report=html
pytest test-unit/test_crossword_generator.py --cov=src --cov-report=term

# Full coverage report (may fail without AI dependencies)
pytest test-unit/ --cov=src --cov-report=html --ignore=test-unit/test_vector_search.py

Test Status

  • βœ… Core crossword generation: 15/19 unit tests passing
  • βœ… Boundary validation: All integration tests passing
  • ⚠️ AI/Vector search: Requires torch dependencies
  • ⚠️ Some async mocking: Minor test infrastructure issues

🔄 Migration Guide (For Existing Developers)

If you were using the old commands, update them as follows:

Old Command | New Command
----------- | -----------
pytest tests/ | pytest test-unit/
python test_simple_generation.py | python test-integration/test_simple_generation.py
pytest tests/ --cov=src | pytest test-unit/ --cov=src

All functionality is preserved - just organized better!

🔧 Configuration

Environment Variables

The backend supports flexible configuration via environment variables:

# Cache Configuration
CACHE_DIR=/app/cache                        # Cache directory for all service files
THEMATIC_VOCAB_SIZE_LIMIT=50000            # Maximum vocabulary size (default: 100000)
THEMATIC_MODEL_NAME=all-mpnet-base-v2      # Sentence transformer model

# Core Application Settings  
PORT=7860                                  # Server port
NODE_ENV=production                        # Environment mode

# Optional
LOG_LEVEL=INFO                            # Logging level
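
A rough sketch of how these variables might be read at startup; the defaults shown mirror the values documented above, and the Python variable names are illustrative:

# Illustrative startup configuration (defaults taken from the values above)
import os

CACHE_DIR = os.environ.get("CACHE_DIR", "/app/cache")
VOCAB_SIZE_LIMIT = int(os.environ.get("THEMATIC_VOCAB_SIZE_LIMIT", "100000"))
MODEL_NAME = os.environ.get("THEMATIC_MODEL_NAME", "all-mpnet-base-v2")
PORT = int(os.environ.get("PORT", "7860"))
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")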

Cache Structure

The service creates the following cache files:

{CACHE_DIR}/
├── vocabulary_{size}.pkl              # Processed vocabulary words
├── frequencies_{size}.pkl             # Word frequency data
├── embeddings_{model}_{size}.npy      # Word embeddings
└── sentence-transformers/             # Hugging Face model cache
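
A sketch of the read/write pattern these file names imply, using pickle for the word lists and numpy for the embedding matrix; the real service may differ in detail:

# Cache helpers matching the naming scheme above (illustrative only)
import os
import pickle
import numpy as np

def cache_paths(cache_dir: str, size: int, model: str) -> dict:
    return {
        "vocab": os.path.join(cache_dir, f"vocabulary_{size}.pkl"),
        "freqs": os.path.join(cache_dir, f"frequencies_{size}.pkl"),
        "embeddings": os.path.join(cache_dir, f"embeddings_{model}_{size}.npy"),
    }

def save_cache(cache_dir, size, model, vocab, freqs, embeddings):
    os.makedirs(cache_dir, exist_ok=True)
    paths = cache_paths(cache_dir, size, model)
    with open(paths["vocab"], "wb") as f:
        pickle.dump(vocab, f)
    with open(paths["freqs"], "wb") as f:
        pickle.dump(freqs, f)
    np.save(paths["embeddings"], embeddings)

def load_cache(cache_dir, size, model):
    paths = cache_paths(cache_dir, size, model)
    if not all(os.path.exists(p) for p in paths.values()):
        return None  # cache miss: caller falls back to full generation
    with open(paths["vocab"], "rb") as f:
        vocab = pickle.load(f)
    with open(paths["freqs"], "rb") as f:
        freqs = pickle.load(f)
    return vocab, freqs, np.load(paths["embeddings"])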

🎯 Thematic Word Generation Process

  1. Initialization:
    • Load WordFreq vocabulary database (319K words)
    • Filter words for crossword suitability (length, content)
    • Load sentence-transformers model locally
    • Pre-compute embeddings for filtered vocabulary
    • Create 10-tier frequency classification system
  2. Word Generation (see the code sketch after this list):
    • Get topic embedding: "Animals" → [768-dim vector]
    • Compute cosine similarity with all vocabulary embeddings
    • Filter by similarity threshold and difficulty tier
    • Filter by crossword-specific criteria (length, etc.)
    • Return top matches with generated clues
  3. Multi-Theme Support:
    • Detect multiple themes using clustering
    • Generate words that relate to combined themes
    • Balance word selection across different topics

🧪 Testing

# Local testing (without full vector search)
cd backend-py
python test-integration/test_local.py

# Start development server
python app.py

🐳 Container Deployment

Docker Run with Cache Configuration

# Basic deployment
docker run -e CACHE_DIR=/app/cache \
           -e THEMATIC_VOCAB_SIZE_LIMIT=50000 \
           -v /host/cache:/app/cache \
           -p 7860:7860 \
           your-crossword-app

# With all configuration options
docker run -e CACHE_DIR=/app/cache \
           -e THEMATIC_VOCAB_SIZE_LIMIT=25000 \
           -e THEMATIC_MODEL_NAME=all-mpnet-base-v2 \
           -e NODE_ENV=production \
           -v /host/cache:/app/cache \
           -p 7860:7860 \
           your-crossword-app

Docker Compose

version: '3.8'
services:
  crossword-backend:
    image: your-crossword-app
    environment:
      - CACHE_DIR=/app/cache
      - THEMATIC_VOCAB_SIZE_LIMIT=50000
      - THEMATIC_MODEL_NAME=all-mpnet-base-v2
      - NODE_ENV=production
    volumes:
      - ./cache:/app/cache
    ports:
      - "7860:7860"
    restart: unless-stopped

Pre-built Cache Strategy (Recommended)

For production deployments, pre-build the cache to avoid long startup times:

# 1. Build cache locally or in a build container
export CACHE_DIR=/local/cache
export THEMATIC_VOCAB_SIZE_LIMIT=50000
python -c "from src.services.thematic_word_service import ThematicWordService; s=ThematicWordService(); s.initialize()"

# 2. Deploy with pre-built cache (read-only mount)
docker run -e CACHE_DIR=/app/cache \
           -v /local/cache:/app/cache:ro \
           -p 7860:7860 \
           your-crossword-app

Debugging Cache Issues

If cache files are not being created in your container:

  1. Check Health Endpoints:
# Basic health check
curl http://localhost:7860/api/health

# Detailed cache status
curl http://localhost:7860/api/health/cache

# Force cache re-initialization
curl -X POST http://localhost:7860/api/health/cache/reinitialize
  2. Check Container Logs:
docker logs your-container-name

Look for cache directory permissions and initialization messages.

  3. Test Cache Directory:
# Run test script to verify cache setup
docker exec your-container python test_cache_startup.py
  4. Common Issues:

    • Permission denied: Container user can't write to mounted volume
    • Missing dependencies: ML libraries not installed in container
    • Volume not mounted: Cache directory not properly mounted
    • Environment variables: CACHE_DIR not set correctly
  5. Fix Permission Issues:

# Option 1: Change ownership of host directory
sudo chown -R 1000:1000 /host/cache

# Option 2: Run container with specific user
docker run --user 1000:1000 ...

# Option 3: Set permissions in Dockerfile
RUN mkdir -p /app/cache && chmod 777 /app/cache
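
To script the endpoint checks from step 1, a small stdlib-only helper can poll the health routes shown above; the JSON these endpoints return is not documented here, so the script simply prints it:

# check_health.py - illustrative helper for the /api/health endpoints above
import json
import urllib.request

BASE_URL = "http://localhost:7860"

def check(path: str) -> None:
    try:
        with urllib.request.urlopen(BASE_URL + path, timeout=10) as resp:
            body = json.loads(resp.read().decode())
            print(f"{path}: HTTP {resp.status}")
            print(json.dumps(body, indent=2))
    except Exception as exc:  # connection refused, timeout, non-JSON body, etc.
        print(f"{path}: FAILED ({exc})")

check("/api/health")
check("/api/health/cache")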

Kubernetes Deployment

apiVersion: v1
kind: ConfigMap
metadata:
  name: crossword-config
data:
  CACHE_DIR: "/app/cache"
  THEMATIC_VOCAB_SIZE_LIMIT: "50000"
  THEMATIC_MODEL_NAME: "all-mpnet-base-v2"
  NODE_ENV: "production"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: crossword-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crossword-backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: crossword-backend
  template:
    metadata:
      labels:
        app: crossword-backend
    spec:
      containers:
      - name: backend
        image: your-crossword-app
        envFrom:
        - configMapRef:
            name: crossword-config
        volumeMounts:
        - name: cache-volume
          mountPath: /app/cache
        ports:
        - containerPort: 7860
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: crossword-cache

🧪 Testing

Quick Test

# Basic functionality test (no model download)
python test-integration/test_local.py

Comprehensive Unit Tests

# Run all unit tests
python run_tests.py

# Or use pytest directly
pytest test-unit/ -v

# Run specific test file
python run_tests.py crossword_generator
pytest test-unit/test_crossword_generator.py -v

# Run with coverage
pytest test-unit/ --cov=src --cov-report=html

Test Structure

  • tests/test_crossword_generator_fixed.py - Core grid generation logic
  • tests/test_vector_search.py - Vector similarity search
  • tests/test_crossword_generator_wrapper.py - Service wrapper
  • tests/test_api_routes.py - FastAPI endpoints

Key Test Features

  • βœ… Index alignment fix: Tests the list index out of range bug fix
  • βœ… Mocked vector search: Tests without downloading models
  • βœ… API validation: Tests all endpoints and error cases
  • βœ… Async support: Full pytest-asyncio integration
  • βœ… Error handling: Tests malformed inputs and edge cases

📊 Performance Comparison

Startup Time:

  • JavaScript: ~2 seconds
  • Python: ~30-60 seconds (model download + embedding generation)
  • Python (with cache): ~5-10 seconds

Word Quality:

  • JavaScript: Limited by static word lists (~100 words/topic)
  • Python: Rich thematic generation from 319K word database

Memory Usage:

  • JavaScript: ~100MB
  • Python: ~500MB-1GB (model + embeddings)
  • Cache Size: ~50-200MB per 50K vocabulary

API Response Time:

  • JavaScript: ~100ms (static word lookup)
  • Python: ~200-500ms (semantic similarity computation)

Cache Performance:

  • Vocabulary loading: ~1-2 seconds from cache vs 30+ seconds generation
  • Embeddings loading: ~2-5 seconds from cache vs 60+ seconds generation

🔄 Migration Strategy

  1. Phase 1 ✅: Basic Python backend structure
  2. Phase 2: Test vector search functionality
  3. Phase 3: Docker deployment and production testing
  4. Phase 4: Compare with JavaScript backend
  5. Phase 5: Production switch with rollback capability

🎯 Next Steps

  • Replace vector search with thematic word generation
  • Implement environment variable cache configuration
  • Add 10-tier difficulty system based on word frequency
  • Optimize embedding computation performance
  • Add more sophisticated crossword grid generation
  • Implement LLM-based clue generation
  • Add cache warming strategies for production deployment