Python Backend with Thematic AI Word Generation

This is the Python implementation of the crossword generator backend, featuring AI-powered thematic word generation using WordFreq vocabulary and semantic embeddings.

🚀 Features

Thematic Word Generation: Uses sentence-transformers for semantic word discovery from WordFreq vocabulary
319K+ Word Database: Comprehensive vocabulary from WordFreq with frequency data
10-Tier Difficulty System: Smart word selection based on frequency tiers
Environment Variable Configuration: Flexible cache and model configuration
FastAPI: Modern, fast Python web framework
Same API: Compatible with existing React frontend
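
As a rough illustration of the 10-tier difficulty idea, the sketch below ranks words by frequency and splits the ranking into equal-sized tiers. The helper name `assign_tiers`, the equal-size split, and the toy frequencies are assumptions for illustration only; the service's actual tier boundaries are not documented here.

```python
def assign_tiers(frequencies, num_tiers=10):
    """Map each word to a tier from 1 (most frequent) to num_tiers (rarest).

    Hypothetical helper: splits the frequency-ranked word list into
    equal-sized groups. The real service may draw its boundaries differently.
    """
    ranked = sorted(frequencies, key=frequencies.get, reverse=True)
    tiers = {}
    for rank, word in enumerate(ranked):
        # rank * num_tiers // len(ranked) falls in 0..num_tiers-1
        tiers[word] = rank * num_tiers // len(ranked) + 1
    return tiers


# Toy frequencies, made up for the example (not WordFreq values).
toy_frequencies = {"the": 0.05, "cat": 0.001, "zebra": 0.00001, "quokka": 0.0000001}
toy_tiers = assign_tiers(toy_frequencies, num_tiers=2)
print(toy_tiers)  # {'the': 1, 'cat': 1, 'zebra': 2, 'quokka': 2}
```

An easy puzzle would then draw from low tier numbers and a hard puzzle from high ones.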

🔄 Differences from JavaScript Backend

| Feature | JavaScript Backend | Python Backend |
|---|---|---|
| Word Generation | Static word lists | Thematic AI word generation from 319K vocabulary |
| Vocabulary Size | ~100 words per topic | Filtered from 319K WordFreq database |
| AI Approach | Basic filtering | Semantic similarity with frequency tiers |
| Performance | Fast but limited | Slower startup, richer word selection |
| Dependencies | Node.js + static files | Python + ML libraries |

🛠️ Setup & Installation

Prerequisites

Python 3.11+ (3.11 recommended for Docker compatibility)
pip (Python package manager)

Basic Setup (Core Functionality)

# Clone and navigate to backend directory
cd crossword-app/backend-py

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install core dependencies
pip install -r requirements.txt

# Start the server
python app.py

Full Development Setup (with AI features)

# Install development dependencies including AI/ML libraries
pip install -r requirements-dev.txt

# This includes:
# - All core dependencies
# - AI/ML libraries (torch, sentence-transformers, etc.)
# - Development tools (pytest, coverage, etc.)

Requirements Files

requirements.txt: Core dependencies for basic functionality
requirements-dev.txt: Full development environment with AI features

Note: The AI/ML dependencies are large (~2GB). For basic testing without AI features, use requirements.txt only.

Python Version: Both local development and Docker use Python 3.11+ for optimal performance and latest package compatibility.

📁 Structure

backend-py/
├── app.py                          # FastAPI application entry point
├── requirements.txt                # Core Python dependencies
├── requirements-dev.txt            # Full development dependencies
├── src/
│   ├── services/
│   │   ├── thematic_word_service.py    # Thematic AI word generation
│   │   ├── crossword_generator.py      # Puzzle generation logic
│   │   └── crossword_generator_wrapper.py  # Service wrapper
│   └── routes/
│       └── api.py                      # API endpoints (matches JS backend)
├── test-unit/                      # Unit tests (pytest framework) - 5 files
│   ├── test_crossword_generator.py
│   ├── test_api_routes.py
│   └── test_vector_search.py
├── test-integration/               # Integration tests (standalone scripts) - 16 files
│   ├── test_simple_generation.py
│   ├── test_boundary_fix.py
│   └── test_local.py               # (+ 13 more test files)
├── data/ -> ../backend/data/       # Symlink to shared word data
└── public/                         # Frontend static files (copied during build)

🛠 Dependencies

Core ML Stack

sentence-transformers: Local model loading and embeddings
wordfreq: 319K word vocabulary with frequency data
torch: PyTorch for model inference
scikit-learn: Cosine similarity and clustering
numpy: Vector operations

Web Framework

fastapi: Modern Python web framework
uvicorn: ASGI server
pydantic: Data validation

Testing

pytest: Testing framework
pytest-asyncio: Async test support

🧪 Testing

📁 Test Organization (Reorganized for Clarity)

We've reorganized the test structure for better developer experience:

| Test Type | Location | Purpose | Framework | Count |
|---|---|---|---|---|
| Unit Tests | `test-unit/` | Test individual components in isolation | pytest | 5 files |
| Integration Tests | `test-integration/` | Test complete workflows end-to-end | Standalone scripts | 16 files |

Benefits of this structure:

✅ Clear separation between unit and integration testing
✅ Intuitive naming - developers immediately understand test types
✅ Better tooling - can run different test types independently
✅ Easier maintenance - organized by testing strategy

Note: Previously, tests were split between a tests/ folder and root-level test_*.py files. The new structure separates them by testing strategy.

Unit Tests Details (`test-unit/`)

What they test: Individual components with mocking and isolation

test_crossword_generator.py - Core crossword generation logic
test_api_routes.py - FastAPI endpoint handlers
test_crossword_generator_wrapper.py - Service wrapper layer
test_index_bug_fix.py - Specific bug fix validations
test_vector_search.py - AI vector search functionality (requires torch)

Run Unit Tests (Formal Test Suite)

# Run all unit tests
python run_tests.py

# Run specific test modules  
python run_tests.py crossword_generator
pytest test-unit/test_crossword_generator.py -v

# Run core tests (excluding AI dependencies)
pytest test-unit/ -v --ignore=test-unit/test_vector_search.py

# Run individual unit test classes
pytest test-unit/test_crossword_generator.py::TestCrosswordGenerator::test_init -v

Integration Tests Details (`test-integration/`)

What they test: Complete workflows without mocking - real functionality

test_simple_generation.py - End-to-end crossword generation
test_boundary_fix.py - Word boundary validation (our major fix!)
test_local.py - Local environment and dependencies
test_word_boundaries.py - Comprehensive boundary testing
test_bounds_comprehensive.py - Advanced bounds checking
test_final_validation.py - API integration testing
And 10 more specialized feature tests...

Run Integration Tests (End-to-End Scripts)

# Test core functionality
python test-integration/test_simple_generation.py
python test-integration/test_boundary_fix.py
python test-integration/test_local.py

# Test specific features
python test-integration/test_word_boundaries.py
python test-integration/test_bounds_comprehensive.py

# Test API integration
python test-integration/test_final_validation.py

Test Coverage

# Run core tests with coverage (requires requirements-dev.txt)
pytest test-unit/test_crossword_generator.py --cov=src --cov-report=html
pytest test-unit/test_crossword_generator.py --cov=src --cov-report=term

# Full coverage report (may fail without AI dependencies)
pytest test-unit/ --cov=src --cov-report=html --ignore=test-unit/test_vector_search.py

Test Status

✅ Core crossword generation: 15/19 unit tests passing
✅ Boundary validation: All integration tests passing
⚠️ AI/Vector search: Requires torch dependencies
⚠️ Some async mocking: Minor test infrastructure issues

🔄 Migration Guide (For Existing Developers)

If you had previous commands, update them:

| Old Command | New Command |
|---|---|
| `pytest tests/` | `pytest test-unit/` |
| `python test_simple_generation.py` | `python test-integration/test_simple_generation.py` |
| `pytest tests/ --cov=src` | `pytest test-unit/ --cov=src` |

All functionality is preserved - just organized better!

🔧 Configuration

Environment Variables

The backend supports flexible configuration via environment variables:

# Cache Configuration
CACHE_DIR=/app/cache                        # Cache directory for all service files
THEMATIC_VOCAB_SIZE_LIMIT=50000            # Maximum vocabulary size (default: 100000)
THEMATIC_MODEL_NAME=all-mpnet-base-v2      # Sentence transformer model

# Core Application Settings  
PORT=7860                                  # Server port
NODE_ENV=production                        # Environment mode

# Optional
LOG_LEVEL=INFO                            # Logging level
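
A minimal sketch of how these variables might be read on the Python side. The `Settings` class and `load_settings` function are hypothetical names, not part of the backend. Only the vocabulary default (100000) is stated above; the other fallback values are assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    cache_dir: str
    vocab_size_limit: int
    model_name: str
    port: int


def load_settings(env=None):
    """Read settings from a mapping (defaults to os.environ).

    Only the vocabulary default (100000) comes from the README; the other
    fallbacks are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    return Settings(
        cache_dir=env.get("CACHE_DIR", "./cache"),
        vocab_size_limit=int(env.get("THEMATIC_VOCAB_SIZE_LIMIT", "100000")),
        model_name=env.get("THEMATIC_MODEL_NAME", "all-mpnet-base-v2"),
        port=int(env.get("PORT", "7860")),
    )


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"CACHE_DIR": "/app/cache", "THEMATIC_VOCAB_SIZE_LIMIT": "50000"})
print(settings)
```

Accepting the environment as a parameter makes the loader easy to exercise in unit tests without touching `os.environ`.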

Cache Structure

The service creates the following cache files:

{CACHE_DIR}/
├── vocabulary_{size}.pkl              # Processed vocabulary words
├── frequencies_{size}.pkl             # Word frequency data
├── embeddings_{model}_{size}.npy      # Word embeddings
└── sentence-transformers/             # Hugging Face model cache
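
The naming scheme above can be expressed as a small path helper, which is also handy for checking which files are still missing. This is a sketch under assumptions: `cache_paths` and `missing_cache_files` are hypothetical helpers, and it assumes `{model}` is the model name used verbatim and `{size}` is the vocabulary limit.

```python
from pathlib import Path


def cache_paths(cache_dir, model_name, vocab_size):
    """Build the expected cache file paths (assumed naming, see lead-in)."""
    root = Path(cache_dir)
    return {
        "vocabulary": root / f"vocabulary_{vocab_size}.pkl",
        "frequencies": root / f"frequencies_{vocab_size}.pkl",
        "embeddings": root / f"embeddings_{model_name}_{vocab_size}.npy",
    }


def missing_cache_files(cache_dir, model_name, vocab_size):
    """Return the keys of expected cache files that do not exist on disk."""
    paths = cache_paths(cache_dir, model_name, vocab_size)
    return [name for name, path in paths.items() if not path.exists()]


paths = cache_paths("/app/cache", "all-mpnet-base-v2", 50000)
for name, path in paths.items():
    print(f"{name}: {path.name}")
```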

🎯 Thematic Word Generation Process

Initialization:
- Load WordFreq vocabulary database (319K words)
- Filter words for crossword suitability (length, content)
- Load sentence-transformers model locally
- Pre-compute embeddings for filtered vocabulary
- Create 10-tier frequency classification system

Word Generation:
- Get topic embedding: "Animals" → [768-dim vector]
- Compute cosine similarity with all vocabulary embeddings
- Filter by similarity threshold and difficulty tier
- Filter by crossword-specific criteria (length, etc.)
- Return top matches with generated clues

Multi-Theme Support:
- Detect multiple themes using clustering
- Generate words that relate to combined themes
- Balance word selection across different topics
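
The word generation steps can be sketched with toy vectors. This is an illustration, not the service's code: `rank_words`, its thresholds, and the 3-dimensional "embeddings" are assumptions, and plain Python lists stand in for the model output so the example runs without any ML dependencies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_words(topic_vec, word_vecs, tiers, max_tier,
               threshold=0.3, min_len=3, max_len=15, top_k=5):
    """Return up to top_k (word, score) pairs, best match first."""
    scored = []
    for word, vec in word_vecs.items():
        if not (min_len <= len(word) <= max_len):
            continue  # crossword-specific length filter
        if tiers.get(word, max_tier + 1) > max_tier:
            continue  # too rare for this difficulty; words without a tier are skipped
        score = cosine_similarity(topic_vec, vec)
        if score >= threshold:
            scored.append((word, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data, made up for the example.
topic = [1.0, 0.0, 0.0]
vectors = {
    "tiger": [0.9, 0.1, 0.0],
    "zebra": [0.8, 0.0, 0.6],
    "table": [0.0, 1.0, 0.0],
    "ox": [1.0, 0.0, 0.0],
}
word_tiers = {"tiger": 2, "zebra": 5, "table": 1, "ox": 1}
result = rank_words(topic, vectors, word_tiers, max_tier=5)
print([word for word, _ in result])  # ['tiger', 'zebra']
```

Here "table" is dropped by the similarity threshold and "ox" by the length filter. With a real model, the topic and word vectors would come from the sentence-transformers encoder instead of hand-written lists.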

🧪 Testing

# Local testing (without full vector search)
cd backend-py
python test-integration/test_local.py

# Start development server
python app.py

🐳 Container Deployment

Docker Run with Cache Configuration

# Basic deployment
docker run -e CACHE_DIR=/app/cache \
           -e THEMATIC_VOCAB_SIZE_LIMIT=50000 \
           -v /host/cache:/app/cache \
           -p 7860:7860 \
           your-crossword-app

# With all configuration options
docker run -e CACHE_DIR=/app/cache \
           -e THEMATIC_VOCAB_SIZE_LIMIT=25000 \
           -e THEMATIC_MODEL_NAME=all-mpnet-base-v2 \
           -e NODE_ENV=production \
           -v /host/cache:/app/cache \
           -p 7860:7860 \
           your-crossword-app

Docker Compose

version: '3.8'
services:
  crossword-backend:
    image: your-crossword-app
    environment:
      - CACHE_DIR=/app/cache
      - THEMATIC_VOCAB_SIZE_LIMIT=50000
      - THEMATIC_MODEL_NAME=all-mpnet-base-v2
      - NODE_ENV=production
    volumes:
      - ./cache:/app/cache
    ports:
      - "7860:7860"
    restart: unless-stopped

Pre-built Cache Strategy (Recommended)

For production deployments, pre-build the cache to avoid long startup times:

# 1. Build cache locally or in a build container
export CACHE_DIR=/local/cache
export THEMATIC_VOCAB_SIZE_LIMIT=50000
python -c "from src.services.thematic_word_service import ThematicWordService; s=ThematicWordService(); s.initialize()"

# 2. Deploy with pre-built cache (read-only mount)
docker run -e CACHE_DIR=/app/cache \
           -v /local/cache:/app/cache:ro \
           -p 7860:7860 \
           your-crossword-app

Debugging Cache Issues

If cache files are not being created in your container:

Check Health Endpoints:

# Basic health check
curl http://localhost:7860/api/health

# Detailed cache status
curl http://localhost:7860/api/health/cache

# Force cache re-initialization
curl -X POST http://localhost:7860/api/health/cache/reinitialize
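
The same checks can be scripted from Python using only the standard library. This is a hedged sketch: `health_url` and `fetch_health` are hypothetical helpers, and since the response format is not documented here, the body is returned as raw text rather than parsed.

```python
import urllib.request

# Endpoint paths taken from the curl commands above.
HEALTH_PATHS = {
    "health": "/api/health",
    "cache": "/api/health/cache",
}


def health_url(base_url, check="health"):
    """Join the server base URL with one of the known health paths."""
    return base_url.rstrip("/") + HEALTH_PATHS[check]


def fetch_health(base_url, check="health", timeout=5.0):
    """GET a health endpoint and return (status_code, body_text).

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses and
    urllib.error.URLError when the server is unreachable.
    """
    with urllib.request.urlopen(health_url(base_url, check), timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


print(health_url("http://localhost:7860", "cache"))
```

With the server running, `fetch_health("http://localhost:7860", "cache")` would perform the request that the second curl command makes.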

Check Container Logs:

docker logs your-container-name

Look for cache directory permissions and initialization messages.

Test Cache Directory:

# Run test script to verify cache setup
docker exec your-container python test_cache_startup.py

Common Issues:
- Permission denied: Container user can't write to mounted volume
- Missing dependencies: ML libraries not installed in container
- Volume not mounted: Cache directory not properly mounted
- Environment variables: CACHE_DIR not set correctly

Fix Permission Issues:

# Option 1: Change ownership of host directory
sudo chown -R 1000:1000 /host/cache

# Option 2: Run container with specific user
docker run --user 1000:1000 ...

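# Option 3: Set permissions in the Dockerfile (this is a Dockerfile instruction, not a shell command)
# Note: 777 makes the directory world-writable; prefer Option 1 or 2 where possible
RUN mkdir -p /app/cache && chmod 777 /app/cache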
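
To confirm whether a permission fix worked, a short write probe can be run inside the container. This is a standalone sketch, not the `test_cache_startup.py` script mentioned above; `check_cache_dir` is a hypothetical helper.

```python
import os
import tempfile


def check_cache_dir(path):
    """Return (ok, message) describing whether `path` is usable as a cache directory."""
    if not os.path.isdir(path):
        return False, f"{path} does not exist or is not a directory"
    try:
        # Create and delete a real file instead of trusting permission bits,
        # which may not reflect what a mounted volume actually allows.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        return False, f"cannot write to {path}: {exc}"
    return True, f"{path} is writable"


# The system temp directory stands in for CACHE_DIR in this example.
ok, message = check_cache_dir(tempfile.gettempdir())
print(ok, message)
```

Inside a container, the probe would be pointed at the mounted directory, for example `check_cache_dir("/app/cache")`.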

Kubernetes Deployment

apiVersion: v1
kind: ConfigMap
metadata:
  name: crossword-config
data:
  CACHE_DIR: "/app/cache"
  THEMATIC_VOCAB_SIZE_LIMIT: "50000"
  THEMATIC_MODEL_NAME: "all-mpnet-base-v2"
  NODE_ENV: "production"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: crossword-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crossword-backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: crossword-backend
  template:
    metadata:
      labels:
        app: crossword-backend
    spec:
      containers:
      - name: backend
        image: your-crossword-app
        envFrom:
        - configMapRef:
            name: crossword-config
        volumeMounts:
        - name: cache-volume
          mountPath: /app/cache
        ports:
        - containerPort: 7860
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: crossword-cache

🧪 Testing

Quick Test

# Basic functionality test (no model download)
python test-integration/test_local.py

Comprehensive Unit Tests

# Run all unit tests
python run_tests.py

# Or use pytest directly
pytest test-unit/ -v

# Run specific test file
python run_tests.py crossword_generator
pytest test-unit/test_crossword_generator.py -v

# Run with coverage
pytest test-unit/ --cov=src --cov-report=html

Test Structure

test-unit/test_crossword_generator.py - Core grid generation logic
test-unit/test_vector_search.py - Vector similarity search
test-unit/test_crossword_generator_wrapper.py - Service wrapper
test-unit/test_api_routes.py - FastAPI endpoints

Key Test Features

✅ Index alignment fix: Tests the list index out of range bug fix
✅ Mocked vector search: Tests without downloading models
✅ API validation: Tests all endpoints and error cases
✅ Async support: Full pytest-asyncio integration
✅ Error handling: Tests malformed inputs and edge cases
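
The mocking approach can be illustrated with a self-contained example. Everything here is hypothetical: `pick_words` and the `find_words` method are stand-ins invented for the sketch, not names from the backend. The point is only that a `Mock` replaces the model-backed service, so the test runs without downloading anything.

```python
from unittest.mock import Mock


def pick_words(word_service, topic, count):
    """Hypothetical function under test: ask a service for candidates, keep `count`."""
    candidates = word_service.find_words(topic)
    return [word.upper() for word in candidates[:count]]


def test_pick_words_uses_mocked_service():
    # The Mock stands in for a service that would normally load an ML model.
    service = Mock()
    service.find_words.return_value = ["cat", "dog", "emu"]

    assert pick_words(service, "Animals", 2) == ["CAT", "DOG"]
    service.find_words.assert_called_once_with("Animals")


# pytest would collect the test function automatically; calling it directly also works.
test_pick_words_uses_mocked_service()
```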

📊 Performance Comparison

Startup Time:

JavaScript: ~2 seconds
Python: ~30-60 seconds (model download + embedding generation)
Python (with cache): ~5-10 seconds

Word Quality:

JavaScript: Limited by static word lists (~100 words/topic)
Python: Rich thematic generation from 319K word database

Memory Usage:

JavaScript: ~100MB
Python: ~500MB-1GB (model + embeddings)
Cache Size: ~50-200MB per 50K vocabulary

API Response Time:

JavaScript: ~100ms (static word lookup)
Python: ~200-500ms (semantic similarity computation)

Cache Performance:

Vocabulary loading: ~1-2 seconds from cache vs 30+ seconds generation
Embeddings loading: ~2-5 seconds from cache vs 60+ seconds generation
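
The gap between cached and uncached startup comes from a load-or-build pattern, sketched below with pickle. `load_or_build` is a hypothetical helper and the three-word vocabulary is toy data. The real service also stores embeddings as `.npy` files, which this sketch does not cover.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(path, build):
    """Return the cached object at `path`, or call `build()` and cache its result.

    Only unpickle files you created yourself; pickle is not safe for untrusted data.
    """
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    value = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(value, fh)
    return value


build_calls = []


def build_vocabulary():
    build_calls.append(1)  # count how often the expensive step runs
    return ["cat", "dog", "emu"]


with tempfile.TemporaryDirectory() as cache_dir:
    target = Path(cache_dir) / "vocabulary_3.pkl"
    first = load_or_build(target, build_vocabulary)   # builds and writes the file
    second = load_or_build(target, build_vocabulary)  # read back from the cache file

print(first == second, len(build_calls))  # True 1
```

The second call never runs `build_vocabulary`, which is the effect the cached startup figures above rely on.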

🔄 Migration Strategy

Phase 1 ✅: Basic Python backend structure
Phase 2: Test vector search functionality
Phase 3: Docker deployment and production testing
Phase 4: Compare with JavaScript backend
Phase 5: Production switch with rollback capability

🎯 Next Steps

Replace vector search with thematic word generation
Implement environment variable cache configuration
Add 10-tier difficulty system based on word frequency
Optimize embedding computation performance
Add more sophisticated crossword grid generation
Implement LLM-based clue generation
Add cache warming strategies for production deployment
